Multi30k Dataset

Multi30k is a multilingual image description dataset. It began as an English-German resource and was later extended with French and Czech. The data is split into training, validation, and test sets, and the repository reports the number of sentences, the word count, and the average words per sentence for each split and language. Pre-extracted visual features can be downloaded, and the original images are available on request.

Updated 11/22/2019

Description

Dataset Overview

Name: Multi30k Data Repository

Data Structure:

  • Task 1:
    • Raw files: located at data/task1/raw
    • Tokenized files: located at data/task1/tok, preprocessed using the script scripts/task1-tokenize.sh
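
The files for a translation task are typically line-aligned, so line N of one language file corresponds to line N of another. The following is a minimal sketch of reading such a pair, assuming uncompressed UTF-8 text files with one sentence per line. The function names and the file names in the usage comment (train.en, train.de) are hypothetical and are not confirmed by the listing above.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def pair_lines(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned iterables."""
    for src, tgt in zip_longest(src_lines, tgt_lines):
        # zip_longest pads the shorter input with None, which exposes misalignment.
        if src is None or tgt is None:
            raise ValueError("inputs are not line-aligned (different lengths)")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_parallel(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Read two line-aligned text files and yield sentence pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_lines(src, tgt)


# Hypothetical usage; the file names are assumptions, not taken from the listing:
# for en, de in read_parallel(Path("data/task1/raw/train.en"),
#                             Path("data/task1/raw/train.de")):
#     print(en, "|||", de)
```

Raising an error on a length mismatch, rather than silently truncating as plain zip would, is a deliberate choice here: a missing line in one file would otherwise shift every following pair.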

Dataset Statistics:

  • Training Set:
    • English (en): 29,000 sentences, 377,534 words, average 13.0 words per sentence
    • German (de): 29,000 sentences, 360,706 words, average 12.4 words per sentence
    • French (fr): 29,000 sentences, 409,845 words, average 14.1 words per sentence
    • Czech (cs): 29,000 sentences, 297,212 words, average 10.2 words per sentence
  • Validation Set:
    • English (en): 1,014 sentences, 13,308 words, average 13.1 words per sentence
    • German (de): 1,014 sentences, 12,828 words, average 12.7 words per sentence
    • French (fr): 1,014 sentences, 14,381 words, average 14.2 words per sentence
    • Czech (cs): 1,014 sentences, 10,342 words, average 10.2 words per sentence
  • Test Set:
    • 2016 Flickr:
      • English (en): 1,000 sentences, 12,968 words, average 13.0 words per sentence
      • German (de): 1,000 sentences, 12,103 words, average 12.1 words per sentence
      • French (fr): 1,000 sentences, 13,988 words, average 14.0 words per sentence
      • Czech (cs): 1,000 sentences, 10,497 words, average 10.5 words per sentence
    • 2017 Flickr:
      • English (en): 1,000 sentences, 11,376 words, average 11.4 words per sentence
      • German (de): 1,000 sentences, 10,758 words, average 10.8 words per sentence
      • French (fr): 1,000 sentences, 12,596 words, average 12.6 words per sentence
    • 2017 MSCOCO:
      • English (en): 461 sentences, 5,239 words, average 11.4 words per sentence
      • German (de): 461 sentences, 5,158 words, average 11.2 words per sentence
      • French (fr): 461 sentences, 5,710 words, average 12.4 words per sentence
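
The three figures reported for each language (sentences, words, average words per sentence) can be recomputed from any one-sentence-per-line file. The sketch below assumes that a "word" is a whitespace-separated token and that the average is rounded to one decimal place; the listing does not state how the published counts were produced, so results on the actual files may differ. The names CorpusStats and corpus_stats are illustrative.

```python
from typing import Iterable, NamedTuple


class CorpusStats(NamedTuple):
    sentences: int
    words: int
    avg_words: float


def corpus_stats(lines: Iterable[str]) -> CorpusStats:
    """Count sentences (lines) and whitespace-separated words."""
    sentences = 0
    words = 0
    for line in lines:
        sentences += 1
        words += len(line.split())  # assumption: words are whitespace tokens
    # Guard against an empty input to avoid division by zero.
    avg = round(words / sentences, 1) if sentences else 0.0
    return CorpusStats(sentences, words, avg)
```

As a consistency check on the listed figures, 377,534 words over 29,000 English training sentences gives about 13.02, which matches the reported average of 13.0.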

Evaluation:

  • Models can be evaluated on the 2018 test set through the CodaLab competition.

Visual Features:

  • Pre‑extracted visual features can be downloaded from Google Drive.
  • Original images are available on request via this link.

Citation:

  • English and German data:

    @InProceedings{W16-3210,
      author    = "Elliott, Desmond and Frank, Stella and Sima'an, Khalil and Specia, Lucia",
      title     = "Multi30K: Multilingual English-German Image Descriptions",
      booktitle = "Proceedings of the 5th Workshop on Vision and Language",
      year      = "2016",
      publisher = "Association for Computational Linguistics",
      pages     = "70--74",
      location  = "Berlin, Germany",
      doi       = "10.18653/v1/W16-3210",
      url       = "http://www.aclweb.org/anthology/W16-3210"
    }

  • French data, fuzzy COCO evaluation data and 2017 test data:

    @InProceedings{elliott-EtAl:2017:WMT,
      author    = {Elliott, Desmond and Frank, Stella and Barrault, Lo\"{i}c and Bougares, Fethi and Specia, Lucia},
      title     = {Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description},
      booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
      month     = {September},
      year      = {2017},
      address   = {Copenhagen, Denmark},
      publisher = {Association for Computational Linguistics},
      pages     = {215--233},
      url       = {http://www.aclweb.org/anthology/W17-4718}
    }

  • Czech data:

    @inproceedings{barrault2018findings,
      title     = {Findings of the Third Shared Task on Multimodal Machine Translation},
      author    = {Barrault, Lo{\"i}c and Bougares, Fethi and Specia, Lucia and Lala, Chiraag and Elliott, Desmond and Frank, Stella},
      booktitle = {Proceedings of the Third Conference on Machine Translation: Shared Task Papers},
      pages     = {304--323},
      year      = {2018}
    }

Topics

Multilingual Image Description
Machine Learning

Source

Organization: github

Created: 11/13/2019
