Multi30k Dataset
Multi30k is a multilingual image description dataset with English, German, French, and Czech text, split into training, validation, and test sets. For each split and language it reports the number of sentences, the total word count, and the average number of words per sentence. Pre-extracted visual features are available for download, and the original images are available on request.
Description
Dataset Overview
Name: Multi30k Data Repository
Data Structure:
- Task 1:
- Raw files: located at data/task1/raw
- Tokenized files: located at data/task1/tok, preprocessed using the script scripts/task1-tokenize.sh
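As a rough illustration of how the raw Task 1 files might be consumed, the sketch below reads two languages into aligned sentence pairs. The file naming (`<split>.<lang>`), the one-sentence-per-line layout, and the line-by-line alignment across languages are assumptions made for this example, and `read_parallel` is a hypothetical helper rather than part of the repository; check the actual contents of data/task1/raw before relying on it.

```python
from pathlib import Path


def read_parallel(data_dir, split, src_lang, tgt_lang):
    """Return a list of (source, target) sentence pairs for one split.

    Assumption: files are named '<split>.<lang>' (hypothetical), are plain
    UTF-8 text with one sentence per line, and line i of every language
    file describes the same image.
    """
    data_dir = Path(data_dir)
    src_path = data_dir / f"{split}.{src_lang}"
    tgt_path = data_dir / f"{split}.{tgt_lang}"

    with open(src_path, encoding="utf-8") as f_src:
        src_lines = [line.rstrip("\n") for line in f_src]
    with open(tgt_path, encoding="utf-8") as f_tgt:
        tgt_lines = [line.rstrip("\n") for line in f_tgt]

    # Parallel files must have the same number of lines to stay aligned.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{src_path} has {len(src_lines)} lines, "
            f"{tgt_path} has {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))
```

Under these assumptions, a call such as `read_parallel("data/task1/raw", "train", "en", "de")` would return one pair per training sentence. If the files are compressed, they would need to be decompressed first or opened with `gzip.open` instead.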
Dataset Statistics:
- Training Set:
- English (en): 29,000 sentences, 377,534 words, average 13.0 words per sentence
- German (de): 29,000 sentences, 360,706 words, average 12.4 words per sentence
- French (fr): 29,000 sentences, 409,845 words, average 14.1 words per sentence
- Czech (cs): 29,000 sentences, 297,212 words, average 10.2 words per sentence
- Validation Set:
- English (en): 1,014 sentences, 13,308 words, average 13.1 words per sentence
- German (de): 1,014 sentences, 12,828 words, average 12.7 words per sentence
- French (fr): 1,014 sentences, 14,381 words, average 14.2 words per sentence
- Czech (cs): 1,014 sentences, 10,342 words, average 10.2 words per sentence
- Test Set:
- 2016 Flickr:
- English (en): 1,000 sentences, 12,968 words, average 13.0 words per sentence
- German (de): 1,000 sentences, 12,103 words, average 12.1 words per sentence
- French (fr): 1,000 sentences, 13,988 words, average 14.0 words per sentence
- Czech (cs): 1,000 sentences, 10,497 words, average 10.5 words per sentence
- 2017 Flickr:
- English (en): 1,000 sentences, 11,376 words, average 11.4 words per sentence
- German (de): 1,000 sentences, 10,758 words, average 10.8 words per sentence
- French (fr): 1,000 sentences, 12,596 words, average 12.6 words per sentence
- 2017 MSCOCO:
- English (en): 461 sentences, 5,239 words, average 11.4 words per sentence
- German (de): 461 sentences, 5,158 words, average 11.2 words per sentence
- French (fr): 461 sentences, 5,710 words, average 12.4 words per sentence
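The three figures listed per language above are related by simple arithmetic: the average is the word count divided by the sentence count, rounded to one decimal place. The sketch below shows one way to compute them. It assumes words are counted by splitting on whitespace, which the source does not specify, and `corpus_stats` and `average_words` are hypothetical helper names.

```python
def average_words(n_words, n_sentences):
    """Average words per sentence, rounded to one decimal place."""
    return round(n_words / n_sentences, 1)


def corpus_stats(sentences):
    """Return (sentence count, word count, average words per sentence).

    Assumption: a "word" is any whitespace-separated token. The dataset's
    own counts may have been produced with a different tokenization.
    """
    n_sentences = len(sentences)
    n_words = sum(len(sentence.split()) for sentence in sentences)
    avg = average_words(n_words, n_sentences) if n_sentences else 0.0
    return n_sentences, n_words, avg


# Reproducing a reported figure: English training set, 377,534 words
# over 29,000 sentences.
english_train_avg = average_words(377534, 29000)
```

Applied to the reported totals, this reproduces the listed averages, for example 13.0 for the English training set and 11.4 for the English 2017 MSCOCO test set.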
Evaluation:
- Models can be evaluated on the 2018 test set via the Codalab competition.
Visual Features:
- Pre‑extracted visual features can be downloaded from Google Drive.
- Original images are available on request via this link.
Citation:
- English and German data:
@InProceedings{W16-3210,
  author = "Elliott, Desmond and Frank, Stella and Sima'an, Khalil and Specia, Lucia",
  title = "Multi30K: Multilingual English-German Image Descriptions",
  booktitle = "Proceedings of the 5th Workshop on Vision and Language",
  year = "2016",
  publisher = "Association for Computational Linguistics",
  pages = "70--74",
  location = "Berlin, Germany",
  doi = "10.18653/v1/W16-3210",
  url = "http://www.aclweb.org/anthology/W16-3210"
}
- French data, fuzzy COCO evaluation data and 2017 test data:
@InProceedings{elliott-EtAl:2017:WMT,
  author = {Elliott, Desmond and Frank, Stella and Barrault, Lo\"{i}c and Bougares, Fethi and Specia, Lucia},
  title = {Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month = {September},
  year = {2017},
  address = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages = {215--233},
  url = {http://www.aclweb.org/anthology/W17-4718}
}
- Czech data:
@inproceedings{barrault2018findings,
  title={Findings of the Third Shared Task on Multimodal Machine Translation},
  author={Barrault, Lo{\"i}c and Bougares, Fethi and Specia, Lucia and Lala, Chiraag and Elliott, Desmond and Frank, Stella},
  booktitle={Proceedings of the Third Conference on Machine Translation: Shared Task Papers},
  pages={304--323},
  year={2018}
}
Source
Organization: github
Created: 11/13/2019