wmt/wmt18
The WMT18 dataset is a multilingual machine-translation corpus with parallel data for eight language-pair configurations, each pairing English with Czech, German, Estonian, Finnish, Kazakh, Russian, Turkish, or Chinese. Each configuration is divided into training, validation, and test splits whose sizes vary by language pair; the Kazakh-English configuration is listed but contains no data. Sources include Europarl, News Commentary, OPUS ParaCrawl, SETimes, and UN Multi. The dataset supports MT research by letting users choose their own language pairs and subsets to build custom corpora.
Description
Dataset Overview
Dataset Name: WMT18
Dataset ID: wmt-2018
Languages: cs, de, en, et, fi, kk, ru, tr, zh
License: Unknown
Multilinguality: Translation (parallel text in paired languages)
Size Category: 10M<n<100M
Source Datasets: Extended from europarl_bilingual, news_commentary, opus_paracrawl, setimes, and un_multi.
Task Category: Translation
Structure
Configuration Names and Language Pairs:
- cs-en: Czech and English
- de-en: German and English
- et-en: Estonian and English
- fi-en: Finnish and English
- kk-en: Kazakh and English
- ru-en: Russian and English
- tr-en: Turkish and English
- zh-en: Chinese and English
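A configuration name selects one language pair. The sketch below assumes the Hugging Face `datasets` library, the repository ID `wmt/wmt18` taken from the title of this card, and configuration names spelled with a plain ASCII hyphen. `config_name` and `load_pair` are hypothetical helpers written for illustration, not part of any library.

```python
# Language codes paired with English, as listed above.
NON_ENGLISH_LANGS = ["cs", "de", "et", "fi", "kk", "ru", "tr", "zh"]


def config_name(lang: str) -> str:
    """Build a configuration name such as 'cs-en' (hypothetical helper)."""
    if lang not in NON_ENGLISH_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{lang}-en"


def load_pair(lang: str, split: str = "validation"):
    """Load one split of one language pair.

    Assumes the `datasets` library is installed and that the corpus is
    published under the repository ID "wmt/wmt18" (an assumption based on
    this card's title). Requires network access when called.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset("wmt/wmt18", config_name(lang), split=split)
```

Under these assumptions, `load_pair("tr")` would request the Turkish-English validation split. Starting with a validation or test split is a cheap way to inspect the data before committing to a multi-gigabyte training download.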
Size and Split Details:
- cs-en:
  - Train: 11 046 024 examples, 1 461 007 346 bytes
  - Validation: 3 005 examples, 674 422 bytes
  - Test: 2 983 examples, 696 221 bytes
  - Download size: 738 874 648 bytes
  - Dataset size: 1 462 377 989 bytes
- de-en:
  - Train: 42 271 874 examples, 8 187 518 284 bytes
  - Validation: 3 004 examples, 729 511 bytes
  - Test: 2 998 examples, 757 641 bytes
  - Download size: 4 436 297 213 bytes
  - Dataset size: 8 189 005 436 bytes
- et-en:
  - Train: 2 175 873 examples, 647 990 923 bytes
  - Validation: 2 000 examples, 459 390 bytes
  - Test: 2 000 examples, 489 386 bytes
  - Download size: 283 931 426 bytes
  - Dataset size: 648 939 699 bytes
- fi-en:
  - Train: 3 280 600 examples, 857 169 249 bytes
  - Validation: 6 004 examples, 1 388 820 bytes
  - Test: 3 000 examples, 691 833 bytes
  - Download size: 488 708 706 bytes
  - Dataset size: 859 249 902 bytes
- kk-en: No data provided (download size 0 bytes, dataset size 0 bytes).
- ru-en:
  - Train: 36 858 512 examples, 13 665 338 159 bytes
  - Validation: 3 001 examples, 1 040 187 bytes
  - Test: 3 000 examples, 1 085 588 bytes
  - Download size: 6 130 744 133 bytes
  - Dataset size: 13 667 463 934 bytes
- tr-en:
  - Train: 205 756 examples, 60 416 449 bytes
  - Validation: 3 007 examples, 752 765 bytes
  - Test: 3 000 examples, 770 305 bytes
  - Download size: 37 733 844 bytes
  - Dataset size: 61 939 519 bytes
- zh-en:
  - Train: 25 160 346 examples, 6 342 987 000 bytes
  - Validation: 2 001 examples, 540 339 bytes
  - Test: 3 981 examples, 1 107 514 bytes
  - Download size: 3 581 074 494 bytes
  - Dataset size: 6 344 634 853 bytes
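For each pair, the reported dataset size equals the sum of the three split sizes. The sketch below checks this for two pairs using the byte counts copied from the list above, and also computes the download size as a fraction of the dataset size. The `PairSizes` class is illustrative only, and reading that fraction as a compression ratio is an assumption, since the card does not say how the download is packaged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairSizes:
    """Byte counts for one language pair (illustrative container)."""
    train: int
    validation: int
    test: int
    download: int
    dataset: int

    def split_total(self) -> int:
        # Sum of the three split sizes in bytes.
        return self.train + self.validation + self.test

    def download_fraction(self) -> float:
        # Download size relative to the full dataset size.
        return self.download / self.dataset


# Values copied from the size list for cs-en and tr-en.
SIZES = {
    "cs-en": PairSizes(1_461_007_346, 674_422, 696_221, 738_874_648, 1_462_377_989),
    "tr-en": PairSizes(60_416_449, 752_765, 770_305, 37_733_844, 61_939_519),
}

for name, sizes in SIZES.items():
    # The reported dataset size should match the sum of the splits.
    assert sizes.split_total() == sizes.dataset, name
    print(f"{name}: {sizes.dataset / 1e9:.2f} GB, download fraction {sizes.download_fraction():.2f}")
```

The same check can be extended to the remaining pairs by adding their rows to `SIZES`.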
Download and Configuration
For each language pair, the configuration provides file paths such as:
- cs-en:
  - Train: cs-en/train-*
  - Validation: cs-en/validation-*
  - Test: cs-en/test-*
The same patterns apply to de-en, et-en, fi-en, ru-en, tr-en, and zh-en.
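Because every populated configuration follows the same layout, the split-to-pattern mapping can be generated rather than written out by hand. The sketch below builds that mapping and uses `fnmatch` from the standard library to decide which split a file path belongs to. The helper names are hypothetical, and the shard filename in the usage note is an invented example, since the card gives only the glob patterns.

```python
from fnmatch import fnmatch
from typing import Dict, Optional

# Configurations with data; kk-en is omitted because the card lists no files for it.
CONFIGS = ["cs-en", "de-en", "et-en", "fi-en", "ru-en", "tr-en", "zh-en"]
SPLITS = ["train", "validation", "test"]


def data_file_patterns(config: str) -> Dict[str, str]:
    """Map each split to its glob pattern, e.g. 'cs-en/train-*'."""
    return {split: f"{config}/{split}-*" for split in SPLITS}


def split_of(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if nothing matches."""
    for config in CONFIGS:
        for split, pattern in data_file_patterns(config).items():
            if fnmatch(path, pattern):
                return split
    return None
```

With these helpers, `data_file_patterns("cs-en")` reproduces the three cs-en patterns shown above, and a hypothetical path such as `de-en/validation-00000.parquet` would be classified as a validation file.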
Source
Organization: hugging_face
Created: Unknown