wmt/wmt16
WMT16 is a translation dataset based on statmt.org data. It covers six language pairs: cs-en, de-en, fi-en, ro-en, ru-en and tr-en. It falls in the 10M to 100M size category and is used primarily for translation tasks. The creators did not provide annotations; the data is extended from several source corpora, including europarl_bilingual, news_commentary, setimes and un_multi. The page reports a download size of 1.69 GB, a generated dataset size of 297.28 MB, and total disk usage of 1.99 GB; the generated size corresponds to the cs-en configuration alone (297,275,283 bytes).
Dataset description and usage context
Dataset Overview
Dataset Name: WMT16
Dataset ID: wmt-2016
Languages: Czech (cs), German (de), English (en), Finnish (fi), Romanian (ro), Russian (ru), Turkish (tr).
License Information: Unknown
Multilinguality: Translation
Size Category: 10 M < size < 100 M
Source Datasets: Extended from multiple datasets, including europarl_bilingual, news_commentary, setimes, un_multi.
Task Category: Translation
Dataset Structure
Configurations and Features
- Configuration Names: cs-en, de-en, fi-en, ro-en, ru-en, tr-en
- Features: Each configuration contains a feature named translation, of type string, covering the two languages of the configuration.
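A minimal sketch of how the translation feature might be consumed. The card lists the feature type as string; the sketch assumes the layout of the Hugging Face `datasets` release, where each record holds a `translation` mapping from language code to text. The helper names `languages_of`, `to_pair` and `load_validation` are hypothetical, and `load_validation` needs the `datasets` package and network access.

```python
CONFIGS = ("cs-en", "de-en", "fi-en", "ro-en", "ru-en", "tr-en")


def languages_of(config):
    """Split a configuration name such as 'de-en' into its two language codes."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    source, target = config.split("-")
    return source, target


def to_pair(example, config):
    """Return (source_text, target_text) from one record.

    Assumes the record looks like {"translation": {"de": "...", "en": "..."}}.
    """
    source, target = languages_of(config)
    translation = example["translation"]
    return translation[source], translation[target]


def load_validation(config="de-en"):
    """Load the validation split (requires `pip install datasets` and network access)."""
    # Imported lazily so the helpers above stay standard-library only.
    from datasets import load_dataset

    return load_dataset("wmt/wmt16", config, split="validation")


# Hand-written record in the assumed shape (not taken from the dataset).
sample = {"translation": {"de": "Guten Morgen.", "en": "Good morning."}}
print(to_pair(sample, "de-en"))
```

With the dataset available, `to_pair(load_validation("de-en")[0], "de-en")` would apply the same helper to a real record.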
Data Splits
| Configuration | Split Name | Bytes | Example Count |
|---|---|---|---|
| cs-en | train | 295995226 | 997240 |
| cs-en | validation | 572195 | 2656 |
| cs-en | test | 707862 | 2999 |
| de-en | train | 1373099816 | 4548885 |
| de-en | validation | 522981 | 2169 |
| de-en | test | 735508 | 2999 |
| fi-en | train | 605145153 | 2073394 |
| fi-en | validation | 306327 | 1370 |
| fi-en | test | 1410507 | 6000 |
| ro-en | train | 188287711 | 610320 |
| ro-en | validation | 561791 | 1999 |
| ro-en | test | 539208 | 1999 |
| ru-en | train | 448322024 | 1516162 |
| ru-en | validation | 955964 | 2818 |
| ru-en | test | 1050669 | 2998 |
| tr-en | train | 60416449 | 205756 |
| tr-en | validation | 240642 | 1001 |
| tr-en | test | 732428 | 3000 |
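The figures in the table can be aggregated with a few lines of code. The sketch below copies the table into a dictionary and computes example totals per split and the mean bytes per example; the names `SPLITS`, `examples_per_split` and `mean_bytes_per_example` are illustrative and not part of any library.

```python
# (configuration, split) -> (bytes, example count), copied from the Data Splits table.
SPLITS = {
    ("cs-en", "train"): (295995226, 997240),
    ("cs-en", "validation"): (572195, 2656),
    ("cs-en", "test"): (707862, 2999),
    ("de-en", "train"): (1373099816, 4548885),
    ("de-en", "validation"): (522981, 2169),
    ("de-en", "test"): (735508, 2999),
    ("fi-en", "train"): (605145153, 2073394),
    ("fi-en", "validation"): (306327, 1370),
    ("fi-en", "test"): (1410507, 6000),
    ("ro-en", "train"): (188287711, 610320),
    ("ro-en", "validation"): (561791, 1999),
    ("ro-en", "test"): (539208, 1999),
    ("ru-en", "train"): (448322024, 1516162),
    ("ru-en", "validation"): (955964, 2818),
    ("ru-en", "test"): (1050669, 2998),
    ("tr-en", "train"): (60416449, 205756),
    ("tr-en", "validation"): (240642, 1001),
    ("tr-en", "test"): (732428, 3000),
}


def examples_per_split():
    """Sum the example counts of all configurations, grouped by split name."""
    totals = {}
    for (_, split), (_, count) in SPLITS.items():
        totals[split] = totals.get(split, 0) + count
    return totals


def mean_bytes_per_example(config, split):
    """Average size of one example (both languages together) in bytes."""
    size, count = SPLITS[(config, split)]
    return size / count


print(examples_per_split())
# → {'train': 9951757, 'validation': 12013, 'test': 19995}
for config in ("cs-en", "de-en", "fi-en", "ro-en", "ru-en", "tr-en"):
    print(config, round(mean_bytes_per_example(config, "train"), 1))
```

The training splits total just under ten million sentence pairs, with de-en contributing almost half of them.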
Download and Dataset Sizes
| Configuration | Download Size (bytes) | Dataset Size (bytes) |
|---|---|---|
| cs-en | 178250444 | 297275283 |
| de-en | 827152589 | 1374358305 |
| fi-en | 348306427 | 606861987 |
| ro-en | 108584039 | 189388710 |
| ru-en | 231557371 | 450328657 |
| tr-en | 37389436 | 61389519 |
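As a consistency check, the dataset size of each configuration should equal the sum of its split sizes from the Data Splits table. The sketch below verifies that and reports how large each download is relative to the generated dataset; `SIZES`, `SPLIT_BYTES`, `mismatched_configs` and `download_ratio` are illustrative names.

```python
# configuration -> (download bytes, dataset bytes), from the table above.
SIZES = {
    "cs-en": (178250444, 297275283),
    "de-en": (827152589, 1374358305),
    "fi-en": (348306427, 606861987),
    "ro-en": (108584039, 189388710),
    "ru-en": (231557371, 450328657),
    "tr-en": (37389436, 61389519),
}

# configuration -> (train, validation, test) bytes, from the Data Splits table.
SPLIT_BYTES = {
    "cs-en": (295995226, 572195, 707862),
    "de-en": (1373099816, 522981, 735508),
    "fi-en": (605145153, 306327, 1410507),
    "ro-en": (188287711, 561791, 539208),
    "ru-en": (448322024, 955964, 1050669),
    "tr-en": (60416449, 240642, 732428),
}


def mismatched_configs():
    """Configurations whose dataset size differs from the sum of their split sizes."""
    return [
        config
        for config, (_, dataset_bytes) in SIZES.items()
        if sum(SPLIT_BYTES[config]) != dataset_bytes
    ]


def download_ratio(config):
    """Download size as a fraction of the generated dataset size."""
    download_bytes, dataset_bytes = SIZES[config]
    return download_bytes / dataset_bytes


print(mismatched_configs())  # → []
for config in SIZES:
    print(f"{config}: download is {download_ratio(config):.0%} of the dataset size")
```

Every configuration passes the check, so the two tables are mutually consistent.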
Dataset Creation
Source Data: The dataset is extended from multiple source datasets, including europarl_bilingual, news_commentary, setimes, un_multi.
Annotations: None.
Language Creators: Found (the text was collected from existing sources rather than written for this dataset).