wmt/wmt16
This is a translation dataset based on statmt.org data, covering the language pairs cs-en, de-en, fi-en, ro-en, ru-en and tr-en. It falls in the 10M to 100M size category and is used primarily for translation tasks. The creators provided no annotations; the data is extended from several source corpora, including europarl_bilingual, news_commentary, setimes and un_multi. The reported download size is 1.69 GB, the generated dataset size is 297.28 MB (which matches the cs-en configuration listed under Download and Dataset Sizes), and the total disk usage is 1.99 GB.
Description
Dataset Overview
Dataset Name: WMT16
Dataset ID: wmt-2016
Languages: Supports multiple languages, including Czech (cs), German (de), English (en), Finnish (fi), Romanian (ro), Russian (ru), Turkish (tr).
License Information: Unknown
Multilinguality: Translation
Size Category: 10 M < size < 100 M
Source Datasets: Extended from multiple datasets, including europarl_bilingual, news_commentary, setimes, un_multi.
Task Category: Translation
Dataset Structure
Configurations and Features
- Configuration Names: cs-en, de-en, fi-en, ro-en, ru-en, tr-en
- Features: Each configuration contains a feature named translation, of type string, covering the two languages of the configuration.
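A minimal sketch of how one of these configurations might be loaded, assuming the Hugging Face datasets library is installed and that the dataset is published under the repository ID wmt/wmt16 (taken from the title of this page). The helper names load_pair and split_pair are hypothetical, and the record layout (a translation entry mapping each language code to a string) is an assumption inferred from the feature description, not something this page confirms.

```python
# Configuration names as listed on this page.
CONFIGS = ["cs-en", "de-en", "fi-en", "ro-en", "ru-en", "tr-en"]


def load_pair(config, split="validation"):
    """Load one split of one language pair (requires network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the other helper works without the library installed.
    from datasets import load_dataset

    # Assumption: the repository ID is "wmt/wmt16".
    return load_dataset("wmt/wmt16", config, split=split)


def split_pair(record, config):
    """Return (source, target) strings from one record.

    Assumption: record["translation"] maps each language code of the
    configuration to a string.
    """
    src, tgt = config.split("-")
    pair = record["translation"]
    return pair[src], pair[tgt]
```

Under these assumptions, split_pair(load_pair("ro-en")[0], "ro-en") would return the first Romanian and English sentence pair of the validation split. Loading a validation split first keeps the initial download check small compared with the training data.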
Data Splits
| Configuration | Split Name | Bytes | Example Count |
|---|---|---|---|
| cs-en | train | 295995226 | 997240 |
| cs-en | validation | 572195 | 2656 |
| cs-en | test | 707862 | 2999 |
| de-en | train | 1373099816 | 4548885 |
| de-en | validation | 522981 | 2169 |
| de-en | test | 735508 | 2999 |
| fi-en | train | 605145153 | 2073394 |
| fi-en | validation | 306327 | 1370 |
| fi-en | test | 1410507 | 6000 |
| ro-en | train | 188287711 | 610320 |
| ro-en | validation | 561791 | 1999 |
| ro-en | test | 539208 | 1999 |
| ru-en | train | 448322024 | 1516162 |
| ru-en | validation | 955964 | 2818 |
| ru-en | test | 1050669 | 2998 |
| tr-en | train | 60416449 | 205756 |
| tr-en | validation | 240642 | 1001 |
| tr-en | test | 732428 | 3000 |
Download and Dataset Sizes
| Configuration | Download Size (bytes) | Dataset Size (bytes) |
|---|---|---|
| cs-en | 178250444 | 297275283 |
| de-en | 827152589 | 1374358305 |
| fi-en | 348306427 | 606861987 |
| ro-en | 108584039 | 189388710 |
| ru-en | 231557371 | 450328657 |
| tr-en | 37389436 | 61389519 |
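The two tables can be cross-checked against each other: for every configuration, the Dataset Size value should equal the sum of the train, validation and test byte counts. The sketch below copies the figures from the tables and runs that check; the names SPLIT_BYTES, DATASET_BYTES, check_totals and human are illustrative only, and human assumes decimal megabytes (1 MB = 1,000,000 bytes).

```python
# Bytes per split, copied from the Data Splits table: (train, validation, test).
SPLIT_BYTES = {
    "cs-en": (295995226, 572195, 707862),
    "de-en": (1373099816, 522981, 735508),
    "fi-en": (605145153, 306327, 1410507),
    "ro-en": (188287711, 561791, 539208),
    "ru-en": (448322024, 955964, 1050669),
    "tr-en": (60416449, 240642, 732428),
}

# Dataset Size column, copied from the Download and Dataset Sizes table.
DATASET_BYTES = {
    "cs-en": 297275283,
    "de-en": 1374358305,
    "fi-en": 606861987,
    "ro-en": 189388710,
    "ru-en": 450328657,
    "tr-en": 61389519,
}


def check_totals():
    """Map each configuration to True if its splits sum to its dataset size."""
    return {
        config: sum(splits) == DATASET_BYTES[config]
        for config, splits in SPLIT_BYTES.items()
    }


def human(num_bytes):
    """Format a byte count in decimal megabytes (assumption: 1 MB = 10**6 bytes)."""
    return f"{num_bytes / 1e6:.2f} MB"


for config, ok in check_totals().items():
    status = "consistent" if ok else "MISMATCH"
    print(f"{config}: {human(DATASET_BYTES[config])} ({status})")
```

With the figures as listed, all six configurations are consistent, and the cs-en dataset size formats as 297.28 MB.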
Dataset Creation
Source Data: The dataset is extended from multiple source datasets, including europarl_bilingual, news_commentary, setimes, un_multi.
Annotations: None.
Language Creators: Found (the text was collected from existing sources).
Source
Organization: hugging_face
Created: Unknown