wmt/wmt16

A machine translation dataset built from statmt.org data, with six language pairs: cs-en, de-en, fi-en, ro-en, ru-en and tr-en. Its size category is listed as 10M to 100M; this is a count bucket rather than a byte size, since the per-configuration dataset sizes in the tables below range from about 61 MB (tr-en) to about 1.37 GB (de-en). The dataset is intended for translation tasks and carries no annotations. It extends several existing corpora, including europarl_bilingual, news_commentary, setimes and un_multi. The summary figures of 1.69 GB download, 297.28 MB generated dataset and 1.99 GB total disk usage appear to describe a single configuration: 297.28 MB matches the cs-en dataset size in the table below.

Source

hugging_face

Created

Nov 28, 2025

Updated

Apr 3, 2024

Dataset Overview

Dataset Name: WMT16

Dataset ID: wmt-2016

Languages: Czech (cs), German (de), English (en), Finnish (fi), Romanian (ro), Russian (ru), Turkish (tr).

License Information: Unknown

Multilinguality: Translation

Size Category: 10M < n < 100M (a count bucket, not a byte size)

Source Datasets: Extended from multiple datasets, including europarl_bilingual, news_commentary, setimes, un_multi.

Task Category: Translation

Dataset Structure

Configurations and Features

Configuration Names: cs-en, de-en, fi-en, ro-en, ru-en, tr-en
Features: Each configuration has a single feature named translation, which holds one string per language in the pair (for example, a cs string and an en string in the cs-en configuration).
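
As a minimal sketch of the record layout described above, the snippet below models one example as a plain Python dictionary and pulls out the source and target sentences for a given configuration name. The nested shape {"translation": {language code: text}} is an assumption based on how translation features are usually exposed; the helper name split_pair and the example sentences are hypothetical and not taken from the dataset.

```python
from typing import Dict, Tuple

# Hand-written illustrative record; the sentences are NOT taken from the dataset.
# Assumed shape: one "translation" field holding one string per language code.
example: Dict[str, Dict[str, str]] = {
    "translation": {
        "ro": "Bună ziua.",
        "en": "Good day.",
    }
}


def split_pair(record: Dict[str, Dict[str, str]], config: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for a config name such as 'ro-en'."""
    # Config names follow the "<source>-<target>" pattern listed above.
    src_lang, tgt_lang = config.split("-")
    pair = record["translation"]
    return pair[src_lang], pair[tgt_lang]


source, target = split_pair(example, "ro-en")
print(source, "->", target)
```

With the Hugging Face datasets library, real records would typically come from load_dataset("wmt/wmt16", "ro-en"); that call is left out here because it requires a download.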

Data Splits

Configuration	Split Name	Bytes	Example Count
cs-en	train	295995226	997240
cs-en	validation	572195	2656
cs-en	test	707862	2999
de-en	train	1373099816	4548885
de-en	validation	522981	2169
de-en	test	735508	2999
fi-en	train	605145153	2073394
fi-en	validation	306327	1370
fi-en	test	1410507	6000
ro-en	train	188287711	610320
ro-en	validation	561791	1999
ro-en	test	539208	1999
ru-en	train	448322024	1516162
ru-en	validation	955964	2818
ru-en	test	1050669	2998
tr-en	train	60416449	205756
tr-en	validation	240642	1001
tr-en	test	732428	3000
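
To put the split figures in perspective, the following sketch computes the average bytes per training example and each configuration's share of all training pairs. The numbers are copied from the train rows of the table above; the names TRAIN and bytes_per_example are illustrative only.

```python
# Train-split figures copied from the table above: config -> (bytes, examples).
TRAIN = {
    "cs-en": (295995226, 997240),
    "de-en": (1373099816, 4548885),
    "fi-en": (605145153, 2073394),
    "ro-en": (188287711, 610320),
    "ru-en": (448322024, 1516162),
    "tr-en": (60416449, 205756),
}


def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average stored size of one sentence pair, in bytes."""
    return num_bytes / num_examples


# Total number of training pairs across all six configurations.
total_examples = sum(examples for _, examples in TRAIN.values())

for config, (num_bytes, num_examples) in TRAIN.items():
    share = num_examples / total_examples
    avg = bytes_per_example(num_bytes, num_examples)
    print(f"{config}: {avg:.1f} bytes/example, {share:.1%} of training pairs")
```

The training sets differ widely in size: de-en has roughly 22 times as many training pairs as tr-en, which is worth keeping in mind when comparing results across language pairs.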

Download and Dataset Sizes

Configuration	Download Size (bytes)	Dataset Size (bytes)
cs-en	178250444	297275283
de-en	827152589	1374358305
fi-en	348306427	606861987
ro-en	108584039	189388710
ru-en	231557371	450328657
tr-en	37389436	61389519
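
The two tables can be cross-checked against each other. The sketch below, using only figures copied from this page, verifies that each configuration's dataset size equals the sum of its train, validation and test bytes, and reports the ratio of download size to dataset size. The dictionary and function names are illustrative.

```python
# Per-split byte counts from the Data Splits table: (train, validation, test).
SPLIT_BYTES = {
    "cs-en": (295995226, 572195, 707862),
    "de-en": (1373099816, 522981, 735508),
    "fi-en": (605145153, 306327, 1410507),
    "ro-en": (188287711, 561791, 539208),
    "ru-en": (448322024, 955964, 1050669),
    "tr-en": (60416449, 240642, 732428),
}

# Figures from the size table above: (download bytes, dataset bytes).
SIZES = {
    "cs-en": (178250444, 297275283),
    "de-en": (827152589, 1374358305),
    "fi-en": (348306427, 606861987),
    "ro-en": (108584039, 189388710),
    "ru-en": (231557371, 450328657),
    "tr-en": (37389436, 61389519),
}


def splits_match(config: str) -> bool:
    """True if the split byte counts add up to the listed dataset size."""
    return sum(SPLIT_BYTES[config]) == SIZES[config][1]


for config, (download, dataset) in SIZES.items():
    ratio = download / dataset
    print(f"{config}: splits match = {splits_match(config)}, "
          f"download/dataset = {ratio:.2f}")

# Totals across all six configurations.
total_download = sum(download for download, _ in SIZES.values())
total_dataset = sum(dataset for _, dataset in SIZES.values())
print(f"total download: {total_download} bytes, total dataset: {total_dataset} bytes")
```

All six configurations pass the check, so the listed dataset size is simply the sum of the three splits.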

Dataset Creation

Source Data: The dataset is extended from multiple source datasets, including europarl_bilingual, news_commentary, setimes, un_multi.

Annotations: None.

Language Creators: Found (the text was collected from existing sources rather than written for this dataset).
