wmt/wmt16

This is a machine translation dataset built from statmt.org data. It supports six language pairs: cs-en, de-en, fi-en, ro-en, ru-en and tr-en. It falls in the 10M to 100M size category and is used primarily for translation tasks. It carries no added annotations, and the data is extended from several source corpora: europarl_bilingual, news_commentary, setimes and un_multi. The listed download size is 1.69 GB, the generated dataset size is 297.28 MB, and total disk usage is 1.99 GB. The generated size matches the cs-en configuration in the size table below, so these three figures appear to describe that configuration rather than the whole dataset.

Updated 4/3/2024
hugging_face

Description

Dataset Overview

Dataset Name: WMT16

Dataset ID: wmt-2016

Languages: Czech (cs), German (de), English (en), Finnish (fi), Romanian (ro), Russian (ru), Turkish (tr). English is one side of every language pair.

License Information: Unknown

Multilinguality: Translation

Size Category: 10 M < size < 100 M

Source Datasets: Extended from multiple datasets, including europarl_bilingual, news_commentary, setimes, un_multi.

Task Category: Translation

Dataset Structure

Configurations and Features

  • Configuration Names: cs-en, de-en, fi-en, ro-en, ru-en, tr-en
  • Features: Each configuration has a single feature named translation, which holds one string per language of the pair (for example, cs and en in the cs-en configuration).
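
As a rough illustration of the configurations and the translation feature described above, the sketch below pulls the source and target strings out of one record. It assumes the Hugging Face `datasets` library, the Hub identifier wmt/wmt16 shown at the top of this page, and a record layout in which translation maps each language code to its string. The helper names `split_pair` and `load_split` and the sample record are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Tuple

# Configuration names as listed on this page.
CONFIGS = ("cs-en", "de-en", "fi-en", "ro-en", "ru-en", "tr-en")


def split_pair(example: Dict[str, Dict[str, str]], config: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for one record of the given configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    src, tgt = config.split("-")
    # Assumed layout: {"translation": {"<src>": "...", "<tgt>": "..."}}
    translation = example["translation"]
    return translation[src], translation[tgt]


def load_split(config: str, split: str = "validation"):
    """Load one split from the Hub (requires `datasets` and network access)."""
    # Imported lazily so split_pair stays usable without the package installed.
    from datasets import load_dataset

    return load_dataset("wmt/wmt16", config, split=split)


# Hypothetical record, made up for illustration only.
sample = {"translation": {"ro": "Salut.", "en": "Hello."}}
source_text, target_text = split_pair(sample, "ro-en")
print(source_text, "->", target_text)
```

With the library installed, `split_pair(load_split("ro-en")[0], "ro-en")` would apply the same helper to a real validation record. Starting with a validation or test split keeps the first download-and-prepare step much smaller than the training split.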

Data Splits

Configuration   Split Name    Bytes         Example Count
cs-en           train         295995226     997240
cs-en           validation    572195        2656
cs-en           test          707862        2999
de-en           train         1373099816    4548885
de-en           validation    522981        2169
de-en           test          735508        2999
fi-en           train         605145153     2073394
fi-en           validation    306327        1370
fi-en           test          1410507       6000
ro-en           train         188287711     610320
ro-en           validation    561791        1999
ro-en           test          539208        1999
ru-en           train         448322024     1516162
ru-en           validation    955964        2818
ru-en           test          1050669       2998
tr-en           train         60416449      205756
tr-en           validation    240642        1001
tr-en           test          732428        3000

Download and Dataset Sizes

Configuration   Download Size (bytes)   Dataset Size (bytes)
cs-en           178250444               297275283
de-en           827152589               1374358305
fi-en           348306427               606861987
ro-en           108584039               189388710
ru-en           231557371               450328657
tr-en           37389436                61389519
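
The two tables can be cross-checked against each other: for each configuration, the dataset size should equal the sum of its split sizes. The sketch below runs that check with the figures copied from this page and prints the sizes in megabytes. The variable names are illustrative, and the conversion assumes decimal megabytes (1 MB = 1,000,000 bytes), which is consistent with the 297.28 MB figure quoted in the overview.

```python
# Byte counts copied from the Data Splits table on this page.
SPLIT_BYTES = {
    "cs-en": {"train": 295995226, "validation": 572195, "test": 707862},
    "de-en": {"train": 1373099816, "validation": 522981, "test": 735508},
    "fi-en": {"train": 605145153, "validation": 306327, "test": 1410507},
    "ro-en": {"train": 188287711, "validation": 561791, "test": 539208},
    "ru-en": {"train": 448322024, "validation": 955964, "test": 1050669},
    "tr-en": {"train": 60416449, "validation": 240642, "test": 732428},
}

# Byte counts copied from the Download and Dataset Sizes table on this page.
DOWNLOAD_BYTES = {
    "cs-en": 178250444, "de-en": 827152589, "fi-en": 348306427,
    "ro-en": 108584039, "ru-en": 231557371, "tr-en": 37389436,
}
DATASET_BYTES = {
    "cs-en": 297275283, "de-en": 1374358305, "fi-en": 606861987,
    "ro-en": 189388710, "ru-en": 450328657, "tr-en": 61389519,
}


def to_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (assumed convention)."""
    return num_bytes / 1_000_000


# Configurations whose split sizes do not add up to the listed dataset size.
mismatches = {
    cfg for cfg, splits in SPLIT_BYTES.items()
    if sum(splits.values()) != DATASET_BYTES[cfg]
}

for cfg in SPLIT_BYTES:
    print(
        f"{cfg}: download {to_mb(DOWNLOAD_BYTES[cfg]):.2f} MB, "
        f"generated {to_mb(DATASET_BYTES[cfg]):.2f} MB"
    )
print("inconsistent configurations:", sorted(mismatches))
```

An empty `mismatches` set means the split table and the size table agree for every configuration.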

Dataset Creation

Source Data: The dataset is extended from multiple source datasets, including europarl_bilingual, news_commentary, setimes, un_multi.

Annotations: None.

Language Creators: Found (the text was collected from existing sources rather than written for this dataset).

Topics

Machine Translation
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown