wmt/wmt14
WMT14 is a machine translation dataset containing parallel sentence pairs for five language pairs: Czech-English (cs-en), German-English (de-en), French-English (fr-en), Hindi-English (hi-en), and Russian-English (ru-en). The prepared dataset size ranges from about 3 MB (hi-en) to about 14.8 GB (fr-en). Each language pair has training, validation, and test splits, and every example carries a single `translation` field holding the source and target texts. The data is built from statmt.org sources, and users can customize the language pairs and data sources.
Description
Dataset Overview
Basic Information
- Name: WMT14
- Languages: cs, de, en, fr, hi, ru
- License: Unknown
- Multilinguality: Translation
- Size: 10M < n < 100M examples
Data Sources
- Extended Sources:
  - europarl_bilingual
  - giga_fren
  - news_commentary
  - un_multi
  - hind_encorp
Dataset Configurations
- Configuration Names:
  - cs-en
  - de-en
  - fr-en
  - hi-en
  - ru-en
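A minimal sketch of how one of these configurations might be selected with the Hugging Face `datasets` library. It assumes the dataset is available on the Hub under the repository ID `wmt/wmt14` (taken from the page title), that `datasets` is installed, and that network access is available. The helper names `check_config` and `load_pair` are hypothetical, and configuration names are written with plain ASCII hyphens.

```python
# Configuration names as listed above, written with plain ASCII hyphens.
CONFIGS = ("cs-en", "de-en", "fr-en", "hi-en", "ru-en")


def check_config(name: str) -> str:
    """Return the name unchanged if it is a listed configuration, else raise."""
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    return name


def load_pair(name: str, split: str = "validation"):
    """Load one split of one language pair.

    Assumption: the repository ID "wmt/wmt14" and a working network connection.
    The import is local so that check_config works without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset("wmt/wmt14", check_config(name), split=split)
```

Under these assumptions, `load_pair("hi-en")` would fetch the validation split of the smallest pair, which makes it a reasonable first smoke test before downloading fr-en.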
Features
- Feature Name: translation
- Data Type:
  - Languages:
    - cs-en: cs, en
    - de-en: de, en
    - fr-en: fr, en
    - hi-en: hi, en
    - ru-en: ru, en
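To make the `translation` feature concrete, the sketch below assumes each record stores it as a mapping from language code to sentence text, which is how the language lists above suggest it is laid out. The example sentences are hand-written placeholders, not dataset content, and `split_pair` is a hypothetical helper.

```python
from typing import Dict, Tuple

# A hand-written record in the assumed shape; the sentences are
# illustrative placeholders, not taken from the dataset.
example = {"translation": {"de": "Guten Morgen.", "en": "Good morning."}}


def split_pair(record: Dict[str, Dict[str, str]], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source_text, target_text) from one record's `translation` field."""
    pair = record["translation"]
    return pair[src], pair[tgt]


source, target = split_pair(example, src="de", tgt="en")
print(source, "->", target)
```

Because both languages sit in the same field, the translation direction is a choice made when reading the record: swapping `src` and `tgt` gives the en-to-de direction from the same data.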
Splits
- Training Set:
  - cs-en: 953,621 examples, 280,992,026 bytes
  - de-en: 4,508,785 examples, 1,358,406,800 bytes
  - fr-en: 40,836,715 examples, 14,752,522,252 bytes
  - hi-en: 32,863 examples, 1,936,003 bytes
  - ru-en: 1,486,965 examples, 433,209,078 bytes
- Validation Set:
  - cs-en: 3,000 examples, 702,465 bytes
  - de-en: 3,000 examples, 736,407 bytes
  - fr-en: 3,000 examples, 744,439 bytes
  - hi-en: 520 examples, 181,457 bytes
  - ru-en: 3,000 examples, 977,938 bytes
- Test Set:
  - cs-en: 3,003 examples, 757,809 bytes
  - de-en: 3,003 examples, 777,326 bytes
  - fr-en: 3,003 examples, 838,849 bytes
  - hi-en: 2,507 examples, 1,075,008 bytes
  - ru-en: 3,003 examples, 1,087,738 bytes
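The split figures above lend themselves to quick arithmetic, for instance total examples per configuration or the average number of bytes per example. The sketch below copies the numbers from the list; the function names are hypothetical.

```python
# (examples, bytes) per split, copied from the figures listed above.
SPLITS = {
    "cs-en": {"train": (953_621, 280_992_026), "validation": (3_000, 702_465), "test": (3_003, 757_809)},
    "de-en": {"train": (4_508_785, 1_358_406_800), "validation": (3_000, 736_407), "test": (3_003, 777_326)},
    "fr-en": {"train": (40_836_715, 14_752_522_252), "validation": (3_000, 744_439), "test": (3_003, 838_849)},
    "hi-en": {"train": (32_863, 1_936_003), "validation": (520, 181_457), "test": (2_507, 1_075_008)},
    "ru-en": {"train": (1_486_965, 433_209_078), "validation": (3_000, 977_938), "test": (3_003, 1_087_738)},
}


def total_examples(config: str) -> int:
    """Sum the example counts over all splits of one configuration."""
    return sum(n for n, _ in SPLITS[config].values())


def avg_bytes_per_example(config: str, split: str) -> float:
    """Average stored size of one example in the given split."""
    n, size = SPLITS[config][split]
    return size / n


for config in SPLITS:
    print(config, total_examples(config), round(avg_bytes_per_example(config, "train"), 1))
```

One thing this makes visible is how unbalanced the pairs are: fr-en alone accounts for the large majority of all training examples, while hi-en has fewer training examples than the other pairs by one to three orders of magnitude.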
Download and Dataset Size
- Download Size:
  - cs-en: 168,878,237 bytes
  - de-en: 818,467,512 bytes
  - fr-en: 7,777,527,744 bytes
  - hi-en: 1,583,004 bytes
  - ru-en: 223,537,244 bytes
- Dataset Size:
  - cs-en: 282,452,300 bytes
  - de-en: 1,359,920,533 bytes
  - fr-en: 14,754,105,540 bytes
  - hi-en: 3,192,468 bytes
  - ru-en: 435,274,754 bytes
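As a rough aid for planning disk space and bandwidth, the byte counts above can be converted to readable units and compared. The sketch below uses decimal units (1 MB = 10^6 bytes), which is an assumption about how the figures should be read; `human` and `download_ratio` are hypothetical helpers.

```python
# Byte counts copied from the lists above.
DOWNLOAD_BYTES = {
    "cs-en": 168_878_237,
    "de-en": 818_467_512,
    "fr-en": 7_777_527_744,
    "hi-en": 1_583_004,
    "ru-en": 223_537_244,
}
DATASET_BYTES = {
    "cs-en": 282_452_300,
    "de-en": 1_359_920_533,
    "fr-en": 14_754_105_540,
    "hi-en": 3_192_468,
    "ru-en": 435_274_754,
}


def human(n_bytes: int) -> str:
    """Format a byte count with decimal units, capped at GB."""
    value = float(n_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if value < 1000 or unit == "GB":
            return f"{value:.2f} {unit}"
        value /= 1000
    raise AssertionError("unreachable")


def download_ratio(config: str) -> float:
    """Download size as a fraction of the prepared dataset size."""
    return DOWNLOAD_BYTES[config] / DATASET_BYTES[config]


for config in DATASET_BYTES:
    print(config, human(DOWNLOAD_BYTES[config]), human(DATASET_BYTES[config]),
          f"{download_ratio(config):.2f}")
```

For every configuration the download is smaller than the prepared dataset, and each Dataset Size figure equals the sum of the training, validation, and test byte counts listed under Splits.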
Data File Configuration
- Configuration Names:
  - cs-en
  - de-en
  - fr-en
  - hi-en
  - ru-en
- Data File Paths:
  - Training: <language-pair>/train-*
  - Validation: <language-pair>/validation-*
  - Test: <language-pair>/test-*
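The path patterns above are shell-style globs, so a file can be assigned to a split by matching its path against them. The sketch below illustrates this with Python's `fnmatch` module; the shard file names are hypothetical, since the page only gives the patterns, and `split_of` is an illustrative helper.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Glob templates from the list above; {pair} stands for <language-pair>.
SPLIT_PATTERNS = {
    "train": "{pair}/train-*",
    "validation": "{pair}/validation-*",
    "test": "{pair}/test-*",
}


def split_of(path: str, pair: str) -> Optional[str]:
    """Return the split whose pattern matches the path, or None if none does."""
    for split, template in SPLIT_PATTERNS.items():
        if fnmatchcase(path, template.format(pair=pair)):
            return split
    return None


# Hypothetical shard names, used only to exercise the patterns.
files = [
    "de-en/train-00000-of-00003.parquet",
    "de-en/validation-00000-of-00001.parquet",
    "de-en/test-00000-of-00001.parquet",
]
for path in files:
    print(path, "->", split_of(path, "de-en"))
```

`fnmatchcase` is used rather than `fnmatch` so that matching is case-sensitive on every platform.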
Source
Organization: hugging_face
Created: Unknown