Dataset asset | Open Source Community | Natural Language Processing | Machine Translation

wmt/wmt14

WMT14 is a multilingual machine translation dataset with parallel text for five language pairs: Czech‑English (cs‑en), German‑English (de‑en), French‑English (fr‑en), Hindi‑English (hi‑en), and Russian‑English (ru‑en). Size varies widely by pair, from roughly 3 MB for hi‑en to roughly 15 GB for fr‑en. Each language pair provides training, validation, and test splits, and every example carries a `translation` field holding the source and target texts. The data is built from statmt.org sources, and users can customize the language pairs and data sources.

Source: hugging_face
Created: Nov 28, 2025
Updated: Apr 3, 2024

Dataset Overview

Basic Information

  • Name: WMT14
  • Languages: cs, de, en, fr, hi, ru
  • License: Unknown
  • Multilinguality: Translation
  • Size: 10M < n < 100M (number of examples)

Data Sources

  • Extended Sources:
    • europarl_bilingual
    • giga_fren
    • news_commentary
    • un_multi
    • hind_encorp

Dataset Configurations

  • Configuration Names:
    • cs‑en
    • de‑en
    • fr‑en
    • hi‑en
    • ru‑en
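
Each configuration name selects one language pair. As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the `wmt/wmt14` identifier shown at the top of this page, a single split could be loaded as follows. The helper name `load_wmt14` is hypothetical, and the configuration names are written here with plain ASCII hyphens.

```python
from typing import Any

# Configuration names from the list above, written with ASCII hyphens.
CONFIGS = ("cs-en", "de-en", "fr-en", "hi-en", "ru-en")


def load_wmt14(config: str, split: str = "validation") -> Any:
    """Load one split of one language pair (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # The second positional argument is the configuration name.
    return load_dataset("wmt/wmt14", config, split=split)
```

Calling `load_wmt14("hi-en")` would fetch the smallest configuration, which is a reasonable first test before committing to the much larger fr‑en download.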

Features

  • Feature Name: translation
  • Data Type: translation (a dictionary mapping each language code to its text)
    • Languages:
      • cs‑en: cs, en
      • de‑en: de, en
      • fr‑en: fr, en
      • hi‑en: hi, en
      • ru‑en: ru, en
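
To make the record layout concrete, the sketch below pulls the source and target texts out of one example. It assumes that each example is a dictionary whose `translation` entry maps language codes to strings, as the feature listing above suggests. The helper name `split_pair` and the sample sentence are illustrative only and are not taken from the dataset.

```python
from typing import Dict, Tuple


def split_pair(example: Dict[str, Dict[str, str]], config: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for a configuration such as 'de-en'."""
    # The configuration name encodes both language codes, e.g. 'de-en' -> 'de', 'en'.
    src_lang, tgt_lang = config.split("-")
    translation = example["translation"]
    return translation[src_lang], translation[tgt_lang]


# Made-up example in the assumed record shape.
example = {"translation": {"de": "Guten Morgen.", "en": "Good morning."}}
source_text, target_text = split_pair(example, "de-en")
print(source_text, "->", target_text)
```

Because both directions live in the same record, translating from English into the other language only requires swapping the two returned values.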

Splits

  • Training Set:
    • cs‑en: 953,621 examples, 280,992,026 bytes
    • de‑en: 4,508,785 examples, 1,358,406,800 bytes
    • fr‑en: 40,836,715 examples, 14,752,522,252 bytes
    • hi‑en: 32,863 examples, 1,936,003 bytes
    • ru‑en: 1,486,965 examples, 433,209,078 bytes
  • Validation Set:
    • cs‑en: 3,000 examples, 702,465 bytes
    • de‑en: 3,000 examples, 736,407 bytes
    • fr‑en: 3,000 examples, 744,439 bytes
    • hi‑en: 520 examples, 181,457 bytes
    • ru‑en: 3,000 examples, 977,938 bytes
  • Test Set:
    • cs‑en: 3,003 examples, 757,809 bytes
    • de‑en: 3,003 examples, 777,326 bytes
    • fr‑en: 3,003 examples, 838,849 bytes
    • hi‑en: 2,507 examples, 1,075,008 bytes
    • ru‑en: 3,003 examples, 1,087,738 bytes
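
The split counts above can be totalled to see how the examples are distributed. The following sketch copies the example counts from the listing and computes per-configuration totals, the share held by the training split, and the overall total. The variable names are arbitrary.

```python
# Example counts per split, copied from the listing above.
SPLIT_EXAMPLES = {
    "cs-en": {"train": 953_621, "validation": 3_000, "test": 3_003},
    "de-en": {"train": 4_508_785, "validation": 3_000, "test": 3_003},
    "fr-en": {"train": 40_836_715, "validation": 3_000, "test": 3_003},
    "hi-en": {"train": 32_863, "validation": 520, "test": 2_507},
    "ru-en": {"train": 1_486_965, "validation": 3_000, "test": 3_003},
}

# Total examples per configuration and across all configurations.
totals = {config: sum(splits.values()) for config, splits in SPLIT_EXAMPLES.items()}
grand_total = sum(totals.values())

for config, total in totals.items():
    train_share = SPLIT_EXAMPLES[config]["train"] / total
    print(f"{config}: {total:,} examples, {train_share:.2%} in train")
print(f"all configurations: {grand_total:,} examples")
```

The overall total falls between 10 million and 100 million examples, with fr‑en contributing most of it. The training split holds over 99% of the examples in every configuration except hi‑en, where it holds roughly 92%.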

Download and Dataset Size

  • Download Size:
    • cs‑en: 168,878,237 bytes
    • de‑en: 818,467,512 bytes
    • fr‑en: 7,777,527,744 bytes
    • hi‑en: 1,583,004 bytes
    • ru‑en: 223,537,244 bytes
  • Dataset Size:
    • cs‑en: 282,452,300 bytes
    • de‑en: 1,359,920,533 bytes
    • fr‑en: 14,754,105,540 bytes
    • hi‑en: 3,192,468 bytes
    • ru‑en: 435,274,754 bytes
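
The byte counts on this page can be cross-checked with simple arithmetic. The sketch below, using only numbers copied from the listings above, tests whether each Dataset Size equals the sum of its three split sizes and computes the ratio of download size to dataset size. Reading that ratio as a compression ratio is an assumption; the page does not say how the download is packaged.

```python
# Bytes per split, copied from the Splits listing.
SPLIT_BYTES = {
    "cs-en": {"train": 280_992_026, "validation": 702_465, "test": 757_809},
    "de-en": {"train": 1_358_406_800, "validation": 736_407, "test": 777_326},
    "fr-en": {"train": 14_752_522_252, "validation": 744_439, "test": 838_849},
    "hi-en": {"train": 1_936_003, "validation": 181_457, "test": 1_075_008},
    "ru-en": {"train": 433_209_078, "validation": 977_938, "test": 1_087_738},
}
# Download and dataset sizes, copied from the listing above.
DOWNLOAD_BYTES = {
    "cs-en": 168_878_237, "de-en": 818_467_512, "fr-en": 7_777_527_744,
    "hi-en": 1_583_004, "ru-en": 223_537_244,
}
DATASET_BYTES = {
    "cs-en": 282_452_300, "de-en": 1_359_920_533, "fr-en": 14_754_105_540,
    "hi-en": 3_192_468, "ru-en": 435_274_754,
}

# Does each dataset size equal the sum of its split sizes?
consistent = {c: sum(SPLIT_BYTES[c].values()) == DATASET_BYTES[c] for c in SPLIT_BYTES}
# Download size as a fraction of the prepared dataset size.
ratios = {c: DOWNLOAD_BYTES[c] / DATASET_BYTES[c] for c in DATASET_BYTES}

for config in DATASET_BYTES:
    print(f"{config}: sizes consistent={consistent[config]}, download/dataset={ratios[config]:.2f}")
```

With these figures, every dataset size matches the sum of its splits, and each download is roughly 50% to 60% of the corresponding dataset size.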

Data File Configuration

  • Configuration Names:
    • cs‑en
    • de‑en
    • fr‑en
    • hi‑en
    • ru‑en
  • Data File Paths:
    • Training: <language‑pair>/train-*
    • Validation: <language‑pair>/validation-*
    • Test: <language‑pair>/test-*
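
The path patterns above are glob-style, with one subdirectory per configuration. As a rough illustration, the sketch below builds the three patterns for a configuration and matches a file path against them with the standard library. The shard file name in the example is hypothetical, since the page gives only the patterns and not the actual file names or format.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional

SPLITS = ("train", "validation", "test")


def data_file_patterns(config: str) -> Dict[str, str]:
    """Build the per-split glob patterns for one configuration."""
    return {split: f"{config}/{split}-*" for split in SPLITS}


def split_for(path: str, config: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in data_file_patterns(config).items():
        # fnmatchcase matches case-sensitively on every platform.
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical shard name, used only to exercise the pattern.
print(split_for("de-en/train-00000-of-00003.parquet", "de-en"))
```

A path under a different configuration directory matches none of the patterns, so `split_for` returns `None` for it.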