wmt/wmt14

WMT14 is a machine translation dataset with parallel text for five language pairs: Czech‑English (cs‑en), German‑English (de‑en), French‑English (fr‑en), Hindi‑English (hi‑en) and Russian‑English (ru‑en). On-disk size depends heavily on the pair, from about 3 MB for hi‑en to about 14.8 GB for fr‑en. Every pair comes with training, validation, and test splits, and each example has a `translation` field containing the source and target texts. The dataset is built from statmt.org data and allows users to customize language pairs and data sources.

Updated 4/3/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Name: WMT14
  • Languages: cs, de, en, fr, hi, ru
  • License: Unknown
  • Multilinguality: Translation
  • Size category: 10M < n < 100M (number of examples)

Data Sources

  • Extended Sources:
    • europarl_bilingual
    • giga_fren
    • news_commentary
    • un_multi
    • hind_encorp

Dataset Configurations

  • Configuration Names:
    • cs-en
    • de-en
    • fr-en
    • hi-en
    • ru-en
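
As a minimal sketch of how one of these configurations might be selected, the snippet below validates a language pair and passes it to `load_dataset` from the Hugging Face `datasets` library. The helper names `config_name` and `load_wmt14` are hypothetical, the repository id `wmt/wmt14` is taken from the page title, and the configuration names are assumed to use a plain ASCII hyphen. Loading requires the `datasets` package and network access, so the import is deferred until it is needed.

```python
# Configuration names as listed above (plain ASCII hyphen assumed).
WMT14_CONFIGS = ("cs-en", "de-en", "fr-en", "hi-en", "ru-en")


def config_name(src: str, tgt: str = "en") -> str:
    """Return the configuration name for a language pair, e.g. 'de-en'."""
    name = f"{src}-{tgt}"
    if name not in WMT14_CONFIGS:
        raise ValueError(f"unknown WMT14 configuration: {name!r}")
    return name


def load_wmt14(src: str, split: str = "validation"):
    """Load one split of one language pair.

    Hypothetical convenience wrapper: needs the `datasets` package and
    network access, so the import happens only when this is called.
    """
    from datasets import load_dataset

    return load_dataset("wmt/wmt14", config_name(src), split=split)
```

Starting with a validation or test split keeps the first download small; the fr-en training split alone is several gigabytes.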

Features

  • Feature Name: translation
  • Data Type: translation, with one text per language code
    • Languages per configuration:
      • cs‑en: cs, en
      • de‑en: de, en
      • fr‑en: fr, en
      • hi‑en: hi, en
      • ru‑en: ru, en
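
To illustrate the feature layout, here is a small sketch that assumes each example is a dictionary whose `translation` field maps language codes to texts. The record below is a made-up placeholder, not a row from the dataset, and `split_pair` is a hypothetical helper.

```python
def split_pair(record: dict, src: str, tgt: str = "en") -> tuple[str, str]:
    """Return (source_text, target_text) from one example."""
    translation = record["translation"]
    return translation[src], translation[tgt]


# Placeholder record in the assumed shape; the sentences are invented.
example = {"translation": {"de": "Guten Morgen.", "en": "Good morning."}}

source, target = split_pair(example, "de")
```

Because both texts sit under language-code keys, the same helper covers either translation direction by swapping `src` and `tgt`.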

Splits

  • Training Set:
    • cs‑en: 953,621 examples, 280,992,026 bytes
    • de‑en: 4,508,785 examples, 1,358,406,800 bytes
    • fr‑en: 40,836,715 examples, 14,752,522,252 bytes
    • hi‑en: 32,863 examples, 1,936,003 bytes
    • ru‑en: 1,486,965 examples, 433,209,078 bytes
  • Validation Set:
    • cs‑en: 3,000 examples, 702,465 bytes
    • de‑en: 3,000 examples, 736,407 bytes
    • fr‑en: 3,000 examples, 744,439 bytes
    • hi‑en: 520 examples, 181,457 bytes
    • ru‑en: 3,000 examples, 977,938 bytes
  • Test Set:
    • cs‑en: 3,003 examples, 757,809 bytes
    • de‑en: 3,003 examples, 777,326 bytes
    • fr‑en: 3,003 examples, 838,849 bytes
    • hi‑en: 2,507 examples, 1,075,008 bytes
    • ru‑en: 3,003 examples, 1,087,738 bytes
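
The example counts above can be tabulated to see how unevenly the pairs are sized. The sketch below copies the counts from the list and derives per-pair totals and the share of examples that fall in the training split; the function names are illustrative only.

```python
# Example counts per split, copied from the list above.
SPLIT_EXAMPLES = {
    "cs-en": {"train": 953_621, "validation": 3_000, "test": 3_003},
    "de-en": {"train": 4_508_785, "validation": 3_000, "test": 3_003},
    "fr-en": {"train": 40_836_715, "validation": 3_000, "test": 3_003},
    "hi-en": {"train": 32_863, "validation": 520, "test": 2_507},
    "ru-en": {"train": 1_486_965, "validation": 3_000, "test": 3_003},
}


def total_examples(pair: str) -> int:
    """Sum the examples of all three splits for one language pair."""
    return sum(SPLIT_EXAMPLES[pair].values())


def train_share(pair: str) -> float:
    """Fraction of a pair's examples that belong to the training split."""
    return SPLIT_EXAMPLES[pair]["train"] / total_examples(pair)


grand_total = sum(total_examples(pair) for pair in SPLIT_EXAMPLES)
```

hi-en stands out: its test set is large relative to its training set, so the training share is noticeably lower than for the other pairs.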

Download and Dataset Size

  • Download Size:
    • cs‑en: 168,878,237 bytes
    • de‑en: 818,467,512 bytes
    • fr‑en: 7,777,527,744 bytes
    • hi‑en: 1,583,004 bytes
    • ru‑en: 223,537,244 bytes
  • Dataset Size:
    • cs‑en: 282,452,300 bytes
    • de‑en: 1,359,920,533 bytes
    • fr‑en: 14,754,105,540 bytes
    • hi‑en: 3,192,468 bytes
    • ru‑en: 435,274,754 bytes
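
A quick consistency check on these figures: for every pair, the dataset size equals the sum of the split sizes listed under Splits. The sketch below verifies that and reports sizes in decimal units (1 GB = 10^9 bytes, an assumption about the intended unit); `human` and `download_ratio` are hypothetical helpers.

```python
# (train, validation, test) bytes per pair, copied from the Splits list.
SPLIT_BYTES = {
    "cs-en": (280_992_026, 702_465, 757_809),
    "de-en": (1_358_406_800, 736_407, 777_326),
    "fr-en": (14_752_522_252, 744_439, 838_849),
    "hi-en": (1_936_003, 181_457, 1_075_008),
    "ru-en": (433_209_078, 977_938, 1_087_738),
}

# (download, dataset) bytes per pair, copied from the list above.
SIZES = {
    "cs-en": (168_878_237, 282_452_300),
    "de-en": (818_467_512, 1_359_920_533),
    "fr-en": (7_777_527_744, 14_754_105_540),
    "hi-en": (1_583_004, 3_192_468),
    "ru-en": (223_537_244, 435_274_754),
}


def human(n_bytes: int) -> str:
    """Format a byte count using decimal units."""
    value = float(n_bytes)
    for unit in ("B", "KB", "MB"):
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} GB"


def download_ratio(pair: str) -> float:
    """Download size as a fraction of the prepared dataset size."""
    download, dataset = SIZES[pair]
    return download / dataset


# Pairs whose split sizes do not add up to the dataset size (none expected).
mismatches = [p for p in SIZES if sum(SPLIT_BYTES[p]) != SIZES[p][1]]
```

For every pair the download is roughly half to two thirds of the prepared dataset size, which helps when budgeting disk space for both.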

Data File Configuration

  • Configuration Names:
    • cs-en
    • de-en
    • fr-en
    • hi-en
    • ru-en
  • Data File Paths:
    • Training: <language-pair>/train-*
    • Validation: <language-pair>/validation-*
    • Test: <language-pair>/test-*
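
The path patterns above can be expanded per configuration and matched against file names with the standard library. This is a sketch under assumptions: the shard file name used in the usage note is hypothetical, since the page only gives the `train-*` style patterns, and `data_files` and `split_of` are illustrative names.

```python
from fnmatch import fnmatch
from typing import Optional

SPLITS = ("train", "validation", "test")


def data_files(config: str) -> dict[str, str]:
    """Map each split to its glob pattern for one configuration."""
    return {split: f"{config}/{split}-*" for split in SPLITS}


def split_of(path: str, config: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None."""
    for split, pattern in data_files(config).items():
        if fnmatch(path, pattern):
            return split
    return None
```

For instance, `split_of("de-en/train-00000-of-00003.parquet", "de-en")` would resolve to the training split, while the same path checked against `"fr-en"` matches nothing.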

Topics

Machine Translation
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
