
wmt/wmt18

WMT18 is a multilingual machine‑translation corpus of parallel text for eight language pairs, each pairing English with one of Czech, German, Estonian, Finnish, Kazakh, Russian, Turkish, or Chinese. The data is divided into training, validation, and test splits, and split sizes vary widely between language pairs. Sources include Europarl, News Commentary, OPUS ParaCrawl, SETimes, and UN Multi. The dataset supports machine‑translation research: users can select any language pair and any subset of the data to build a custom corpus.

Updated 4/3/2024
hugging_face

Description

Dataset Overview

Dataset Name: WMT18

Dataset ID: wmt-2018

Languages: cs, de, en, et, fi, kk, ru, tr, zh (ISO 639-1 codes).

License: Unknown

Multilinguality: Translation (parallel text across language pairs)

Size Category: 10M<n<100M

Source Datasets: Extended versions of europarl_bilingual, news_commentary, opus_paracrawl, setimes, and un_multi.

Task Category: Translation

Structure

Configuration Names and Language Pairs:

  • cs‑en: Czech & English
  • de‑en: German & English
  • et‑en: Estonian & English
  • fi‑en: Finnish & English
  • kk‑en: Kazakh & English
  • ru‑en: Russian & English
  • tr‑en: Turkish & English
  • zh‑en: Chinese & English
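The configuration names above can be used to select a language pair when loading the data. The following is a minimal Python sketch, assuming the Hugging Face `datasets` library is installed, that the repository ID is `wmt/wmt18` as shown at the top of this page, and that the Hub spells configuration names with plain ASCII hyphens. The helper names `config_name` and `load_pair` are hypothetical.

```python
# Configuration names as listed above, written with ASCII hyphens
# (an assumption about how the Hub spells them).
WMT18_CONFIGS = (
    "cs-en", "de-en", "et-en", "fi-en",
    "kk-en", "ru-en", "tr-en", "zh-en",
)


def config_name(src: str, tgt: str = "en") -> str:
    """Return the configuration name for a language pair, e.g. 'cs-en'."""
    name = f"{src}-{tgt}"
    if name not in WMT18_CONFIGS:
        raise ValueError(f"unknown WMT18 configuration: {name!r}")
    return name


def load_pair(src: str, split: str = "validation"):
    """Load one split of one language pair.

    Requires the `datasets` package and network access; the import is
    deferred so that `config_name` stays usable without either.
    """
    from datasets import load_dataset

    return load_dataset("wmt/wmt18", config_name(src), split=split)
```

Calling `load_pair("tr")` would fetch the Turkish and English validation split. Starting with a validation or test split keeps the first download small, since the training splits run to several gigabytes for the larger pairs.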

Size and Split Details:

  • cs‑en:

    • Train: 11 046 024 examples, 1 461 007 346 bytes
    • Validation: 3 005 examples, 674 422 bytes
    • Test: 2 983 examples, 696 221 bytes
    • Download size: 738 874 648 bytes
    • Dataset size: 1 462 377 989 bytes
  • de‑en:

    • Train: 42 271 874 examples, 8 187 518 284 bytes
    • Validation: 3 004 examples, 729 511 bytes
    • Test: 2 998 examples, 757 641 bytes
    • Download size: 4 436 297 213 bytes
    • Dataset size: 8 189 005 436 bytes
  • et‑en:

    • Train: 2 175 873 examples, 647 990 923 bytes
    • Validation: 2 000 examples, 459 390 bytes
    • Test: 2 000 examples, 489 386 bytes
    • Download size: 283 931 426 bytes
    • Dataset size: 648 939 699 bytes
  • fi‑en:

    • Train: 3 280 600 examples, 857 169 249 bytes
    • Validation: 6 004 examples, 1 388 820 bytes
    • Test: 3 000 examples, 691 833 bytes
    • Download size: 488 708 706 bytes
    • Dataset size: 859 249 902 bytes
  • kk‑en: No data provided (download size 0, dataset size 0).

  • ru‑en:

    • Train: 36 858 512 examples, 13 665 338 159 bytes
    • Validation: 3 001 examples, 1 040 187 bytes
    • Test: 3 000 examples, 1 085 588 bytes
    • Download size: 6 130 744 133 bytes
    • Dataset size: 13 667 463 934 bytes
  • tr‑en:

    • Train: 205 756 examples, 60 416 449 bytes
    • Validation: 3 007 examples, 752 765 bytes
    • Test: 3 000 examples, 770 305 bytes
    • Download size: 37 733 844 bytes
    • Dataset size: 61 939 519 bytes
  • zh‑en:

    • Train: 25 160 346 examples, 6 342 987 000 bytes
    • Validation: 2 001 examples, 540 339 bytes
    • Test: 3 981 examples, 1 107 514 bytes
    • Download size: 3 581 074 494 bytes
    • Dataset size: 6 344 634 853 bytes
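The figures above can be cross-checked: for each configuration with data, the dataset size should equal the sum of its train, validation, and test byte counts. The sketch below copies the byte counts from the list and performs that check, and also computes how large the download is relative to the prepared dataset. The names `SIZES`, `is_consistent`, and `download_ratio` are illustrative only.

```python
# Byte counts copied from the list above, per configuration:
# (train, validation, test, dataset size, download size)
SIZES = {
    "cs-en": (1_461_007_346, 674_422, 696_221, 1_462_377_989, 738_874_648),
    "de-en": (8_187_518_284, 729_511, 757_641, 8_189_005_436, 4_436_297_213),
    "et-en": (647_990_923, 459_390, 489_386, 648_939_699, 283_931_426),
    "fi-en": (857_169_249, 1_388_820, 691_833, 859_249_902, 488_708_706),
    "ru-en": (13_665_338_159, 1_040_187, 1_085_588, 13_667_463_934, 6_130_744_133),
    "tr-en": (60_416_449, 752_765, 770_305, 61_939_519, 37_733_844),
    "zh-en": (6_342_987_000, 540_339, 1_107_514, 6_344_634_853, 3_581_074_494),
}


def is_consistent(config: str) -> bool:
    """True if the three split sizes add up to the reported dataset size."""
    train, validation, test, dataset, _download = SIZES[config]
    return train + validation + test == dataset


def download_ratio(config: str) -> float:
    """Download size as a fraction of the prepared dataset size."""
    *_splits, dataset, download = SIZES[config]
    return download / dataset


for config in SIZES:
    print(f"{config}: consistent={is_consistent(config)}, "
          f"download/dataset={download_ratio(config):.2f}")
```

With the figures listed here, every configuration passes the consistency check, and the download is roughly 44% to 61% of the prepared dataset size, which is useful when estimating disk space for both the cached download and the prepared data.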

Download and Configuration

Each configuration maps its splits to glob-style file path patterns. For example:

  • cs‑en:
    • Train: cs‑en/train-*
    • Validation: cs‑en/validation-*
    • Test: cs‑en/test-*

Similar patterns apply to de‑en, et‑en, fi‑en, ru‑en, tr‑en, and zh‑en.
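The path patterns above follow one template, `<config>/<split>-*`. As a rough illustration, the Python sketch below builds the three patterns for a configuration and uses the standard-library `fnmatch` module to decide which split a given file path belongs to. The function names and the shard file name in the usage note are hypothetical; only the pattern template comes from this page.

```python
from fnmatch import fnmatch
from typing import Dict, Optional

SPLITS = ("train", "validation", "test")


def split_patterns(config: str) -> Dict[str, str]:
    """Build the glob pattern for each split, e.g. 'cs-en/train-*'."""
    return {split: f"{config}/{split}-*" for split in SPLITS}


def split_of(path: str, config: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in split_patterns(config).items():
        if fnmatch(path, pattern):
            return split
    return None
```

For instance, `split_of("cs-en/train-00000-of-00003.parquet", "cs-en")` returns `"train"`, while the same path checked against `"de-en"` returns `None`. The ASCII hyphen in the configuration names is an assumption about how the files are actually named.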

Topics

Machine Translation
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
