wmt/wmt20_mlqe_task1
This dataset is part of the WMT20 Multilingual Quality Estimation (MLQE) task, which evaluates the quality of neural machine translation (NMT) output without reference translations. It contains source sentences and their machine translations for seven language directions (e.g., en‑de, en‑zh), drawn primarily from Wikipedia with some from Reddit. Professional translators annotated each sentence with Direct Assessment (DA) scores from 0 to 100. Each configuration is split into training (7 k), validation (1 k), and test (1 k) sets, and the dataset is intended for research on automatic quality estimation of NMT systems.
Dataset Overview
Dataset Name
- Name: WMT20 – Multilingual Quality Estimation (MLQE) Task 1
- Alias: MLQE‑Task1
Summary
- Purpose: Evaluate neural‑machine‑translation output quality without reference translations.
- Content: Multilingual translation data primarily from Wikipedia, with some from Reddit.
- Languages: German, English, Estonian, Nepali, Romanian, Russian, Sinhalese, Chinese.
Structure
- Configurations: en‑de, en‑zh, et‑en, ne‑en, ro‑en, ru‑en, si‑en
- Features: segid, translation, scores, mean, z_scores, z_mean, model_score, doc_id, nmt_output, word_probas.
- Splits: train (7 k), validation (1 k), test (1 k) per configuration.
Creation
- Source: Wikipedia and Reddit; translated using fairseq NMT models; scored by professional translators using Direct Assessment.
- Scoring: Each sentence receives at least three DA scores (0‑100).
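The scoring constraints above (at least three DA scores per sentence, each between 0 and 100) can be checked with a small helper. This is only a sketch: the function name `check_da_scores` and its parameters are hypothetical and are not part of the dataset or of any library.

```python
from typing import List, Sequence


def check_da_scores(
    scores: Sequence[float],
    min_annotations: int = 3,
    low: float = 0.0,
    high: float = 100.0,
) -> List[str]:
    """Return a list of problems found in one sentence's DA scores.

    An empty list means the scores satisfy the constraints stated in the
    dataset card (at least three scores, each within 0-100).
    """
    problems = []
    if len(scores) < min_annotations:
        problems.append(
            f"expected at least {min_annotations} scores, got {len(scores)}"
        )
    out_of_range = [s for s in scores if not low <= s <= high]
    if out_of_range:
        problems.append(f"scores outside [{low}, {high}]: {out_of_range}")
    return problems


# Illustrative values only, not taken from the dataset.
print(check_da_scores([80.0, 65.0, 72.5]))
print(check_da_scores([80.0, 120.0]))
```

Returning a list of messages rather than raising lets a caller scan a whole split and report every irregular record at once.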
Usage Considerations
- License: Unknown.
- Metrics: Pearson correlation between predicted scores and human DA.
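As an illustration of the metric, the following is a minimal pure-Python sketch of Pearson correlation between a system's predicted scores and human DA scores. The function name `pearson` and the sample values are hypothetical; in practice a library routine such as `scipy.stats.pearsonr` would normally be used instead.

```python
import math
from typing import Sequence


def pearson(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(human):
        raise ValueError("both sequences must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two points are required")

    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n

    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_p * var_h)


# Illustrative values only: hypothetical predictions vs. human DA means.
predicted = [0.2, 0.5, 0.9, 0.4]
human = [35.0, 60.0, 88.0, 52.0]
print(round(pearson(predicted, human), 3))
```

Because Pearson correlation is invariant to shifting and positive rescaling, predictions do not need to be on the 0-100 DA scale to be evaluated this way.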
Additional Information
- Contributors: Thanks to @VictorSanh for adding the dataset.
Detailed File Information
- File Sizes: (bytes) – en‑de: 4 539 012, en‑zh: 4 269 820, etc.
- Download Sizes: (bytes) – en‑de: 3 293 699, en‑zh: 3 325 683, etc.
Feature Definitions
- segid: int32 – segment identifier
- translation: pair of strings – source and target text, keyed by language code
- scores: list of float32 – individual DA scores
- mean: float32 – average score
- z_scores: list of float32 – z‑standardized scores
- z_mean: float32 – mean of z‑scores
- model_score: float32 – model‑predicted score
- doc_id: string – document identifier
- nmt_output: string – NMT system output
- word_probas: list of float32 – word‑level probabilities
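To show how the four score fields relate, here is a hedged sketch that derives `scores`, `mean`, `z_scores`, and `z_mean` from raw annotations. It assumes that z‑standardization is applied per annotator and uses the population standard deviation; both are assumptions, since this card does not state the exact procedure. The annotator ids, sample values, and the function name `aggregate_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict

# Hypothetical raw DA scores: annotator id -> {segid: score in [0, 100]}.
RAW_SCORES: Dict[str, Dict[int, float]] = {
    "annotator_1": {0: 80.0, 1: 60.0, 2: 40.0},
    "annotator_2": {0: 90.0, 1: 70.0, 2: 50.0},
    "annotator_3": {0: 70.0, 1: 70.0, 2: 10.0},
}


def aggregate_scores(raw: Dict[str, Dict[int, float]]) -> Dict[int, dict]:
    """Build per-segment score fields from per-annotator raw scores."""
    scores = defaultdict(list)
    z_scores = defaultdict(list)
    for by_segment in raw.values():
        values = list(by_segment.values())
        mu = mean(values)
        sigma = pstdev(values)  # population std; an assumption of this sketch
        for segid, score in by_segment.items():
            scores[segid].append(score)
            # Standardize against this annotator's own mean and spread.
            z_scores[segid].append((score - mu) / sigma if sigma > 0 else 0.0)
    return {
        segid: {
            "scores": scores[segid],
            "mean": mean(scores[segid]),
            "z_scores": z_scores[segid],
            "z_mean": mean(z_scores[segid]),
        }
        for segid in sorted(scores)
    }


records = aggregate_scores(RAW_SCORES)
for segid, fields in records.items():
    print(segid, fields["mean"], round(fields["z_mean"], 3))
```

Standardizing per annotator removes differences in how strictly individual raters use the 0‑100 scale, which is why `z_mean` is often preferred over `mean` as a prediction target.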