wmt/wmt20_mlqe_task1
This dataset is part of the WMT20 Multilingual Quality Estimation (MLQE) task, used to evaluate the quality of neural machine translation outputs without reference translations. It includes translation pairs for several language directions (e.g., en‑de, en‑zh) sourced from Wikipedia and Reddit. Each sentence is annotated with Direct Assessment (DA) scores ranging from 0 to 100 by professional translators. The dataset is split into training, validation, and test sets (7 k training, 1 k validation, 1 k test per configuration) and is intended for research on automatic quality estimation of NMT systems.
Description
Dataset Overview
Dataset Name
- Name: WMT20 – MultiLingual Quality Estimation (MLQE) Task 1
- Alias: MLQE‑Task1
Summary
- Purpose: Evaluate neural‑machine‑translation output quality without reference translations.
- Content: Multilingual translation data primarily from Wikipedia, with some from Reddit.
- Languages: German, English, Estonian, Nepali, Romanian, Russian, Sinhalese, Chinese.
Structure
- Configurations: en‑de, en‑zh, et‑en, ne‑en, ro‑en, ru‑en, si‑en
- Features: segid, translation, scores, mean, z_scores, z_mean, model_score, doc_id, nmt_output, word_probas.
- Splits: train (7 k), validation (1 k), test (1 k) per configuration.
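The configurations and splits listed above can be written down as plain constants. This is a minimal sketch: the repository id is taken from the page title, the config names are assumed to use ASCII hyphens, and `load_config` is a hypothetical helper that relies on the Hugging Face `datasets` package and network access.

```python
# Language-pair configurations and nominal split sizes as listed on this page.
CONFIGS = ["en-de", "en-zh", "et-en", "ne-en", "ro-en", "ru-en", "si-en"]
SPLIT_SIZES = {"train": 7000, "validation": 1000, "test": 1000}


def languages(configs=CONFIGS):
    """Return the sorted set of language codes covered by the configurations."""
    return sorted({code for config in configs for code in config.split("-")})


def load_config(config, repo_id="wmt/wmt20_mlqe_task1"):
    """Load one language pair (hypothetical helper).

    Assumes the `datasets` package is installed and the Hub is reachable;
    the import is deferred so the constants above stay usable offline.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    from datasets import load_dataset
    return load_dataset(repo_id, config)
```

Calling `languages()` yields the eight language codes that correspond to the Languages bullet, and `load_config("en-de")` would return the train, validation, and test splits for that pair if the assumptions hold.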
Creation
- Source: Wikipedia and Reddit; translated using fairseq NMT models; scored by professional translators using Direct Assessment.
- Scoring: Each sentence receives at least three DA scores (0‑100).
Usage Considerations
- License: Unknown.
- Metrics: Pearson correlation between predicted scores and human DA.
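As an illustration of the metric named above, the following sketch computes the Pearson correlation between predicted scores and human DA scores in pure Python. The function name `pearson` and the input values in the usage note are illustrative, not part of the dataset.

```python
import math


def pearson(predicted, human):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_h = sum(human) / len(human)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_h)
```

For example, `pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` is 1.0 up to floating-point rounding, since the second list is a positive linear function of the first.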
Additional Information
- Contributors: Thanks to @VictorSanh for adding the dataset.
Detailed File Information
- File Sizes (bytes): en‑de 4 539 012; en‑zh 4 269 820; etc.
- Download Sizes (bytes): en‑de 3 293 699; en‑zh 3 325 683; etc.
Feature Definitions
- segid: int32 – segment identifier
- translation: pair of strings – source and target text
- scores: list of float32 – individual DA scores
- mean: float32 – average of the DA scores
- z_scores: list of float32 – z‑standardized DA scores
- z_mean: float32 – mean of z‑scores
- model_score: float32 – model‑predicted score
- doc_id: string – document identifier
- nmt_output: string – NMT system output
- word_probas: list of float32 – word‑level probabilities
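The relationship between scores, mean, z_scores, and z_mean can be sketched as follows. The listing does not say over which population the scores are standardized, so this sketch takes each score's reference mean and standard deviation as explicit inputs. The function names and the `annotator_stats` argument are assumptions made for illustration.

```python
from statistics import fmean


def z_standardize(score, mu, sigma):
    """Standardize one score against a reference mean and standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (score - mu) / sigma


def summarize(scores, annotator_stats):
    """Derive the aggregate fields for one segment.

    scores: raw DA scores (0-100) for the segment.
    annotator_stats: one (mu, sigma) pair per score, in the same order
    (hypothetical input; the dataset ships the z-scores precomputed).
    """
    if len(scores) != len(annotator_stats):
        raise ValueError("need one (mu, sigma) pair per score")
    z_scores = [z_standardize(s, mu, sd) for s, (mu, sd) in zip(scores, annotator_stats)]
    return {
        "scores": list(scores),
        "mean": fmean(scores),
        "z_scores": z_scores,
        "z_mean": fmean(z_scores),
    }
```

With three scores of 60, 70, and 80 and a shared reference of mean 50 and standard deviation 10, the sketch gives a mean of 70 and a z_mean of 2.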
Source
Organization: hugging_face
Created: Unknown