wmt/wmt20_mlqe_task1

This dataset is part of the WMT20 Multilingual Quality Estimation (MLQE) shared task, which evaluates the quality of neural machine translation (NMT) output without reference translations. It covers seven language directions (en‑de, en‑zh, et‑en, ne‑en, ro‑en, ru‑en, si‑en), with source sentences drawn from Wikipedia and Reddit. Professional translators annotated each sentence with Direct Assessment (DA) scores on a 0 to 100 scale. Each configuration is split into 7,000 training, 1,000 validation, and 1,000 test sentences, and the dataset is intended for research on automatic quality estimation of NMT systems.

Updated 4/4/2024

hugging_face

Description

Dataset Overview

Dataset Name

Name: WMT20 – Multilingual Quality Estimation (MLQE) Task 1
Alias: MLQE‑Task1

Summary

Purpose: Evaluate neural‑machine‑translation output quality without reference translations.
Content: Multilingual translation data primarily from Wikipedia, with some from Reddit.
Languages: German, English, Estonian, Nepali, Romanian, Russian, Sinhalese, Chinese.

Structure

Configurations: en‑de, en‑zh, et‑en, ne‑en, ro‑en, ru‑en, si‑en
Features: segid, translation, scores, mean, z_scores, z_mean, model_score, doc_id, nmt_output, word_probas.
Splits: train (7 k), validation (1 k), test (1 k) per configuration.
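
As a rough sketch of how the configurations and splits listed above might be accessed, the snippet below enumerates the seven configuration names and wraps a call to the Hugging Face `datasets` library. The repository id `wmt/wmt20_mlqe_task1` (taken from the page title), the plain ASCII hyphens in the configuration names, and the helper names `language_pair` and `load_split` are assumptions made for illustration.

```python
# Configuration names as listed above (assumed to use plain ASCII hyphens).
CONFIGS = ["en-de", "en-zh", "et-en", "ne-en", "ro-en", "ru-en", "si-en"]
SPLITS = ("train", "validation", "test")


def language_pair(config: str) -> tuple[str, str]:
    """Split a configuration name such as 'en-de' into (source, target)."""
    source, target = config.split("-")
    return source, target


def load_split(config: str, split: str = "train"):
    """Hypothetical helper: load one split of one configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset  # third-party: pip install datasets

    # Assumed repository id, taken from the page title.
    return load_dataset("wmt/wmt20_mlqe_task1", config, split=split)
```

Calling `load_split("en-de", "validation")` would download the data on first use, so it needs network access; `language_pair` works offline.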

Creation

Source: Wikipedia and Reddit; translated using fairseq NMT models; scored by professional translators using Direct Assessment.
Scoring: Each sentence receives at least three DA scores (0‑100).

Usage Considerations

License: Unknown.
Metrics: Pearson correlation between predicted scores and human DA.
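
The Pearson correlation named above can be computed with the standard library alone. The following is a minimal sketch; the function name `pearson` and the example values are illustrative and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up human DA means and system predictions, for illustration only.
human_da = [72.0, 45.5, 88.0, 30.0]
predicted = [0.70, 0.50, 0.90, 0.20]
r = pearson(predicted, human_da)
```

Because Pearson correlation is invariant to linear rescaling, predictions do not have to be on the same 0 to 100 scale as the human scores.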

Additional Information

Contributors: Thanks to @VictorSanh for adding the dataset.

Detailed File Information

File Sizes (bytes): en‑de: 4,539,012; en‑zh: 4,269,820; etc.
Download Sizes (bytes): en‑de: 3,293,699; en‑zh: 3,325,683; etc.

Feature Definitions

segid: int32 – segment identifier
translation: mapping of language code to string – source sentence and its machine translation
scores: list of float32 – individual DA scores
mean: float32 – average of the DA scores
z_scores: list of float32 – z‑standardized DA scores
z_mean: float32 – mean of the z‑scores
model_score: float32 – sentence‑level score from the NMT model
doc_id: string – document identifier
nmt_output: string – NMT system output
word_probas: list of float32 – word‑level probabilities from the NMT model
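
To make the field definitions concrete, here is a sketch of what one record might look like, together with a few sanity checks derived from the descriptions on this page (at least three DA scores, each between 0 and 100, and `mean` and `z_mean` equal to the averages of their lists). Every value in `record` is made up, the nested shape of `translation` and the list-valued fields are assumptions, and `check_record` is a hypothetical helper.

```python
from statistics import fmean

# Hypothetical en-de record; all values are invented for illustration.
record = {
    "segid": 0,
    "translation": {"en": "The cat sat on the mat.",
                    "de": "Die Katze saß auf der Matte."},
    "scores": [80.0, 75.0, 85.0],
    "mean": 80.0,
    "z_scores": [0.5, 0.25, 0.75],
    "z_mean": 0.5,
    "model_score": -0.25,
    "doc_id": "Example document",
    "nmt_output": "Die Katze saß auf der Matte .",
    "word_probas": [-0.1, -0.2, -0.3, -0.1, -0.2, -0.4, -0.1],
}


def check_record(rec, tol=1e-3):
    """Return a list of consistency problems found in one record."""
    problems = []
    scores = rec["scores"]
    if len(scores) < 3:
        problems.append("fewer than three DA scores")
    if any(not 0 <= s <= 100 for s in scores):
        problems.append("DA score outside the 0-100 range")
    # A tolerance absorbs float32 rounding in the stored averages.
    if scores and abs(fmean(scores) - rec["mean"]) > tol:
        problems.append("mean does not match scores")
    z_scores = rec["z_scores"]
    if z_scores and abs(fmean(z_scores) - rec["z_mean"]) > tol:
        problems.append("z_mean does not match z_scores")
    return problems


issues = check_record(record)
```

An empty list from `check_record` means the record is consistent with the descriptions above; it says nothing about how the z‑scores themselves were standardized.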

Topics

Machine Translation

Quality Assessment

Source

Organization: hugging_face

Created: Unknown
