
wmt/wmt20_mlqe_task1

This dataset comes from Task 1 of the WMT20 Multilingual Quality Estimation (MLQE) shared task, which estimates the quality of neural machine translation (NMT) output without reference translations. It covers seven language directions (e.g., en‑de, en‑zh), with source sentences drawn from Wikipedia and Reddit. Professional translators annotated each sentence with Direct Assessment (DA) scores from 0 to 100. Each configuration is split into training (7 k), validation (1 k), and test (1 k) sets, and the dataset is intended for research on automatic quality estimation of NMT systems.

Updated 4/4/2024
hugging_face

Description

Dataset Overview

Dataset Name

  • Name: WMT20 – MultiLingual Quality Estimation (MLQE) Task 1
  • Alias: MLQE‑Task1

Summary

  • Purpose: Evaluate neural‑machine‑translation output quality without reference translations.
  • Content: Multilingual translation data primarily from Wikipedia, with some from Reddit.
  • Languages: German, English, Estonian, Nepali, Romanian, Russian, Sinhalese, Chinese.

Structure

  • Configurations: en‑de, en‑zh, et‑en, ne‑en, ro‑en, ru‑en, si‑en
  • Features: segid, translation, scores, mean, z_scores, z_mean, model_score, doc_id, nmt_output, word_probas.
  • Splits: train (7 k), validation (1 k), test (1 k) per configuration.
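
The configurations and splits listed above map onto the arguments of the Hugging Face `datasets` loader. The following is a minimal sketch, assuming the `datasets` library is installed, that the dataset is published under the `wmt/wmt20_mlqe_task1` identifier shown at the top of this page, and that configuration names use plain ASCII hyphens. The helper names `config_name` and `load_mlqe` are hypothetical and not part of any library.

```python
# Language-pair configurations and splits listed in the Structure section.
# ASCII hyphens are an assumption about how the names are spelled on the hub.
CONFIGS = ("en-de", "en-zh", "et-en", "ne-en", "ro-en", "ru-en", "si-en")
SPLITS = ("train", "validation", "test")


def config_name(source_lang: str, target_lang: str) -> str:
    """Build a configuration name and check it against the listed pairs."""
    name = f"{source_lang}-{target_lang}"
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    return name


def load_mlqe(source_lang: str, target_lang: str, split: str = "train"):
    """Load one split of one language pair (hypothetical helper, needs network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so config_name() works even without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "wmt/wmt20_mlqe_task1",
        config_name(source_lang, target_lang),
        split=split,
    )
```

Calling `load_mlqe("en", "de", "validation")` would then request the en-de validation split. The direction matters: `config_name("de", "en")` raises a `ValueError` because only en-de is listed.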

Creation

  • Source: Wikipedia and Reddit; translated using fairseq NMT models; scored by professional translators using Direct Assessment.
  • Scoring: Each sentence receives at least three DA scores (0‑100).
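
The scoring rule above (at least three DA scores per sentence, each between 0 and 100) can be checked mechanically before averaging. This is a small sketch under those stated constraints only. The function name `mean_da` is hypothetical, and the actual annotation pipeline may have applied further filtering that this page does not describe.

```python
from typing import Sequence

MIN_ANNOTATIONS = 3          # "at least three DA scores"
DA_MIN, DA_MAX = 0.0, 100.0  # DA score range stated above


def mean_da(scores: Sequence[float]) -> float:
    """Validate one sentence's DA scores and return their arithmetic mean."""
    if len(scores) < MIN_ANNOTATIONS:
        raise ValueError(f"expected at least {MIN_ANNOTATIONS} scores, got {len(scores)}")
    for score in scores:
        if not DA_MIN <= score <= DA_MAX:
            raise ValueError(f"DA score {score} is outside [{DA_MIN}, {DA_MAX}]")
    return sum(scores) / len(scores)
```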

Usage Considerations

  • License: Unknown.
  • Metrics: Pearson correlation between predicted scores and human DA.
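
The Pearson correlation named above can be computed with the standard library alone. The sketch below is a plain implementation of the textbook formula, not the official evaluation script. Which field serves as the human score (for example `mean` or `z_mean`) is left to the caller, and `pearson_r` is a hypothetical helper name.

```python
import math
from typing import Sequence


def pearson_r(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Pearson correlation between predicted scores and human DA scores."""
    if len(predicted) != len(human):
        raise ValueError("both sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two pairs are required")

    mean_p = sum(predicted) / n
    mean_h = sum(human) / n

    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)

    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_p * var_h)
```

A perfectly linear relationship gives a value of 1 (or -1 when the trend is reversed), so higher values mean the predicted scores track human judgments more closely.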

Additional Information

  • Contributors: Thanks to @VictorSanh for adding the dataset.

Detailed File Information

  • File Sizes: (bytes) – en‑de: 4 539 012, en‑zh: 4 269 820, etc.
  • Download Sizes: (bytes) – en‑de: 3 293 699, en‑zh: 3 325 683, etc.

Feature Definitions

  • segid: int32 – segment identifier
  • translation: strings – source and target text, one string per language
  • scores: list of float32 – individual DA scores
  • mean: float32 – average of the DA scores
  • z_scores: list of float32 – z‑standardized DA scores
  • z_mean: float32 – mean of the z‑scores
  • model_score: float32 – model‑predicted score
  • doc_id: string – document identifier
  • nmt_output: string – NMT system output
  • word_probas: list of float32 – word‑level probabilities
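
The feature list above can be mirrored as a typed record, which makes simple consistency checks easy to write. This is a sketch under assumptions: `translation` is modeled as a dictionary keyed by language code, the list-valued fields are plain Python lists, and every value in `EXAMPLE` is invented for illustration rather than taken from the dataset. The names `MLQERecord`, `means_are_consistent`, and `EXAMPLE` are hypothetical.

```python
from typing import Dict, List, TypedDict


class MLQERecord(TypedDict):
    segid: int
    translation: Dict[str, str]  # assumption: {"<src code>": ..., "<tgt code>": ...}
    scores: List[float]
    mean: float
    z_scores: List[float]
    z_mean: float
    model_score: float
    doc_id: str
    nmt_output: str
    word_probas: List[float]


def means_are_consistent(record: MLQERecord, tol: float = 1e-3) -> bool:
    """Check that mean and z_mean match the averages of their score lists."""
    if not record["scores"] or not record["z_scores"]:
        return False
    raw_avg = sum(record["scores"]) / len(record["scores"])
    z_avg = sum(record["z_scores"]) / len(record["z_scores"])
    return abs(raw_avg - record["mean"]) <= tol and abs(z_avg - record["z_mean"]) <= tol


# Invented values for illustration only; not an actual row of the dataset.
EXAMPLE: MLQERecord = {
    "segid": 0,
    "translation": {"en": "Hello world", "de": "Hallo Welt"},
    "scores": [60.0, 70.0, 80.0],
    "mean": 70.0,
    "z_scores": [-0.5, 0.0, 0.5],
    "z_mean": 0.0,
    "model_score": -0.25,
    "doc_id": "doc-0",
    "nmt_output": "Hallo Welt",
    "word_probas": [-0.1, -0.2],
}
```

The tolerance `tol` allows for float32 rounding in the stored averages. Its default here is an arbitrary choice.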

Topics

Machine Translation
Quality Assessment

Source

Organization: hugging_face

Created: Unknown
