UN Parallel Corpora

A large multilingual United Nations document collection providing high‑quality parallel translations across multiple languages.

Source: GitHub
Created: Nov 12, 2024
Updated: Nov 12, 2024
UN Document Translator Dataset Overview

Description

  • Dataset Name: UN Parallel Corpora
  • Purpose: Fine‑tuning MarianMT models to support multilingual translation of United Nations documents.
  • Features: Contains high‑quality multilingual parallel texts covering the United Nations' official languages.

Characteristics

  • Multilingual Support: Supports translation among the six official UN languages (Arabic, Chinese, English, French, Russian, and Spanish).
  • High‑Quality Parallel Text: Offers precise parallel translations suitable for formal, technical, and nuanced language.
  • UN‑Specific Terminology: Covers terminology specific to the United Nations, as used in its official documents.
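
To make "parallel text" concrete, here is a minimal sketch of pairing sentences from two line-aligned sources. It assumes each language is available as a stream of lines in which line N of one stream translates line N of the other. The function name `read_parallel` and the sample sentences are hypothetical and not taken from the dataset.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_parallel(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams."""
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter stream with None, which signals misalignment.
        if src is None or tgt is None:
            raise ValueError(f"streams differ in length at line {n}")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty after trimming whitespace.
        if src and tgt:
            yield src, tgt


# Tiny in-memory stand-in for two aligned files (hypothetical content).
english = ["The General Assembly,\n", "\n", "Recalling its resolution,\n"]
french = ["L'Assemblée générale,\n", "\n", "Rappelant sa résolution,\n"]
pairs = list(read_parallel(english, french))
```

With real data, the two lists could be replaced by open file handles, since the function accepts any iterable of strings.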

Applications

  • Model Fine‑tuning: Improves MarianMT models for higher accuracy and contextual awareness in UN document translation.
  • Translation Services: Hosted on the Hugging Face platform for developers, linguists, and international organizations.
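
The source does not describe its fine‑tuning pipeline. As a sketch only, one common layout for translation fine‑tuning data in Hugging Face tooling is JSON Lines, with each record holding a `translation` object keyed by language code. The helper `to_jsonl` and the sample pair below are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def to_jsonl(
    pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str
) -> Iterator[str]:
    """Serialize sentence pairs as one JSON object per line."""
    for src, tgt in pairs:
        record = {"translation": {src_lang: src, tgt_lang: tgt}}
        # ensure_ascii=False keeps non-Latin scripts (Arabic, Chinese, Russian) readable.
        yield json.dumps(record, ensure_ascii=False)


# Hypothetical English-French pair used only for illustration.
lines = list(to_jsonl([("Security Council", "Conseil de sécurité")], "en", "fr"))
```

Whether this layout matches the project's actual training setup is an assumption. Adjust the field names to whatever the chosen training script expects.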

Performance

  • Translation Accuracy: Reported cosine similarity above 93% between model translations and native‑speaker reference translations on unseen data.
  • Advantage: Reported to outperform human translations in blind tests.
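
The source does not say which vector representation the cosine similarity is computed over. The sketch below only illustrates the metric itself on two generic vectors. The vectors are made‑up placeholders for embeddings of a model translation and a reference translation, and `cosine_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for sentence embeddings (assumed, not from the source).
model_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]
score = cosine_similarity(model_vec, reference_vec)
```

A score near 1 means the two vectors point in almost the same direction. How well that reflects translation quality depends on the embedding model used, which the source leaves unspecified.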

Source

  • Origin: United Nations Parallel Corpus provided by the UN.
  • Citation:
    • Ziemski, M., Junczys‑Dowmunt, M., and Pouliquen, B. (2016). The United Nations Parallel Corpus. Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.