DATASET
Open Source Community
UN Parallel Corpora
A large multilingual United Nations document collection providing high‑quality parallel translations across multiple languages.
Updated 11/12/2024
UN Document Translator Dataset Overview
Description
- Dataset Name: UN Parallel Corpora
- Purpose: Fine‑tuning MarianMT models to support multilingual translation of United Nations documents.
- Features: Contains high‑quality multilingual parallel texts covering the United Nations' official languages.
Characteristics
- Multilingual Support: Supports translation among the six official UN languages (Arabic, Chinese, English, French, Russian, and Spanish).
- High‑Quality Parallel Text: Offers precise parallel translations suitable for formal, technical, and nuanced language.
- UN‑Specific Terminology: Covers terminology unique to the United Nations.
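As a small worked example of what "translation among the six official UN languages" implies, the sketch below counts the language pairs and translation directions. The ISO 639‑1 codes and the function names are illustrative assumptions, not identifiers taken from the dataset.

```python
from itertools import combinations, permutations

# ISO 639-1 codes for the six official UN languages (codes are an assumption;
# the dataset card does not say how languages are labeled).
UN_LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def language_pairs(languages):
    """Return every unordered pair of languages, one entry per bitext."""
    return list(combinations(languages, 2))


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


pairs = language_pairs(UN_LANGUAGES)
directions = translation_directions(UN_LANGUAGES)

print(len(pairs))       # 15
print(len(directions))  # 30
```

Six languages give 15 unordered pairs and 30 translation directions, so a fine‑tuning setup that covers all directions needs either many bilingual models or a multilingual one.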
Applications
- Model Fine‑tuning: Improves MarianMT models for higher accuracy and contextual awareness in UN document translation.
- Translation Services: Hosted on the Hugging Face platform for developers, linguists, and international organizations.
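The listing does not describe how the parallel text is prepared for fine‑tuning. As a minimal sketch, assuming the data arrives as line‑aligned sentences, the pairs could be turned into JSON Lines records with a nested `translation` field, a layout commonly used for translation data with Hugging Face tooling. The helper name `build_records` and the sample sentences are hypothetical and not taken from the corpus.

```python
import json


def build_records(src_lines, tgt_lines, src_lang, tgt_lang):
    """Pair line-aligned sentences, skipping pairs where either side is empty."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    records = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            records.append({"translation": {src_lang: src, tgt_lang: tgt}})
    return records


# Toy, made-up sentences for illustration only.
english = ["The meeting was called to order.", "", "The agenda was adopted."]
french = ["La séance est ouverte.", "", "L'ordre du jour est adopté."]

records = build_records(english, french, "en", "fr")

# One JSON object per line; ensure_ascii=False keeps accented characters readable.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```

Dropping empty lines on either side keeps the remaining pairs aligned, which matters because a single shifted line would mislabel every pair after it.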
Performance
- Translation Accuracy: Cosine similarity scores above 93% between model translations and native‑speaker reference translations on unseen data.
- Advantage: Reported to outperform human translations in blind tests.
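The listing does not say how the translations are turned into vectors before cosine similarity is computed. The sketch below shows only the similarity step itself, on two made‑up three‑dimensional vectors that stand in for whatever sentence representations were actually used. The function name and the vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a model translation and a reference translation.
model_vec = [0.9, 0.1, 0.4]
reference_vec = [0.8, 0.2, 0.5]

score = cosine_similarity(model_vec, reference_vec)
print(f"{score:.2%}")
```

A score near 100% means the two vectors point in almost the same direction. How well that tracks translation quality depends on the embedding model, which the listing does not name.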
Source
- Origin: United Nations Parallel Corpus provided by the UN.
- Citation: Ziemski, M., Junczys‑Dowmunt, M., and Pouliquen, B. (2016). The United Nations Parallel Corpus. Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
Topics
Multilingual Translation
United Nations Documents
Source
Organization: github
Created: 11/12/2024