Explore high-quality datasets for your AI and machine learning projects.
The JESC dataset is a Japanese‑English subtitle corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Sourced from movie and TV subtitles on the web, it is one of the largest free EN‑JA corpora, focusing on conversational language. It contains 2.8 million sentence pairs covering everyday language, slang, instructions, and narratives. Licensed under CC‑BY‑4.0, it includes pre‑processed data with tokenized train/dev/test splits, primarily intended for translation tasks.
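A minimal way to consume the pre‑processed splits is to read them line by line. The sketch below assumes the tab‑separated English/Japanese layout and the train/dev/test file names of the official release; adjust paths and parsing if your copy differs.

```python
# Minimal sketch for reading a JESC split file; assumes one
# tab-separated English/Japanese pair per line (official layout).
def read_jesc(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            en, ja = line.rstrip("\n").split("\t", 1)
            pairs.append({"en": en, "ja": ja})
    return pairs

train_pairs = read_jesc("split/train")  # path from the official archive
print(len(train_pairs), train_pairs[0])
```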
The dataset provides configurations for multiple language pairs, including cs‑en (Czech‑English), de‑en (German‑English), en‑cs, en‑de, en‑is (English‑Icelandic), en‑ru, en‑zh, is‑en, ru‑en, and zh‑en. Each configuration exposes two string columns, one per language, and ships a test split whose byte size and example count are documented in the dataset metadata. The dataset is intended for machine translation tasks.
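Because each configuration documents its test split's byte size and example count, that metadata can be inspected without downloading anything, as in this hedged sketch using the Hugging Face `datasets` library; `DATASET_ID` is a placeholder for this corpus's actual hub identifier.

```python
# Hedged sketch: reading split metadata for one configuration.
# "DATASET_ID" is a placeholder for the real hub identifier; this
# works when the dataset card ships precomputed split metadata.
from datasets import load_dataset_builder

builder = load_dataset_builder("DATASET_ID", "de-en")
test_info = builder.info.splits["test"]
print(test_info.num_bytes, test_info.num_examples)
```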
A parallel corpus of TED talk transcripts, providing tokenized Chinese and English texts, vocabularies, and processing scripts. It offers roughly 10 M of high‑quality bilingual text together with detailed vocabularies, making it suitable for linguistic research and machine‑translation studies.
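If only the released vocabularies are needed, a plain file reader suffices. The sketch below assumes the common one‑token‑per‑line layout and uses a hypothetical file name, since the exact format is defined by the corpus's own processing scripts.

```python
# Hedged sketch: loading a released vocabulary file, assuming the
# common one-token-per-line layout. "vocab.zh" is a hypothetical name.
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

zh_vocab = load_vocab("vocab.zh")
print(len(zh_vocab), zh_vocab[:5])
```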
This dataset includes multiple feature fields (inputs, targets, task, and index), as well as normalized and back‑translation‑processed variants of the input and target fields. It is split into training, validation, and test sets containing 2,771,562, 23,860, and 734,178 examples respectively. The total dataset size is 12,154,335,861 bytes (about 12.2 GB), with a download size of 5,880,786,502 bytes (about 5.9 GB).
This is a translation dataset based on statmt.org data, supporting multiple language pairs including cs‑en, de‑en, fi‑en, ro‑en, ru‑en and tr‑en. The dataset size ranges from 10 MB to 100 MB and is primarily used for translation tasks. The creators did not provide annotations; the data originates from several extended corpora such as europarl_bilingual, news_commentary, setimes and un_multi. The download size is 1.69 GB, the generated dataset size is 297.28 MB, and total disk usage is 1.99 GB.
This collection contains multiple sub‑datasets, used mainly for language identification, multilingual corpus analysis, and machine translation. It covers many languages, including Chinese, English, Japanese, German, French, and Spanish. The specific datasets include the Multilingual Amazon Reviews Corpus (MARC), the Cross‑lingual Natural Language Inference corpus (XNLI), the Nordic language identification dataset (nordic_langid), and the parallel corpus of patent abstracts (ParaPat). These datasets are widely used in natural language processing, particularly for multilingual text classification, language identification, and machine translation.
This dataset is part of the WMT20 Multilingual Quality Estimation (MLQE) task, used to evaluate the quality of neural machine translation outputs without reference translations. It includes translation pairs for several language directions (e.g., en‑de, en‑zh) sourced from Wikipedia and Reddit. Each sentence is annotated with Direct Assessment (DA) scores ranging from 0 to 100 by professional translators. The dataset is split into training, validation, and test sets (7 k training, 1 k validation, 1 k test per configuration) and is intended for research on automatic quality estimation of NMT systems.
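The sentence‑level DA data is hosted on the Hugging Face Hub; the sketch below assumes the `wmt20_mlqe_task1` identifier and the per‑configuration splits described above, so verify field names against the dataset card before relying on them.

```python
# Hedged sketch: loading one language direction of the WMT20 MLQE
# sentence-level DA data. The hub ID and field names should be
# verified against the dataset card.
from datasets import load_dataset

mlqe = load_dataset("wmt20_mlqe_task1", "en-de")
print(mlqe["train"].num_rows)   # expected: 7,000 per the split sizes above
print(mlqe["train"][0])         # one annotated translation pair
```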
JParaCrawl is the largest publicly available English‑Japanese parallel corpus, created by NTT. It was constructed via large‑scale web crawling and automatic alignment of parallel sentences. The corpus ships as a single large training split, and each data instance contains an English‑Japanese sentence pair. The dataset is distributed under its own custom license.
This dataset comprises a mixed Chinese‑English corpus designed for bilingual fine‑tuning and ongoing data correction. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset rectifies those problems, translates the corrected samples into Chinese, and manually rewrites instructions where literal translation leads to loss of rhyme, tense inconsistencies, or other nuances. It focuses on: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.
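Since the corpus derives from Alpaca, its records presumably follow the standard instruction/input/output layout; the sketch below is illustrative only, with invented values.

```python
# Hedged sketch of an Alpaca-style record as this corpus presumably
# stores it; the field names follow the original Alpaca layout and
# the values here are made up for illustration.
record = {
    "instruction": "Translate the following sentence into Chinese.",
    "input": "The weather is nice today.",
    "output": "今天天气很好。",
}
```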
The WMT14 dataset is a multilingual dataset for machine translation tasks, containing translation pairs for several language pairs such as Czech‑English (cs‑en), German‑English (de‑en), French‑English (fr‑en), Hindi‑English (hi‑en) and Russian‑English (ru‑en). Dataset size varies from a few MB to several tens of GB depending on the language pair. The dataset comprises training, validation, and test splits; each language pair includes a `translation` field containing the source and target texts. It is built from statmt.org data and allows users to customize language pairs and data sources.
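A minimal loading sketch with the Hugging Face `datasets` library, assuming the `wmt14` hub identifier; each example stores both sides of the pair under a single `translation` dict.

```python
# Minimal sketch: loading the de-en pair of WMT14 and reading the
# paired texts from the `translation` dict feature.
from datasets import load_dataset

ds = load_dataset("wmt14", "de-en", split="validation")
pair = ds[0]["translation"]            # {"de": "...", "en": "..."}
print(pair["de"], "->", pair["en"])
```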
This dataset comprises Japanese‑English translation pairs extracted from the Multitarget TED Talks (MTTT) dataset, which is built from TED talk transcripts. The data originates from WIT³ and has been used in the IWSLT machine‑translation evaluation campaign. It contains a single training split with 158,535 examples, each consisting of an English sentence and its Japanese counterpart. The dataset is released under the CC BY‑NC‑ND 4.0 license, which requires acknowledging TED's contribution.
The WMT18 dataset is a multilingual machine‑translation corpus containing parallel data for many language pairs, such as Czech‑English, German‑English, Estonian‑English, Finnish‑English, Kazakh‑English, Russian‑English, Turkish‑English, and Chinese‑English. The dataset is divided into training, validation, and test splits, with varying sizes per language pair. Sources include Europarl, News Commentary, OPUS ParaCrawl, SETimes, and UN Multi. Its purpose is to support MT research, allowing users to select arbitrary language pairs and subsets to create custom corpora.
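Because the corpus exposes one configuration per language pair, the available pairs can be listed programmatically before loading one; a hedged sketch assuming the `wmt18` hub identifier:

```python
# Hedged sketch: listing the language-pair configurations of WMT18
# and then loading one of them (hub ID "wmt18" assumed).
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("wmt18"))   # e.g. ['cs-en', 'de-en', ...]
ds = load_dataset("wmt18", "zh-en", split="validation")
print(ds[0]["translation"]["zh"])
```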
We provide a Hindi‑Telugu parallel corpus spanning several technical domains (natural sciences, computer science, law, and healthcare) plus a general domain. The corpus contains 700 K parallel sentences: 535 K were created through extraction, alignment, manual translation, and iterative back‑translation with post‑editing, while 165 K were collected from the public domain. We report comparative evaluations of the corpus's representativeness and diversity. The corpus is pre‑processed for machine translation; we trained a neural MT system on it and report state‑of‑the‑art baselines on several domains and benchmarks. This defines a new task of domain‑specific MT for low‑resource language pairs such as Hindi‑Telugu. The 535 K curated portion is freely available for non‑commercial research and is, to our knowledge, the largest carefully curated, publicly available Hindi‑Telugu domain‑parallel corpus.
The dataset named labelled_vi_ko_raw_text includes three primary features: src (source text), tgt (target text), and classifier_labels (classification labels). It is intended primarily for training and contains 40,000 samples, with a total data size of 9,844,626 bytes and a download size of 5,466,676 bytes.
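A quick schema check after loading, assuming the dataset is available under its name on the Hugging Face Hub (both the identifier and the label type are assumptions; check the dataset card):

```python
# Hedged sketch: verifying the three features named above. The hub
# identifier is assumed from the dataset name; adjust to the real path.
from datasets import load_dataset

ds = load_dataset("labelled_vi_ko_raw_text", split="train")
print(ds.features)   # expect: src, tgt, classifier_labels
print(ds[0])         # one sample with all three fields
```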