JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.


Browse by Category

JESC

Machine Translation
Japanese‑English Corpus

The JESC dataset is a Japanese‑English subtitle corpus created by Stanford University, Google Brain, and Rakuten Institute of Technology. Sourced from movie and TV subtitles on the web, it is one of the largest free EN‑JA corpora, focusing on conversational language. It contains 2.8 million sentence pairs covering everyday language, slang, instructions, and narratives. Licensed under CC‑BY‑4.0, it includes pre‑processed data with tokenized train/dev/test splits, primarily intended for translation tasks.

hugging_face
View Details

haoranxu/WMT22-Test

Machine Translation

The dataset provides configurations for multiple language pairs, including cs‑en (Czech‑English), de‑en (German‑English), en‑cs, en‑de, en‑is (English‑Icelandic), en‑ru, en‑zh, is‑en, ru‑en, and zh‑en. For each configuration, features consist of string columns for the two languages, and a test split with specified byte size and number of examples. The dataset is intended for machine translation tasks.

hugging_face
View Details
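As a rough illustration of the two-column layout described in this card, here is a minimal sketch; the field names ("cs"/"en") and the mock example are assumptions standing in for a real download:

```python
# The card above describes each configuration as two string columns, one
# per language. A real load would use the Hugging Face `datasets` library:
#   from datasets import load_dataset
#   ds = load_dataset("haoranxu/WMT22-Test", "cs-en", split="test")
# The mock example below mirrors that assumed schema, so nothing is downloaded.
example = {"cs": "Ahoj světe", "en": "Hello world"}

def to_pair(ex, src, tgt):
    """Pull a (source, target) tuple out of a two-column example."""
    return ex[src], ex[tgt]

src_text, tgt_text = to_pair(example, "cs", "en")
print(src_text, "->", tgt_text)
```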

ted-parallel-corpus-Chinese-English

Parallel Corpora
Machine Translation

A parallel corpus of TED talk transcripts, providing tokenized Chinese and English texts, vocabularies, and processing scripts. The dataset offers roughly 10 M of high‑quality bilingual text together with detailed vocabularies, suitable for linguistic research and machine‑translation studies.

github
View Details

Lauler/flan-norwegian

Natural Language Processing
Machine Translation

This dataset includes multiple feature fields, such as inputs, targets, task, and index, along with normalized and back‑translated variants of the input and target fields. The dataset is split into training, validation, and test sets containing 2,771,562, 23,860, and 734,178 examples respectively. The total dataset size is 12,154,335,861 bytes, with a download size of 5,880,786,502 bytes.

hugging_face
View Details

wmt/wmt16

Machine Translation
Natural Language Processing

This is a translation dataset based on statmt.org data, supporting multiple language pairs including cs‑en, de‑en, fi‑en, ro‑en, ru‑en and tr‑en. The dataset size ranges from 10 MB to 100 MB and is primarily used for translation tasks. The creators did not provide annotations; the data originates from several extended corpora such as europarl_bilingual, news_commentary, setimes and un_multi. The download size is 1.69 GB, the generated dataset size is 297.28 MB, and total disk usage is 1.99 GB.

hugging_face
View Details

qgyd2021/language_identification

Language Identification
Machine Translation

This collection bundles several sub‑datasets intended for language identification, multilingual corpus analysis, and machine translation. It covers many languages, including Chinese, English, Japanese, German, French, and Spanish. The sub‑datasets include the Multilingual Amazon Reviews Corpus (MARC), the cross‑lingual sentence understanding dataset XNLI, the Nordic language identification dataset (nordic_langid), and the ParaPat parallel corpus of patent abstracts, among others. These datasets are widely used in natural language processing, especially for multilingual text classification, language identification, and machine translation.

hugging_face
View Details

wmt/wmt20_mlqe_task1

Machine Translation
Quality Assessment

This dataset is part of the WMT20 Multilingual Quality Estimation (MLQE) task, used to evaluate the quality of neural machine translation outputs without reference translations. It includes translation pairs for several language directions (e.g., en‑de, en‑zh) sourced from Wikipedia and Reddit. Each sentence is annotated with Direct Assessment (DA) scores ranging from 0 to 100 by professional translators. The dataset is split into training, validation, and test sets (7,000 training, 1,000 validation, and 1,000 test examples per configuration) and is intended for research on automatic quality estimation of NMT systems.

hugging_face
View Details

JParaCrawl

Machine Translation
Natural Language Processing

JParaCrawl is the largest publicly available English‑Japanese parallel corpus, created by NTT. It was constructed through large‑scale web crawling and automatic alignment of parallel sentences. The dataset ships as a single large training split in which each instance is an English‑Japanese sentence pair. It is distributed under its own license.

hugging_face
View Details

alpaca-chinese-dataset

Instruction Fine‑tuning
Machine Translation

This dataset comprises a mixed Chinese‑English corpus designed for bilingual fine‑tuning and ongoing data correction. The original Alpaca English dataset contains numerous issues, such as erroneous mathematical samples, mislabeled output fields, and misaligned tags. This dataset rectifies those problems, translates the corrected samples into Chinese, and manually rewrites instructions where literal translation leads to loss of rhyme, tense inconsistencies, or other nuances. It focuses on: (1) fixing problems in the original English data, (2) translating into Chinese, (3) adjusting samples affected by direct translation, (4) leaving code and special outputs unchanged, and (5) aligning special tags or refusal outputs.

github
View Details

wmt/wmt14

Machine Translation
Natural Language Processing

The WMT14 dataset is a multilingual dataset for machine translation tasks, containing translation pairs for several language pairs such as Czech‑English (cs‑en), German‑English (de‑en), French‑English (fr‑en), Hindi‑English (hi‑en) and Russian‑English (ru‑en). Dataset size varies from a few MB to several tens of GB depending on the language pair. The dataset comprises training, validation, and test splits; each language pair includes a `translation` field containing the source and target texts. It is built from statmt.org data and allows users to customize language pairs and data sources.

hugging_face
View Details
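A minimal sketch of the `translation` field described in this card; the record below is a hand‑written mock mirroring that schema, and the commented `load_dataset` call shows how the real data would be fetched:

```python
# Each WMT14 example carries a "translation" dict keyed by language code.
# With the Hugging Face `datasets` library the real data would be loaded as:
#   from datasets import load_dataset
#   ds = load_dataset("wmt/wmt14", "fr-en", split="validation")
# The mock record below mirrors that schema, so no download is required.
record = {"translation": {"fr": "Bonjour le monde", "en": "Hello world"}}

source = record["translation"]["fr"]   # source-language text
target = record["translation"]["en"]   # target-language text
print(f"{source} -> {target}")
```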

TEDtalk-en-ja

Machine Translation
Japanese-English Translation

This dataset comprises Japanese‑English translation pairs extracted from the Multitarget TED Talks (MTTT) collection of TED‑talk transcripts. The data originates from WIT³ and is used in the IWSLT machine‑translation evaluation campaign. It contains a single training split with 158,535 examples, each consisting of an English sentence and a Japanese sentence. The dataset is released under the CC BY‑NC‑ND 4.0 license, which requires acknowledging TED's contribution.

hugging_face
View Details

wmt/wmt18

Machine Translation
Natural Language Processing

The WMT18 dataset is a multilingual machine‑translation corpus containing parallel data for many language pairs, such as Czech‑English, German‑English, Estonian‑English, Finnish‑English, Kazakh‑English, Russian‑English, Turkish‑English, and Chinese‑English. The dataset is divided into training, validation, and test splits, with varying sizes per language pair. Sources include Europarl, News Commentary, OPUS ParaCrawl, SETimes, and UN Multi. Its purpose is to support MT research, allowing users to select arbitrary language pairs and subsets to create custom corpora.

hugging_face
View Details

LTRC Hindi-Telugu Parallel Corpus

Machine Translation
Low‑Resource Languages

We provide a Hindi‑Telugu parallel corpus spanning several technical domains (natural sciences, computer science, law, healthcare, and the general domain). The corpus contains 700 K parallel sentences: 535 K were created through extraction, alignment, manual translation, and iterative back‑translation with post‑editing, and 165 K were collected from the public domain. We report comparative evaluations of the corpus's representativeness and diversity. The corpus is pre‑processed for machine translation; we trained a neural MT system and report state‑of‑the‑art baselines on several domains and benchmarks, defining a new task of domain‑specific MT for low‑resource language pairs such as Hindi‑Telugu. The 535 K curated corpus is freely available for non‑commercial research and is, to our knowledge, the largest carefully curated, publicly available Hindi‑Telugu domain parallel corpus.

github
View Details

pranjali97/labelled_vi_ko_raw_text

Machine Translation
Text Classification

The labelled_vi_ko_raw_text dataset includes three primary features: src (source text), tgt (target text), and classifier_labels (classification labels). It is intended primarily for training and contains 40,000 samples, with a total data size of 9,844,626 bytes and a download size of 5,466,676 bytes.

hugging_face
View Details
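A small sketch of working with the three columns this card lists (src, tgt, classifier_labels); the sample rows are invented placeholders, and treating label 1 as the "keep" class is an assumption:

```python
# Mock rows mirroring the card's schema: src (source text), tgt (target
# text), and classifier_labels (a classification label per pair).
samples = [
    {"src": "xin chào", "tgt": "안녕하세요", "classifier_labels": 1},
    {"src": "tạm biệt", "tgt": "잘 가", "classifier_labels": 0},
]

# Keep only the pairs whose label marks them as usable (assumed: label == 1).
parallel = [(s["src"], s["tgt"]) for s in samples if s["classifier_labels"] == 1]
print(parallel)
```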