High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

ted-parallel-corpus-Chinese-English

A parallel corpus of TED talk transcripts that provides tokenized Chinese and English texts, vocabularies, and processing scripts. The dataset offers 10 M of high-quality bilingual text along with detailed vocabularies, and is suitable for linguistic research and machine-translation studies.

GitHub
