Dataset asset · Open Source Community · Machine Translation · Parallel Corpora

ted-parallel-corpus-Chinese-English

A parallel corpus of TED talk transcripts, providing tokenized Chinese and English texts, vocabularies, and processing scripts. It offers 10 M of high-quality bilingual text with detailed vocabularies, suitable for linguistic research and machine-translation studies.

Source
GitHub
Created
Dec 20, 2019
Updated
Feb 11, 2022
Signals
367 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

ted‑parallel‑corpus‑Chinese‑English

Description

This dataset contains a parallel corpus derived from TED talk transcripts, covering both Chinese and English.

Contents

  • English Text: Tokenized, high-quality; total size 10 M.
  • Chinese Text: Tokenized with Jieba; total size 10 M.
  • Vocabularies: 43 k English tokens, 62 k Chinese tokens.
  • Processing Scripts: Python-based crawlers (spiders) and utilities (currently uncommented).

Sample Data

  • English Vocabulary: Includes special symbols <unk>, <s>, </s> and common words such as autotroph, monochromatic.
  • Chinese Vocabulary: Includes special symbols <unk>, <s>, </s> and common terms like “修理铺” (repair shop), “随机存取” (random access).
  • English Example: Well you can see where this is going.
  • Chinese Example: 你可以猜到事情是怎么发展的。

Characteristics

  • Sentence‑aligned bilingual texts, ideal for language learning and translation research.
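Because the corpus is sentence-aligned, line i of the English file corresponds to line i of the Chinese file, and the vocabularies begin with the special symbols listed above. The sketch below shows one way to pair aligned lines and build a vocabulary with those symbols; the file layout and helper names are assumptions for illustration, not the repository's actual scripts, so check the repository before relying on them.

```python
# Minimal sketch (assumed layout, not the repository's own scripts):
# pair sentence-aligned lines and build a vocabulary that starts with
# the special symbols <unk>, <s>, </s> listed in the dataset card.
from collections import Counter

SPECIALS = ["<unk>", "<s>", "</s>"]  # special symbols from both vocabularies

def build_vocab(sentences, specials=SPECIALS):
    """Count whitespace-separated tokens and prepend the special symbols."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return specials + [tok for tok, _ in counts.most_common()]

# Tiny in-memory stand-in for the tokenized, line-aligned corpus files.
en_lines = ["Well you can see where this is going ."]
zh_lines = ["你 可以 猜到 事情 是 怎么 发展 的 。"]

# Sentence alignment means zipping the two files yields translation pairs.
pairs = list(zip(en_lines, zh_lines))
en_vocab = build_vocab(en_lines)
```

With real data, `en_lines` and `zh_lines` would come from reading the two corpus files line by line; the zip-based pairing works only because the texts are aligned one sentence per line.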