Dataset asset · Open Source Community · Machine Translation · Parallel Corpora

ted-parallel-corpus-Chinese-English

A parallel corpus of TED talk transcripts, providing tokenized Chinese and English texts, vocabularies, and processing scripts. It offers 10 M of high-quality bilingual text with detailed vocabularies, suitable for linguistic research and machine-translation studies.

Source
GitHub
Created
Dec 20, 2019
Updated
Feb 11, 2022
Signals
367 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

ted‑parallel‑corpus‑Chinese‑English

Description

This dataset contains a parallel corpus derived from TED talk transcripts, covering both Chinese and English.

Contents

  • English Text: Tokenized, high-quality; total size 10 M.
  • Chinese Text: Tokenized with Jieba; total size 10 M.
  • Vocabularies: 43 k English tokens, 62 k Chinese tokens.
  • Processing Scripts: Python-based crawlers (spiders) and utilities (currently uncommented).

Sample Data

  • English Vocabulary: Includes special symbols <unk>, <s>, </s> and common words such as autotroph, monochromatic.
  • Chinese Vocabulary: Includes special symbols <unk>, <s>, </s> and common terms like “修理铺” (repair shop), “随机存取” (random access).
  • English Example: Well you can see where this is going.
  • Chinese Example: 你可以猜到事情是怎么发展的。

Characteristics

  • Sentence‑aligned bilingual texts, ideal for language learning and translation research.
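Because the corpus is sentence-aligned, line i of the English file corresponds to line i of the Chinese file, and the vocabularies begin with the special symbols listed above. The sketch below shows one way to pair aligned lines and build a vocabulary with those symbols; the file layout and helper names are assumptions for illustration, not the repository's actual scripts, so check the repository before relying on them.

```python
# Minimal sketch (assumed layout, not the repository's own scripts):
# pair sentence-aligned lines and build a vocabulary that starts with
# the special symbols <unk>, <s>, </s> listed in the dataset card.
from collections import Counter

SPECIALS = ["<unk>", "<s>", "</s>"]  # special symbols from both vocabularies

def build_vocab(sentences, specials=SPECIALS):
    """Count whitespace-separated tokens and prepend the special symbols."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return specials + [tok for tok, _ in counts.most_common()]

# Tiny in-memory stand-in for the tokenized, line-aligned corpus files.
en_lines = ["Well you can see where this is going ."]
zh_lines = ["你 可以 猜到 事情 是 怎么 发展 的 。"]

# Sentence alignment means zipping the two files yields translation pairs.
pairs = list(zip(en_lines, zh_lines))
en_vocab = build_vocab(en_lines)
```

With real data, `en_lines` and `zh_lines` would come from reading the two corpus files line by line; the zip-based pairing works only because the texts are aligned one sentence per line.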