CCNC

CCNC is a large Chinese name corpus containing 3,658,109 name samples drawn from two sources: the Name Encyclopedia and the Chinese Personal Names Corpus. The entries have been cleaned and annotated with pinyin, and the corpus is intended for Chinese name research and entity recognition.

Source

github

Created

Jun 24, 2021

Updated

Jun 28, 2021

Dataset Overview

Basic Statistics

Names	Surnames	Given names	Male (M)	Female (F)	Unknown gender
3,658,109	808	710,594	2,054,134	1,509,650	94,325
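
As a quick sanity check on the figures above, the three gender counts should add up to the total number of names. The snippet below is a minimal sketch that verifies this and derives each group's share; the dictionary keys and the helper name `gender_shares` are illustrative and not part of the dataset.

```python
from typing import Dict

# Counts copied from the Basic Statistics table.
TOTAL_NAMES = 3_658_109
GENDER_COUNTS: Dict[str, int] = {
    "male": 2_054_134,
    "female": 1_509_650,
    "unknown": 94_325,
}


def gender_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each group's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# The gender breakdown should account for every name in the corpus.
assert sum(GENDER_COUNTS.values()) == TOTAL_NAMES

shares = gender_shares(GENDER_COUNTS)
```

Inspecting `shares` shows that unknown-gender entries make up only a small fraction of the corpus.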

Pinyin Version

Surname pinyin is taken from a custom Chinese Surname Phonetic Dictionary; given-name pinyin is generated with pypinyin.
Three versions of the corpus are available for download: a Chinese-character-only version and two pinyin versions.
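
The reason for a separate surname dictionary is that some characters are read differently as surnames than in ordinary text. The following is a minimal sketch of that two-step lookup, not the author's actual code: `SURNAME_PINYIN` is a hypothetical three-entry excerpt, `name_to_pinyin` and `pypinyin_given` are illustrative names, and the toneless, space-separated output format is an assumption. The only pypinyin call used is `lazy_pinyin`.

```python
from typing import Callable, Dict, List

# Hypothetical excerpt; the real dictionary is described as having 1,606 entries.
SURNAME_PINYIN: Dict[str, str] = {
    "单": "shan",      # usually read "dan" outside of surnames
    "仇": "qiu",       # usually read "chou" outside of surnames
    "欧阳": "ouyang",  # two-character compound surname
}


def pypinyin_given(text: str) -> List[str]:
    """Convert characters to toneless pinyin syllables using pypinyin."""
    from pypinyin import lazy_pinyin  # third-party; imported only when called
    return lazy_pinyin(text)


def name_to_pinyin(
    surname: str,
    given: str,
    surname_dict: Dict[str, str] = SURNAME_PINYIN,
    given_fn: Callable[[str], List[str]] = pypinyin_given,
) -> str:
    """Romanize a name: dictionary lookup for the surname, given_fn otherwise."""
    if surname in surname_dict:
        surname_py = surname_dict[surname]
    else:
        # Assumption: fall back to the generic converter for unlisted surnames.
        surname_py = "".join(given_fn(surname))
    return surname_py + " " + "".join(given_fn(given))
```

With pypinyin installed, `name_to_pinyin("单", "伟")` takes the surname reading from the dictionary and passes only the given name to pypinyin. Passing a different `given_fn` makes it possible to swap in another converter or to test the function without the dependency.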

Sources

Dataset origins:
- Name Encyclopedia: contributes 2,513,097 entries.
- Chinese Personal Names Corpus: contributes 1,145,012 entries.

Processing details:
- Separated surname and given name for entries that were originally unsegmented.
- Removed approximately 300,000 duplicate entries.
- Kept entries that share a name but differ in gender as separate entries.
- Unknown-gender entries come from the Chinese Personal Names Corpus.
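
The deduplication rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the original processing script: each record is assumed to be a `(name, gender)` pair with gender encoded as `"M"`, `"F"`, or `"U"`, and `deduplicate` is a hypothetical helper name. Using the pair as the key is what keeps same-name, different-gender records apart.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (name, gender); this layout is an assumption


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Drop exact (name, gender) repeats while preserving first-seen order."""
    seen: Set[Record] = set()
    unique: List[Record] = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


sample = [
    ("王芳", "F"),
    ("王芳", "F"),  # exact duplicate: removed
    ("王芳", "M"),  # same name, different gender: kept
    ("李明", "U"),
]
cleaned = deduplicate(sample)
```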

Chinese Surname Phonetic Dictionary

The dictionary contains 1,606 Chinese surnames with their pinyin.
Of these, 1,534 surnames and their pronunciations come from Mingba Baijia Xing; the remaining 72 were annotated manually by the author.
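
One possible use of a surname list like this is splitting an unsegmented full name into surname and given name. The sketch below uses longest-match-first lookup so that compound surnames win over their single-character prefixes. This is an assumed technique for illustration; the source does not state how the corpus was actually segmented, and `split_name` and the small `surnames` set are hypothetical.

```python
from typing import Optional, Set, Tuple


def split_name(
    full_name: str, surnames: Set[str], max_surname_len: int = 2
) -> Optional[Tuple[str, str]]:
    """Split a full name into (surname, given name), longest surname first.

    Returns None when no prefix of the name is a known surname. At least one
    character is always left for the given name.
    """
    longest = min(max_surname_len, len(full_name) - 1)
    for length in range(longest, 0, -1):
        candidate = full_name[:length]
        if candidate in surnames:
            return candidate, full_name[length:]
    return None


# Hypothetical surname set standing in for the keys of the dictionary.
surnames = {"欧", "欧阳", "张"}
```

Here `split_name("欧阳修", surnames)` prefers the compound surname `欧阳` over the single character `欧`, while a name whose first characters are not in the set yields `None` and can be set aside for manual review.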

Train/Test/Predict Splits

Code is provided to split the corpus into training, testing, and prediction sets, with a default ratio of 6:2:2.
A pre-split archive of the full Chinese-character version is also available for download.
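
For readers who want to reproduce a similar split themselves, here is a minimal sketch of a shuffled 6:2:2 partition. It is not the splitting code that ships with the corpus: the function name `split_corpus`, the `seed` parameter, and the choice to give any rounding remainder to the prediction set are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_corpus(
    items: Sequence[T],
    ratios: Tuple[int, int, int] = (6, 2, 2),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into (train, test, predict) by ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_test = len(shuffled) * ratios[1] // total

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    predict = shuffled[n_train + n_test:]  # receives any rounding remainder
    return train, test, predict


train, test, predict = split_corpus(list(range(10)))
```

Fixing the seed keeps the partition stable across runs, which matters when results on the test set are compared between experiments.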
