Dataset asset · Open Source Community · Entity Recognition · Chinese Name Research
CCNC
CCNC is a large corpus of Chinese personal names containing 3,658,109 name samples drawn from the Name Encyclopedia and the Chinese Personal Names Corpus. The entries have been processed and annotated with pinyin, and the corpus is intended for Chinese name research and entity recognition.
Source
GitHub
Created
Jun 24, 2021
Updated
Jun 28, 2021
Dataset Overview
Basic Statistics
| Names | Surnames | Given names | Male (M) | Female (F) | Unknown gender |
|---|---|---|---|---|---|
| 3,658,109 | 808 | 710,594 | 2,054,134 | 1,509,650 | 94,325 |
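As a quick consistency check on the table above, the three gender columns should add up to the total number of names. The short sketch below copies the counts from the table (the variable names are illustrative) and derives each gender's share of the corpus.

```python
# Counts copied from the Basic Statistics table.
total_names = 3_658_109
by_gender = {"M": 2_054_134, "F": 1_509_650, "Unknown": 94_325}

# The three gender columns should partition the corpus exactly.
assert sum(by_gender.values()) == total_names

# Share of each gender label in the whole corpus.
shares = {label: count / total_names for label, count in by_gender.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```

Roughly 56% of the entries are labelled male, 41% female, and under 3% have no gender label.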
Pinyin Version
- Surnames are romanized with a custom Chinese Surname Phonetic Dictionary; given names are romanized with pypinyin.
- Three versions of the corpus are available for download: one Chinese-character-only version and two pinyin versions.
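A separate surname dictionary matters because some characters are read differently as surnames; 单, for example, is read "shan" as a surname rather than the more common "dan". The sketch below illustrates the two-step lookup under stated assumptions: `SURNAME_PINYIN` is a tiny hypothetical excerpt (the real dictionary's file format is not described here), `romanize_name` is an illustrative helper rather than the corpus author's code, and falling back to the general converter for surnames missing from the dictionary is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

try:
    from pypinyin import lazy_pinyin  # third-party package; returns tone-less syllables
except ImportError:  # keep the sketch importable without pypinyin
    lazy_pinyin = None

# Hypothetical excerpt of a surname-to-pinyin dictionary.
SURNAME_PINYIN: Dict[str, str] = {"单": "shan", "曾": "zeng", "王": "wang"}


def romanize_name(
    surname: str,
    given: str,
    surname_dict: Dict[str, str],
    given_to_pinyin: Callable[[str], List[str]],
) -> Tuple[str, str]:
    """Return (surname_pinyin, given_name_pinyin) for one segmented name."""
    if surname in surname_dict:
        # Surname readings come from the dedicated dictionary.
        surname_py = surname_dict[surname]
    else:
        # Assumed fallback: use the general converter for unknown surnames.
        surname_py = "".join(given_to_pinyin(surname))
    given_py = "".join(given_to_pinyin(given))
    return surname_py, given_py


if lazy_pinyin is not None:
    print(romanize_name("单", "田芳", SURNAME_PINYIN, lazy_pinyin))
```

Passing the converter as a parameter keeps the surname logic independent of any one romanization library, so a different converter or output style can be swapped in without touching the dictionary lookup.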
Sources
- Dataset origins:
  - Name Encyclopedia: contributes 2,513,097 entries.
  - Chinese Personal Names Corpus: contributes 1,145,012 entries.
- Processing details:
  - Split names that were originally unsegmented into surname and given name.
  - Removed approximately 300,000 duplicate entries.
  - Treated the same name with different genders as separate entries.
  - Entries of unknown gender come from the Chinese Personal Names Corpus.
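The deduplication rule described above can be sketched as follows: exact repeats are dropped, but the same name recorded with different genders is kept as distinct entries. The `(name, gender)` record layout and the `deduplicate` helper are assumptions made for illustration, not the corpus author's actual processing code.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (full name, gender label).
Record = Tuple[str, str]


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Drop repeated (name, gender) pairs while preserving first-seen order."""
    seen = set()
    unique: List[Record] = []
    for name, gender in records:
        key = (name, gender)  # gender is part of the key on purpose
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


sample = [("王伟", "M"), ("王伟", "M"), ("王伟", "F"), ("李娜", "F")]
print(deduplicate(sample))
# [('王伟', 'M'), ('王伟', 'F'), ('李娜', 'F')]
```

Keying on the pair rather than on the name alone is what keeps 王伟 (M) and 王伟 (F) as two entries.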
Chinese Surname Phonetic Dictionary
- Contains 1,606 Chinese surnames with their pinyin.
- 1,534 surnames and their pinyin are sourced from Mingba Baijia Xing; the remaining 72 were annotated manually by the author.
Train/Test/Predict Splits
- Code is provided to split the corpus into training, testing, and prediction sets, with a default ratio of 6:2:2.
- A pre-split archive of the full Chinese-character version is available for download.
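For readers who want to reproduce a 6:2:2 split on their own copy of the corpus, a minimal sketch is shown below. It is not the provided splitting code: the `split_corpus` name, the shuffle before splitting, and the fixed `seed` are all assumptions chosen to make the example reproducible.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(
    records: Sequence,
    ratios: Tuple[int, int, int] = (6, 2, 2),
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle records and cut them into train/test/predict parts by ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    total = sum(ratios)
    n = len(shuffled)
    train_end = n * ratios[0] // total
    test_end = train_end + n * ratios[1] // total

    # Any remainder from integer division ends up in the prediction set.
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


train, test, predict = split_corpus(range(10))
print(len(train), len(test), len(predict))  # 6 2 2
```

Integer division keeps the three parts disjoint and exhaustive even when the corpus size is not a multiple of ten.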