CCNC
CCNC is a large corpus of Chinese personal names containing 3,658,109 entries, drawn from the Name Encyclopedia and the Chinese Personal Names Corpus. The entries have been cleaned and annotated with pinyin, and the corpus is intended for Chinese name research and entity recognition.
Updated 6/28/2021
Description
Dataset Overview
Basic Statistics
| Names | Surnames | Given names | Male (M) | Female (F) | Unknown gender |
|---|---|---|---|---|---|
| 3658109 | 808 | 710594 | 2054134 | 1509650 | 94325 |
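As a quick sanity check on the table above, the three gender columns should add up to the total number of names. A minimal sketch in Python, using only the figures from the table (the variable names are illustrative):

```python
# Figures copied from the Basic Statistics table.
total_names = 3_658_109
gender_counts = {"M": 2_054_134, "F": 1_509_650, "Unknown": 94_325}

# The three gender columns should partition the corpus.
assert sum(gender_counts.values()) == total_names

# Share of each gender label in the corpus.
gender_shares = {label: count / total_names for label, count in gender_counts.items()}
for label, share in gender_shares.items():
    print(f"{label}: {share:.2%}")
```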
Pinyin Version
- Surname pinyin comes from a custom Chinese Surname Phonetic Dictionary; given-name pinyin is generated with pypinyin.
- Three versions of the corpus are available for download: a Chinese-character-only version and two pinyin versions.
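The two-source annotation described above might look roughly like the sketch below. It is not the corpus author's code. `SURNAME_PINYIN` is a tiny stand-in for the Chinese Surname Phonetic Dictionary (its three entries are well-known surname readings, not copied from the corpus files), `annotate_name` is a hypothetical helper, and both the toneless output format and the fallback for surnames missing from the dictionary are assumptions. `lazy_pinyin` is pypinyin's function for toneless pinyin.

```python
from typing import Callable, Dict, List, Optional

try:
    from pypinyin import lazy_pinyin  # returns one toneless syllable per character
except ImportError:  # pypinyin is a third-party package (pip install pypinyin)
    lazy_pinyin = None

# Illustrative stand-in for the surname dictionary. Each of these characters has
# a surname reading that differs from its most common reading.
SURNAME_PINYIN: Dict[str, str] = {
    "单": "shan",  # commonly read "dan"
    "仇": "qiu",   # commonly read "chou"
    "解": "xie",   # commonly read "jie"
}


def annotate_name(
    surname: str,
    given: str,
    given_to_pinyin: Optional[Callable[[str], List[str]]] = lazy_pinyin,
) -> str:
    """Return '<surname pinyin> <given-name pinyin>' for one segmented name."""
    if given_to_pinyin is None:
        raise RuntimeError("pypinyin is not installed; pass given_to_pinyin explicitly")
    if surname in SURNAME_PINYIN:
        # The surname dictionary takes precedence over the general-purpose reading.
        surname_py = SURNAME_PINYIN[surname]
    else:
        # Assumption: fall back to the general-purpose converter for unlisted surnames.
        surname_py = "".join(given_to_pinyin(surname))
    given_py = "".join(given_to_pinyin(given))
    return f"{surname_py} {given_py}"


if lazy_pinyin is not None:
    print(annotate_name("单", "伟"))
```

Keeping the surname lookup separate from the given-name conversion matters because a general-purpose converter picks the everyday reading of a character, which is often wrong when that character is used as a surname.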
Sources
- Dataset origins:
  - Name Encyclopedia: contributes 2,513,097 entries.
  - Chinese Personal Names Corpus: contributes 1,145,012 entries.
- Processing details:
  - Segmented names that were originally unsegmented into surname and given name.
  - Removed approximately 300,000 duplicate entries.
  - Kept identical names with different genders as separate entries.
  - All unknown-gender entries come from the Chinese Personal Names Corpus.
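The deduplication rule above, where the same name with a different gender counts as a separate entry, can be sketched by keying on the (name, gender) pair rather than on the name alone. This is an illustrative sketch, not the author's processing script; the `Record` layout, the gender labels, and the `deduplicate` helper are assumptions. The final assertion only checks that the two per-source counts listed above add up to the corpus total.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (full name, gender label), layout assumed for illustration


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Drop exact (name, gender) repeats while keeping first-seen order."""
    seen = set()
    unique: List[Record] = []
    for name, gender in records:
        key = (name, gender)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


sample = [("王伟", "M"), ("王伟", "M"), ("王伟", "F"), ("李娜", "F")]
deduped = deduplicate(sample)
# The repeated ("王伟", "M") is dropped; ("王伟", "F") survives as its own entry.
print(deduped)

# Per-source contributions from the list above sum to the corpus total.
assert 2_513_097 + 1_145_012 == 3_658_109
```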
Chinese Surname Phonetic Dictionary
- Contains 1,606 Chinese surnames with their pinyin.
- 1,534 of these surnames and their pinyin come from Mingba Baijia Xing; the author annotated the remaining 72 manually.
Train/Test/Predict Splits
- Code is provided to split the corpus into training, testing, and prediction sets, with a default ratio of 6:2:2.
- A pre-split archive of the full Chinese-character version is also available for download.
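A 6:2:2 split of this kind can be sketched as below. This is not the splitting code shipped with the corpus: `split_corpus` is a hypothetical helper, and shuffling with a fixed seed before slicing is an assumption made here so the example is reproducible.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(
    records: Sequence[str],
    ratios: Tuple[int, int, int] = (6, 2, 2),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle records and slice them into (train, test, predict) by ratio."""
    if len(ratios) != 3 or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be three positive integers")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    total = sum(ratios)
    n = len(shuffled)
    train_end = n * ratios[0] // total
    test_end = train_end + n * ratios[1] // total
    # Any remainder from integer division goes to the prediction set.
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


names = [f"name_{i}" for i in range(10)]  # placeholder records
train, test, predict = split_corpus(names)
print(len(train), len(test), len(predict))
```

Passing a different `ratios` tuple, for example `(8, 1, 1)`, changes the proportions without touching the rest of the function.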
Topics
Chinese Name Research
Entity Recognition
Source
Organization: github
Created: 6/24/2021