Dataset asset | Open Source Community | Entity Recognition | Chinese Name Research

CCNC

CCNC is a large Chinese name corpus containing 3,658,109 name samples, sourced from the Name Encyclopedia and the Chinese Personal Names Corpus. After processing and adding phonetic annotations, it is used for Chinese name research and entity recognition.

Source: GitHub
Created: Jun 24, 2021
Updated: Jun 28, 2021
Dataset Overview

Basic Statistics

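  • Names: 3,658,109
  • Last names and first names: 808710594 (the two counts are run together in the source and cannot be separated unambiguously)
  • Male (M): 2,054,134
  • Female (F): 1,509,650
  • Unknown: 94,325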

Pinyin Version

Sources

  • Dataset origins: the Name Encyclopedia and the Chinese Personal Names Corpus.
  • Processing details:
    • Separated surname and given name in names that were originally unsegmented.
    • Removed approximately 300,000 duplicate entries.
    • Kept a name that occurs with more than one gender as separate entries, one per gender.
    • Entries with unknown gender come from the Chinese Personal Names Corpus.
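
The deduplication rule described above can be sketched as follows. This is a minimal illustration, not the author's processing code: the (name, gender) tuple layout, the "U" label for unknown gender, and the function name deduplicate are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (name, gender), with gender "M", "F", or "U" (unknown).
Record = Tuple[str, str]


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Drop exact (name, gender) repeats, keeping first-seen order.

    The same name with a different gender has a different key,
    so it survives as a separate entry.
    """
    seen = set()
    unique: List[Record] = []
    for name, gender in records:
        key = (name, gender)
        if key in seen:
            continue  # exact duplicate: same name and same gender
        seen.add(key)
        unique.append(key)
    return unique


sample = [("王伟", "M"), ("李娜", "F"), ("王伟", "M"), ("王伟", "F"), ("张敏", "U")]
deduped = deduplicate(sample)
```

In this sample the second ("王伟", "M") is dropped, while ("王伟", "F") is kept because the gender differs.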

Chinese Surname Phonetic Dictionary

  • Contains 1,606 Chinese surnames with their pinyin.
  • The pinyin for 1,534 of these surnames comes from Mingba Baijia Xing; the author annotated the remaining 72 manually.
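
One way a surname-to-pinyin dictionary like this could be used is to split a full name into surname and given name by longest-prefix match. The sketch below is only illustrative: the five entries stand in for the 1,606-entry dictionary, the toneless pinyin style is an assumption (the real file's format is not shown here), and split_name is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Illustrative stand-in for the surname dictionary; format and pinyin style are assumed.
SURNAME_PINYIN: Dict[str, str] = {
    "王": "wang",
    "李": "li",
    "欧": "ou",
    "欧阳": "ouyang",
    "司马": "sima",
}


def split_name(
    full_name: str,
    surnames: Dict[str, str] = SURNAME_PINYIN,
    max_surname_len: int = 2,
) -> Optional[Tuple[str, str, str]]:
    """Return (surname, given_name, surname_pinyin), or None if no surname matches.

    Longer surnames are tried first, and at least one character is
    always left over for the given name.
    """
    longest = min(max_surname_len, len(full_name) - 1)
    for length in range(longest, 0, -1):
        candidate = full_name[:length]
        if candidate in surnames:
            return candidate, full_name[length:], surnames[candidate]
    return None


compound = split_name("欧阳修")
single = split_name("王伟")
missing = split_name("赵云")  # 赵 is not in the illustrative dictionary
```

Longest-prefix matching is a heuristic: a person surnamed 欧 whose given name begins with 阳 would be split incorrectly, so ambiguous cases may need manual review.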

Train/Test/Predict Splits

  • Code is provided to split the corpus into training, testing, and prediction sets; the default ratio is 6:2:2.
  • A pre-split, compressed archive of the full Chinese-character version is available for download.
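
A 6:2:2 split of this kind can be sketched as below. This is not the repository's own splitting code: the function name split_corpus, the fixed seed, and the choice to shuffle before cutting are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_corpus(
    items: Sequence[T],
    ratios: Tuple[int, int, int] = (6, 2, 2),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of items and cut it into train/test/predict parts.

    Any remainder left by integer division goes to the last (predict) part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    total = sum(ratios)
    n = len(shuffled)
    train_end = n * ratios[0] // total
    test_end = train_end + n * ratios[1] // total
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


train, test, predict = split_corpus(range(10))
```

With ten items the three parts contain 6, 2, and 2 items. Seeding the shuffle keeps the split stable across runs, which matters if results on the test set are to be compared later.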