
CCNC

CCNC is a large Chinese name corpus containing 3,658,109 name samples, sourced from the Name Encyclopedia and the Chinese Personal Names Corpus. The names have been processed and annotated with pinyin, and the corpus is intended for Chinese name research and entity recognition.

Updated 6/28/2021
github

Description

Dataset Overview

Basic Statistics

names      last names  first names  Male (M)   Female (F)  Unknown
3,658,109  808         710,594      2,054,134  1,509,650   94,325
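
As a quick sanity check on the statistics, the Male (M), Female (F), and Unknown counts should add up to the total number of names. The sketch below verifies that and prints each label's share of the corpus. The constant and function names are illustrative and not part of the dataset.

```python
# Counts copied from the Basic Statistics table.
TOTAL_NAMES = 3_658_109
GENDER_COUNTS = {"M": 2_054_134, "F": 1_509_650, "Unknown": 94_325}


def gender_shares(counts, total):
    """Return each label's share of the corpus as a fraction of `total`."""
    if sum(counts.values()) != total:
        raise ValueError("gender counts do not add up to the total")
    return {label: n / total for label, n in counts.items()}


shares = gender_shares(GENDER_COUNTS, TOTAL_NAMES)
for label, share in shares.items():
    # Print each share as a percentage with two decimals.
    print(f"{label}: {share:.2%}")
```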

Pinyin Version

Sources

  • Dataset origins: the Name Encyclopedia and the Chinese Personal Names Corpus.
  • Processing details:
    • Split names that were originally unsegmented into last name and first name.
    • Removed approximately three hundred thousand duplicate entries.
    • Kept identical names with different genders as separate entries.
    • Unknown-gender entries are from the Chinese Personal Names Corpus.
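
The deduplication rule in the processing details can be sketched as follows. This is a minimal illustration, not the author's actual code: it assumes each record is a hypothetical `(name, gender)` pair, with `None` standing in for unknown gender. Using the pair as the key removes exact duplicates while keeping the same name under different genders as separate entries.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (name, gender); gender is "M", "F", or None.
Record = Tuple[str, Optional[str]]


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Drop exact (name, gender) repeats while preserving first-seen order."""
    seen = set()
    unique = []
    for name, gender in records:
        key = (name, gender)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


sample = [
    ("王芳", "F"),
    ("王芳", "F"),   # exact duplicate, removed
    ("张伟", "M"),
    ("王芳", "M"),   # same name, different gender, kept
    ("李娜", None),  # unknown gender
]
print(deduplicate(sample))
```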

Chinese Surname Phonetic Dictionary

  • Contains 1,606 Chinese surnames with their pinyin.
  • 1,534 surnames and their pinyin come from Mingba Baijia Xing; the remaining 72 were annotated manually by the author.
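
One way a surname dictionary like this might be used is to split a full name into surname and given name while looking up the surname's pinyin. The sketch below is an assumption about usage, not the repository's implementation: the tiny `SURNAME_PINYIN` mapping, its toneless pinyin format, and the `split_name` helper are all hypothetical. It tries the longest surname first so that compound surnames such as 欧阳 take priority over their single-character prefixes.

```python
# Hypothetical excerpt; the real dictionary holds 1,606 surnames and its
# on-disk format may differ.
SURNAME_PINYIN = {"王": "wang", "李": "li", "欧": "ou", "欧阳": "ouyang"}


def split_name(full_name, surnames=SURNAME_PINYIN):
    """Return (surname, given_name, surname_pinyin), or None if no match."""
    max_len = max(len(s) for s in surnames)
    # Leave at least one character for the given name.
    longest = min(max_len, len(full_name) - 1)
    for size in range(longest, 0, -1):
        candidate = full_name[:size]
        if candidate in surnames:
            return candidate, full_name[size:], surnames[candidate]
    return None


print(split_name("欧阳修"))
print(split_name("王芳"))
print(split_name("赵云"))  # surname missing from the excerpt, so None
```

Longest-match splitting is a heuristic: a person surnamed 欧 whose given name begins with 阳 would be split incorrectly, so real pipelines usually need extra evidence for such cases.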

Train/Test/Predict Splits

  • Code is provided to split the corpus into training, test, and prediction sets, with a default ratio of 6:2:2.
  • A pre-split, compressed archive of the full Chinese character version is available for download.
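
A 6:2:2 split of this kind can be sketched in a few lines. The repository's own splitting code is not reproduced here, so the `split_corpus` function, its parameters, and the fixed seed are assumptions made for illustration.

```python
import random


def split_corpus(records, ratios=(6, 2, 2), seed=0):
    """Shuffle records and cut them into train/test/predict parts by ratio."""
    items = list(records)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n = len(items)
    train_end = n * ratios[0] // total
    test_end = train_end + n * ratios[1] // total
    # Whatever remains after the first two cuts goes to the prediction set.
    return items[:train_end], items[train_end:test_end], items[test_end:]


train, test, predict = split_corpus(range(10))
print(len(train), len(test), len(predict))
```

Because the cut points use integer division, any leftover records from rounding end up in the prediction set.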

Topics

Chinese Name Research
Entity Recognition

Source

Organization: github

Created: 6/24/2021
