NER_corpus_chinese
A collection of Chinese NER corpora, including the People's Daily 1998 edition and the MSRA corpus, for named entity recognition tasks.
Updated 5/16/2024
Description
NER_corpus_chinese Dataset Overview
Main Corpora
- People's Daily 1998 Edition:
  - Used for word segmentation training.
  - Entity tags include /t, /nr, /ns, /nt.
- MSRA Corpus:
  - Annotated in BIO format.
  - Contains three entity types: person, location, organization.
- Boson NLP Corpus:
  - Contains 2,000 paragraphs.
  - Annotated with six entity types, including time, company, and product names.
  - Small scale, about 1 MB.
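As a rough illustration of how the People's Daily 1998 entity tags could be turned into character-level BIO labels, here is a minimal Python sketch. It assumes each line is a sequence of whitespace-separated `word/tag` tokens, and it assumes the usual reading of the tags (/t as time, /nr as person, /ns as location, /nt as organization). The label names `TIME`, `PER`, `LOC`, `ORG`, the function name, and the sample sentence are all hypothetical. Bracketed compound entities and multi-token names, if present in the corpus, are not handled here.

```python
# Assumed mapping from the corpus entity tags to illustrative label names.
TAG_TO_LABEL = {"t": "TIME", "nr": "PER", "ns": "LOC", "nt": "ORG"}


def tokens_to_char_bio(line):
    """Convert 'word/tag word/tag ...' into a list of (char, BIO label) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so the tag is whatever follows it.
        word, sep, tag = token.rpartition("/")
        if not sep or not word:
            continue  # skip tokens that do not look like 'word/tag'
        label = TAG_TO_LABEL.get(tag)
        for i, ch in enumerate(word):
            if label is None:
                pairs.append((ch, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                pairs.append((ch, prefix + label))
    return pairs


# Hypothetical sample line, not taken from the corpus.
sample = "北京/ns 举行/v 新年/t 音乐会/n"
result = tokens_to_char_bio(sample)
```

Each token is treated as its own entity, so adjacent tokens that belong to one name (for example a surname and given name tagged separately) would need an extra merging step.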
Additional Research Corpora
- People's Daily 2014 Edition:
  - Annotation format differs significantly from the 1998 edition.
  - Uses more fine-grained POS tags and supports nested entity annotations.
  - About 17.5 million characters; requires complex preprocessing.
- Unnamed Corpus:
  - Annotated in BIO format.
  - Contains three entity types: person, location, organization.
  - About 1.3 million characters.
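For the BIO-annotated corpora (the MSRA corpus and the unnamed corpus), a common first step is to decode the per-character labels back into entity spans. The sketch below is one way to do that, assuming labels of the form `B-X`, `I-X`, and `O`. The label names `PER`, `LOC`, `ORG`, the function name, and the sample sentence are assumptions for illustration; the actual files may use different tag names or a different on-disk layout.

```python
def bio_to_spans(chars, labels):
    """Decode parallel lists of characters and BIO labels into entity spans.

    Returns a list of (text, entity_type, start, end) tuples, with end exclusive.
    """
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and etype == label[2:]
        if start is not None and not continues:
            spans.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans


# Hypothetical sample, not taken from either corpus.
chars = list("张三在北京工作")
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
spans = bio_to_spans(chars, labels)
```

In this sketch an `I-` label that does not follow a matching `B-` is silently ignored; a stricter decoder might raise an error or treat it as the start of a new entity.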
Topics
Named Entity Recognition
Natural Language Processing
Source
Organization: github
Created: 4/3/2019