
NER_corpus_chinese

A collection of Chinese NER corpora, including the People's Daily 1998 edition and the MSRA corpus, for named entity recognition tasks.

Updated 5/16/2024

Description

NER_corpus_chinese Dataset Overview

Main Corpora

  1. People's Daily 1998 Edition:

    • Used for word segmentation training.
    • Entity tags include /t (time), /nr (person name), /ns (place name), and /nt (organization name).
  2. MSRA Corpus:

    • Annotated in BIO format.
    • Contains three entity types: person, location, organization.
  3. Boson NLP Corpus:

    • Contains 2,000 paragraphs.
    • Annotated with six entity types, including time, company, and product names.
    • Small scale, about 1 MB.
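
The tag-based annotation of the People's Daily 1998 edition can be mapped to the BIO scheme used by the MSRA corpus. The following is a minimal sketch, assuming each line is a whitespace-separated sequence of `word/tag` tokens; the exact file layout in this repository is not confirmed here, bracketed compound entities are not handled, and the label names (`PER`, `LOC`, `ORG`, `TIME`), the function name, and the sample sentence are illustrative assumptions.

```python
# Hypothetical mapping from the tags listed above to entity labels.
# The label names are illustrative, not taken from the dataset.
TAG_TO_LABEL = {"nr": "PER", "ns": "LOC", "nt": "ORG", "t": "TIME"}


def tokens_to_bio(line):
    """Convert a 'word/tag word/tag ...' line into (character, BIO label) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so a slash inside the word is preserved.
        word, sep, tag = token.rpartition("/")
        if not sep or not word:
            continue  # skip tokens that are not in word/tag form
        label = TAG_TO_LABEL.get(tag)
        for i, ch in enumerate(word):
            if label is None:
                pairs.append((ch, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                pairs.append((ch, prefix + label))
    return pairs


# Made-up sample line, for illustration only.
sample = "北京/ns 的/u 天气/n"
result = tokens_to_bio(sample)
```

Each tagged word here becomes its own entity; multi-word names that the corpus splits across tokens would need an extra merging step.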

Additional Research Corpora

  1. People's Daily 2014 Edition:

    • Annotation format differs significantly from the 1998 edition.
    • Uses more fine-grained POS tags and supports nested entity annotations.
    • About 17.5 million characters in size; requires complex preprocessing.
  2. Unnamed Corpus:

    • Annotated in BIO format.
    • Contains three entity types: person, location, organization.
    • Size about 1.3 million characters.
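
For the BIO-annotated corpora, a common first step is to turn per-character labels back into entity spans. The sketch below assumes labels of the form `B-TYPE`, `I-TYPE`, and `O`; the type names (`PER`, `LOC`), the function name, and the sample sentence are illustrative assumptions, and the label strings actually used in the corpus files may differ.

```python
def bio_to_spans(chars, labels):
    """Collect (text, type, start, end) spans from parallel character and BIO label lists.

    An I- label that does not continue an open entity of the same type is ignored.
    """
    spans = []
    start = None
    etype = None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, label in enumerate(list(labels) + ["O"]):
        if label.startswith("I-") and etype == label[2:]:
            continue  # still inside the current entity
        if etype is not None:
            spans.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans


# Made-up sample sentence, for illustration only.
chars = list("张三在北京")
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
spans = bio_to_spans(chars, labels)
```

The end index is exclusive, so `chars[start:end]` recovers the entity text.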

Topics

Named Entity Recognition
Natural Language Processing

Source

Organization: github

Created: 4/3/2019
