Dataset asset | Open Source Community | Natural Language Processing | Chinese Corpus

nlp_chinese_corpus

A large‑scale Chinese natural‑language‑processing corpus containing diverse types of Chinese text such as Wikipedia, news, and encyclopedia Q&A, intended to support research and applications in Chinese NLP.

Source
GitHub
Created
Aug 15, 2019
Updated
Sep 5, 2019
Overview

Dataset Goals

  • Phase 1: 10 Chinese corpora at the million‑record scale and 3 at the ten‑million‑record scale (by 2019‑05‑01)
  • Phase 2: 30 Chinese corpora at the million‑record scale, 10 at the ten‑million‑record scale, and 1 at the hundred‑million‑record scale (by 2019‑12‑31)

Dataset Updates

  • Added high‑quality community Q&A in JSON format (webtext2019zh), 4.1 M QA pairs, suitable for training large models
  • Added 5.2 M translation pairs (translation2019zh)

Dataset Contents

  1. Wikipedia (wiki2019zh)

    • Size: 1 M well‑structured Chinese articles
    • Uses: general Chinese corpus, pre‑training, word‑embedding, knowledge QA
    • Format: {"id":,"url":,"title":
  2. News Corpus (news2016zh)

    • Size: 2.5 M news articles with keywords and descriptions
    • Uses: general Chinese corpus, word‑embedding, pre‑training, headline generation, keyword generation
    • Format: {news_id:<news_id>,title:
  3. Encyclopedia QA (baike2018qa)

    • Size: 1.5 M QA pairs with question types
    • Uses: general Chinese corpus, word‑embedding, pre‑training, encyclopedia QA
    • Format: {"qid":,"category":,"title":
  4. Community QA (webtext2019zh)

    • Size: 4.1 M high‑quality community QA pairs
    • Uses: encyclopedia QA construction, topic prediction, community QA systems, general Chinese corpus, large‑model pre‑training
    • Format: {"qid":,"title":
  5. Translation Corpus (translation2019zh)

    • Size: 5.2 M English‑Chinese parallel sentences
    • Uses: Chinese‑English translation systems, general Chinese corpus, word‑embedding, pre‑training
    • Format: {"english":,"chinese":}
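
As a rough illustration of how records in the translation2019zh format might be read, the sketch below parses the "english" and "chinese" fields listed above. It assumes the file stores one JSON object per line, which this listing does not confirm. The function name iter_translation_pairs and the sample sentences are hypothetical and are not taken from the corpus.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_translation_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (english, chinese) tuples from JSON-encoded lines.

    Assumption: each non-blank line holds one JSON object with the
    "english" and "chinese" keys shown in the format description.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["english"], record["chinese"]


# Illustrative sample records (hypothetical, not from the corpus).
sample = [
    '{"english": "Hello, world.", "chinese": "你好，世界。"}',
    '',
    '{"english": "Thank you.", "chinese": "谢谢。"}',
]
pairs = list(iter_translation_pairs(sample))
print(len(pairs))
```

To read from disk, the same function should accept a file handle opened with encoding="utf-8", since a text file object iterates line by line. The actual file names in the release are not given here, so check the source repository before relying on any path.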

Contribution

  • Submit contributions via email to nlp_chinese_corpus@163.com
  • Top 20 contributors receive keyboards, mice, monitors, wireless headsets, smart speakers, or equivalent items.