nlp_chinese_corpus

A large‑scale Chinese natural‑language‑processing corpus containing diverse types of Chinese text such as Wikipedia, news, and encyclopedia Q&A, intended to support research and applications in Chinese NLP.

Updated 9/5/2019

Description

Dataset Overview

Dataset Goals

  • Phase 1: 10 corpora at the million scale and 3 at the ten‑million scale (by 2019‑05‑01)
  • Phase 2: 30 corpora at the million scale, 10 at the ten‑million scale, and 1 at the hundred‑million scale (by 2019‑12‑31)

Dataset Updates

  • Added high‑quality community Q&A in JSON format (webtext2019zh), 4.1 M QA pairs, suitable for training large models
  • Added 5.2 M translation pairs (translation2019zh)

Dataset Contents

  1. Wikipedia (wiki2019zh)

    • Size: 1 M well‑structured Chinese articles
    • Uses: general Chinese corpus, pre‑training, word‑embedding, knowledge QA
    • Format: {"id":,"url":,"title":
  2. News Corpus (news2016zh)

    • Size: 2.5 M news articles with keywords and descriptions
    • Uses: general Chinese corpus, word‑embedding, pre‑training, headline generation, keyword generation
    • Format: {news_id:<news_id>,title:
  3. Encyclopedia QA (baike2018qa)

    • Size: 1.5 M QA pairs with question types
    • Uses: general Chinese corpus, word‑embedding, pre‑training, encyclopedia QA
    • Format: {"qid":,"category":,"title":
  4. Community QA (webtext2019zh)

    • Size: 4.1 M high‑quality community QA pairs
    • Uses: encyclopedia QA construction, topic prediction, community QA systems, general Chinese corpus, large‑model pre‑training
    • Format: {"qid":,"title":
  5. Translation Corpus (translation2019zh)

    • Size: 5.2 M English‑Chinese parallel sentences
    • Uses: Chinese‑English translation systems, general Chinese corpus, word‑embedding, pre‑training
    • Format: {"english":,"chinese":}
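
The record formats listed above suggest that each corpus stores one JSON object per record. The following is a minimal sketch of reading translation2019zh‑style records in Python, assuming the data is laid out as JSON Lines (one object per line); that layout is an assumption, not something stated here. The helper names iter_records and iter_translation_pairs and the sample sentences are hypothetical and serve only as illustration.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line, assuming JSON Lines input."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        yield json.loads(line)


def iter_translation_pairs(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (english, chinese) tuples from records with those two keys."""
    for record in iter_records(stream):
        yield record["english"], record["chinese"]


# Made-up sample data standing in for a real corpus file.
sample = io.StringIO(
    '{"english": "Hello, world.", "chinese": "你好，世界。"}\n'
    "\n"
    '{"english": "Thank you.", "chinese": "谢谢。"}\n'
)

pairs = list(iter_translation_pairs(sample))
for english, chinese in pairs:
    print(english, "|", chinese)
```

With a real download, the StringIO object could be replaced by a file opened with encoding="utf-8". Reading line by line keeps memory use flat, which matters for corpora with millions of records. The same iter_records helper should work for the other corpora if they share the one‑object‑per‑line layout, with the key names swapped for those shown in each Format entry.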

Contribution

  • Submit contributions via email to nlp_chinese_corpus@163.com
  • Top 20 contributors receive keyboards, mice, monitors, wireless headsets, smart speakers, or equivalent items.

Topics

Natural Language Processing
Chinese Corpus

Source

Organization: github

Created: 8/15/2019
