nlp_chinese_corpus
A large‑scale Chinese natural‑language‑processing corpus containing diverse types of Chinese text such as Wikipedia, news, and encyclopedia Q&A, intended to support research and applications in Chinese NLP.
Updated 9/5/2019
Dataset Overview
Dataset Goals
- Phase 1: 10 Chinese corpora at the million scale and 3 at the ten-million scale (by 2019-05-01)
- Phase 2: 30 Chinese corpora at the million scale, 10 at the ten-million scale, and 1 at the hundred-million scale (by 2019-12-31)
Dataset Updates
- Added high‑quality community Q&A in JSON format (webtext2019zh), 4.1 M QA pairs, suitable for training large models
- Added 5.2 M translation pairs (translation2019zh)
Dataset Contents
- Wikipedia (wiki2019zh)
  - Size: 1 M well‑structured Chinese articles
  - Uses: general Chinese corpus, pre‑training, word‑embedding, knowledge QA
  - Format: {"id": <id>, "url": <url>, "title": <title>
- News Corpus (news2016zh)
  - Size: 2.5 M news articles with keywords and descriptions
  - Uses: general Chinese corpus, word‑embedding, pre‑training, headline generation, keyword generation
  - Format: {news_id:<news_id>,title:
- Encyclopedia QA (baike2018qa)
  - Size: 1.5 M QA pairs with question types
  - Uses: general Chinese corpus, word‑embedding, pre‑training, encyclopedia QA
  - Format: {"qid": <qid>, "category": <category>, "title": <title>
- Community QA (webtext2019zh)
  - Size: 4.1 M high‑quality community QA pairs
  - Uses: encyclopedia QA construction, topic prediction, community QA systems, general Chinese corpus, large‑model pre‑training
  - Format: {"qid": <qid>, "title": <title>
- Translation Corpus (translation2019zh)
  - Size: 5.2 M English‑Chinese parallel sentences
  - Uses: Chinese‑English translation systems, general Chinese corpus, word‑embedding, pre‑training
  - Format: {"english": <english>, "chinese": <chinese>}
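The translation corpus describes each record as a JSON object with an `english` and a `chinese` field. A minimal sketch of reading such records follows. It assumes the data is stored as one JSON object per line, which this page does not state. The helper name `iter_records` and the sample sentences are hypothetical and used only for illustration.

```python
import io
import json


def iter_records(stream, required_keys=("english", "chinese")):
    """Yield dict records from a text stream holding one JSON object per line.

    Assumption: one record per line (not confirmed by the dataset page).
    Blank lines, unparsable lines, and records missing a required key
    are skipped rather than raising.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and all(k in record for k in required_keys):
            yield record


# Made-up sample lines standing in for a real corpus file.
sample = io.StringIO(
    '{"english": "Hello, world.", "chinese": "你好，世界。"}\n'
    "\n"
    "not json\n"
    '{"english": "missing the other field"}\n'
)

pairs = [(r["english"], r["chinese"]) for r in iter_records(sample)]
print(pairs)
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8")`. Passing different `required_keys` would adapt the sketch to the other corpora, whose full field lists are not shown on this page.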
Contribution
- Submit contributions via email to nlp_chinese_corpus@163.com
- Top 20 contributors receive keyboards, mice, monitors, wireless headsets, smart speakers, or equivalent items.
Topics
Natural Language Processing
Chinese Corpus
Source
Organization: github
Created: 8/15/2019