CLUECorpus2020
By cleaning the Chinese portion of the Common Crawl corpus, a high‑quality 100 GB Chinese pre‑training corpus was obtained. This dataset can be directly used for pre‑training, language‑modeling, or language‑generation tasks, and a small vocabulary specifically for Simplified‑Chinese NLP tasks has been released.
Updated 12/2/2020
Description
Dataset Overview
CLUECorpus2020
- Source & Processing: The Chinese segment of Common Crawl was cleaned to produce a 100 GB high‑quality Chinese pre‑training corpus.
- Characteristics:
- Directly usable for pre‑training, language‑modeling, or generative tasks.
- A small vocabulary dedicated to Simplified‑Chinese NLP tasks is released.
- Vocabulary Statistics:
| Token Type | Google | CLUE |
| --- | --- | --- |
| Simplified Chinese | 11378 | 5689 |
| English | 3529 | 1320 |
| Numbers | 1179 | 140 |
| Special Tokens | 106 | 106 |
| Other Tokens | 959 | 766 |
| Total | 21128 | 8021 |
- Experimental Results:
- Comparative performance on BERT‑base using the small dataset, with detailed metrics on AFQMC, TNEWS, IFLYTEK, CMNLI, etc.
- Data Access:
- Application Process: Applicants must submit the research purpose, planned usage, institution, and a brief introduction via email, and commit not to redistribute to third parties.
- Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 100G Corpus.
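The vocabulary statistics above can be checked with a little arithmetic. The following is a minimal sketch, not part of the CLUE release: the names `VOCAB_COUNTS`, `listed_sums`, and `clue_share` are hypothetical, and the numbers are copied from the table as shown on this page.

```python
# Token counts copied from the vocabulary statistics table: (Google, CLUE).
VOCAB_COUNTS = {
    "Simplified Chinese": (11378, 5689),
    "English": (3529, 1320),
    "Numbers": (1179, 140),
    "Special Tokens": (106, 106),
    "Other Tokens": (959, 766),
}
GOOGLE_TOTAL, CLUE_TOTAL = 21128, 8021  # the "Total" row of the table


def listed_sums(counts):
    """Sum the listed rows for each vocabulary (hypothetical helper)."""
    google = sum(g for g, _ in counts.values())
    clue = sum(c for _, c in counts.values())
    return google, clue


def clue_share(google_total, clue_total):
    """Size of the CLUE vocabulary as a fraction of the Google vocabulary."""
    return clue_total / google_total


if __name__ == "__main__":
    google_sum, clue_sum = listed_sums(VOCAB_COUNTS)
    print("Listed rows sum to:", google_sum, clue_sum)
    print("CLUE / Google size ratio:", round(clue_share(GOOGLE_TOTAL, CLUE_TOTAL), 3))
```

The CLUE rows add up to the stated total of 8021, so the CLUE vocabulary is roughly 38% the size of the Google one. The listed Google rows add up to 17151 rather than 21128, which suggests the table as shown here omits some Google-only token categories.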
CLUECorpusSmall (14 GB)
- Use Cases: Suitable for language modeling, pre‑training, or generative tasks.
- Size: Over 14 GB in nearly 4,000 txt files, containing about 5 billion characters in total.
- Source: Mainly derived from the nlp_chinese_corpus project.
- Format: One sentence per line, with blank lines separating documents.
- Sub‑corpora:
- News Corpus (news2016zh_corpus): 8 GB, 2,000 small files.
- Web Interaction Corpus (webText2019zh_corpus): 3 GB, over 900 small files.
- Wikipedia Corpus (wiki2019zh_corpus): ~1.1 GB, about 300 small files.
- Comments Corpus (comments2019zh_corpus): ~2.3 GB, 784 small files.
- Download Method: Provided via Baidu Cloud with password protection.
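The format described above (one sentence per line, blank lines separating documents) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `iter_documents` and `iter_corpus` are hypothetical, the files are assumed to be UTF-8 encoded, and the corpus is assumed to be unpacked into a local directory of `.txt` files.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into documents; each document is a list of sentences.

    A blank line ends the current document. Consecutive blank lines
    do not produce empty documents.
    """
    doc: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            doc.append(sentence)
        elif doc:
            yield doc
            doc = []
    if doc:  # last document may not be followed by a blank line
        yield doc


def iter_corpus(root: str) -> Iterator[List[str]]:
    """Yield documents from every .txt file under `root` (assumed layout)."""
    for path in sorted(Path(root).rglob("*.txt")):
        # UTF-8 is an assumption; adjust if the files use another encoding.
        with path.open(encoding="utf-8") as handle:
            yield from iter_documents(handle)
```

Because both functions are generators, the 14 GB corpus can be streamed file by file without loading it into memory, for example with `for doc in iter_corpus("path/to/corpus"): ...`, where the path is a placeholder.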
Topics
Natural Language Processing
Pretraining Corpus
Source
Organization: github
Created: 11/25/2020