Tags: Open Source Community, Natural Language Processing, Pretraining Corpus

CLUECorpus2020

CLUECorpus2020 is a high-quality 100 GB Chinese pre-training corpus obtained by cleaning the Chinese portion of Common Crawl. It can be used directly for pre-training, language modeling, or language generation tasks, and a small vocabulary tailored to Simplified-Chinese NLP tasks has been released alongside it.

Source: GitHub
Created: Nov 25, 2020
Updated: Dec 2, 2020
Overview

CLUECorpus2020

  • Source & Processing: The Chinese segment of Common Crawl was cleaned to produce a 100 GB high‑quality Chinese pre‑training corpus.
  • Characteristics:
    • Directly usable for pre‑training, language‑modeling, or generative tasks.
    • A small vocabulary dedicated to Simplified‑Chinese NLP tasks has been released (a loading sketch follows this list).
  • Vocabulary Statistics (token counts, Google BERT vocab vs. CLUE vocab):

      Token Type           Google    CLUE
      Simplified Chinese    11378    5689
      English                3529    1320
      Numbers                1179     140
      Special Tokens          106     106
      Other Tokens            959     766
      Total                 21128    8021
  • Experimental Results:
    • Comparative performance of BERT‑base pre‑trained on the small dataset, with detailed metrics on AFQMC, TNEWS, IFLYTEK, CMNLI, etc.
  • Data Access:
    • Application Process: Applicants must submit the research purpose, planned usage, institution, and a brief introduction via email, and commit not to redistribute to third parties.
    • Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 100G Corpus.
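
A minimal sketch of loading the released small vocabulary with the Hugging Face transformers library, assuming the file uses the standard one-token-per-line BERT vocabulary format; the filename vocab_clue.txt is an assumption, so check the repository for the actual path:

    from transformers import BertTokenizer

    # Path to the released Simplified-Chinese vocabulary file; the exact
    # filename is an assumption, consult the CLUECorpus2020 repository.
    VOCAB_PATH = "vocab_clue.txt"

    # BertTokenizer can be constructed directly from a plain vocab file
    # (one token per line), the standard BERT vocabulary format.
    tokenizer = BertTokenizer(vocab_file=VOCAB_PATH)

    print(len(tokenizer))                      # ~8021 tokens for the CLUE vocab
    print(tokenizer.tokenize("自然语言处理"))   # mostly character-level pieces

Because the CLUE vocabulary is roughly a third the size of Google's 21128-token vocab, a model trained with it carries a correspondingly smaller embedding matrix, which saves memory for Simplified-Chinese-only tasks.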

CLUECorpusSmall (14 GB)

  • Use Cases: Suitable for language modeling, pre‑training, or generative tasks.
  • Size: over 14 GB, comprising nearly 4,000 cleaned .txt files with about 5 billion characters in total.
  • Source: Mainly derived from the nlp_chinese_corpus project.
  • Format: one sentence per line, with blank lines separating documents (a parsing sketch follows this list).
  • Sub‑corpora:
    • News Corpus news2016zh_corpus: 8 GB, 2,000 small files.
    • Web Interaction Corpus webText2019zh_corpus: 3 GB, over 900 small files.
    • Wikipedia Corpus wiki2019zh_corpus: ~1.1 GB, about 300 small files.
    • Comments Corpus comments2019zh_corpus: ~2.3 GB, 784 small files.
  • Download Method: Provided via Baidu Cloud with password protection.
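
A short sketch of reading the corpus in the format described above (one sentence per line, blank lines between documents); the directory name clue_corpus_small is a hypothetical placeholder for wherever the downloaded files are extracted:

    from pathlib import Path
    from typing import Iterator, List

    def iter_documents(path: Path) -> Iterator[List[str]]:
        """Yield each document as a list of sentences.

        Assumes one sentence per line, with a blank line marking a
        document boundary, as described for CLUECorpusSmall.
        """
        sentences: List[str] = []
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    sentences.append(line)
                elif sentences:        # blank line ends the current document
                    yield sentences
                    sentences = []
        if sentences:                  # final document may lack a trailing blank line
            yield sentences

    corpus_dir = Path("clue_corpus_small")  # hypothetical extraction directory
    for txt_file in sorted(corpus_dir.glob("**/*.txt")):
        for doc in iter_documents(txt_file):
            pass  # feed `doc` into a tokenizer / pre-training data pipeline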