Dataset asset · Open Source Community · Natural Language Processing · Pretraining Corpus
CLUECorpus2020
Cleaning the Chinese portion of the Common Crawl corpus yielded a high-quality 100 GB Chinese pre-training corpus. The dataset can be used directly for pre-training, language-modeling, or language-generation tasks, and a small vocabulary tailored to Simplified-Chinese NLP tasks has also been released.
Source
GitHub
Created
Nov 25, 2020
Updated
Dec 2, 2020
Signals
586 views
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
CLUECorpus2020
- Source & Processing: The Chinese segment of Common Crawl was cleaned to produce a 100 GB high‑quality Chinese pre‑training corpus.
- Characteristics:
- Directly usable for pre‑training, language‑modeling, or generative tasks.
- A small vocabulary dedicated to Simplified‑Chinese NLP tasks is released.
- Vocabulary Statistics:
  | Token Type | Google | CLUE |
  |---|---|---|
  | Simplified Chinese | 11378 | 5689 |
  | English | 3529 | 1320 |
  | Numbers | 1179 | 140 |
  | Special Tokens | 106 | 106 |
  | Other Tokens | 959 | 766 |
  | Total | 21128 | 8021 |
- Experimental Results:
  - Comparative performance of BERT-base trained on the small dataset, with detailed metrics on AFQMC, TNEWS, IFLYTEK, CMNLI, etc.
- Data Access:
- Application Process: Applicants must submit the research purpose, planned usage, institution, and a brief introduction via email, and commit not to redistribute to third parties.
- Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 100G Corpus.
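The vocabulary statistics above bucket entries into Simplified Chinese, English, numbers, special tokens, and other tokens. A minimal sketch of how such a breakdown could be computed is below; the category rules and regexes are illustrative assumptions, not the official CLUE counting script:

```python
import re

def categorize_token(token):
    """Roughly bucket a WordPiece vocabulary entry into the categories
    used in the statistics table. Heuristics are illustrative only."""
    # Strip the WordPiece continuation prefix before matching.
    t = token[2:] if token.startswith("##") else token
    if token.startswith("[") and token.endswith("]"):
        return "Special Tokens"          # e.g. [CLS], [SEP], [MASK]
    if re.fullmatch(r"[\u4e00-\u9fff]+", t):
        return "Simplified Chinese"      # CJK Unified Ideographs range
    if re.fullmatch(r"[A-Za-z]+", t):
        return "English"
    if re.fullmatch(r"[0-9]+", t):
        return "Numbers"
    return "Other Tokens"                # punctuation, symbols, mixed
```

Running this over each line of a vocabulary file and tallying the results would reproduce a table of the same shape as the one above.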
CLUECorpusSmall (14 GB)
- Use Cases: Suitable for language modeling, pre‑training, or generative tasks.
- Size: Over 14 GB across nearly 4,000 well-organized txt files, containing about 5 billion characters.
- Source: Mainly derived from the nlp_chinese_corpus project.
- Format: One sentence per line, with blank lines separating documents.
- Sub‑corpora:
  - News Corpus (news2016zh_corpus): 8 GB, 2,000 small files.
  - Web Interaction Corpus (webText2019zh_corpus): 3 GB, over 900 small files.
  - Wikipedia Corpus (wiki2019zh_corpus): ~1.1 GB, about 300 small files.
  - Comments Corpus (comments2019zh_corpus): ~2.3 GB, 784 small files.
- Download Method: Provided via Baidu Cloud with password protection.
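Given the format described above (one sentence per line, blank lines separating documents), a minimal Python sketch for splitting a corpus file into documents might look like this; the file path is whatever txt file you downloaded:

```python
def read_documents(path):
    """Yield documents from a corpus file where each non-empty line is
    one sentence and a blank line marks a document boundary."""
    doc = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                doc.append(line)       # sentence within the current document
            elif doc:
                yield doc              # blank line closes the document
                doc = []
    if doc:
        yield doc                      # final document lacking a trailing blank line
```

Each yielded item is a list of sentences, which can then be fed to a tokenizer or concatenated for language-model pre-training.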