Tags: Open Source Community, Natural Language Processing, Pretraining Corpus

CLUECorpus2020

CLUECorpus2020 is a high-quality 100 GB Chinese pre-training corpus obtained by cleaning the Chinese portion of Common Crawl. It can be used directly for pre-training, language modeling, or language generation tasks, and a small vocabulary tailored to Simplified-Chinese NLP tasks has been released alongside it.

Source: GitHub
Created: Nov 25, 2020
Updated: Dec 2, 2020
Overview

CLUECorpus2020

  • Source & Processing: The Chinese segment of Common Crawl was cleaned to produce a 100 GB high‑quality Chinese pre‑training corpus.
  • Characteristics:
    • Directly usable for pre‑training, language‑modeling, or generative tasks.
    • A small vocabulary dedicated to Simplified‑Chinese NLP tasks has been released (a loading sketch follows this list).
  • Vocabulary Statistics (token counts, Google BERT vocab vs. CLUE vocab):

      Token Type           Google    CLUE
      Simplified Chinese    11378    5689
      English                3529    1320
      Numbers                1179     140
      Special Tokens          106     106
      Other Tokens            959     766
      Total                 21128    8021
  • Experimental Results:
    • Comparative performance of BERT‑base pre‑trained on the small dataset, with detailed metrics on AFQMC, TNEWS, IFLYTEK, CMNLI, etc.
  • Data Access:
    • Application Process: Applicants must submit the research purpose, planned usage, institution, and a brief introduction via email, and commit not to redistribute to third parties.
    • Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 100G Corpus.
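
A minimal sketch of loading the released small vocabulary with the Hugging Face transformers library, assuming the file uses the standard one-token-per-line BERT vocabulary format; the filename vocab_clue.txt is an assumption, so check the repository for the actual path:

    from transformers import BertTokenizer

    # Path to the released Simplified-Chinese vocabulary file; the exact
    # filename is an assumption, consult the CLUECorpus2020 repository.
    VOCAB_PATH = "vocab_clue.txt"

    # BertTokenizer can be constructed directly from a plain vocab file
    # (one token per line), the standard BERT vocabulary format.
    tokenizer = BertTokenizer(vocab_file=VOCAB_PATH)

    print(len(tokenizer))                      # ~8021 tokens for the CLUE vocab
    print(tokenizer.tokenize("自然语言处理"))   # mostly character-level pieces

Because the CLUE vocabulary is roughly a third the size of Google's 21128-token vocab, a model trained with it carries a correspondingly smaller embedding matrix, which saves memory for Simplified-Chinese-only tasks.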

CLUECorpusSmall (14 GB)

  • Use Cases: Suitable for language modeling, pre‑training, or generative tasks.
  • Size: over 14 GB, comprising nearly 4,000 cleaned .txt files with about 5 billion characters in total.
  • Source: Mainly derived from the nlp_chinese_corpus project.
  • Format: one sentence per line, with blank lines separating documents (a parsing sketch follows this list).
  • Sub‑corpora:
    • News Corpus news2016zh_corpus: 8 GB, 2,000 small files.
    • Web Interaction Corpus webText2019zh_corpus: 3 GB, over 900 small files.
    • Wikipedia Corpus wiki2019zh_corpus: ~1.1 GB, about 300 small files.
    • Comments Corpus comments2019zh_corpus: ~2.3 GB, 784 small files.
  • Download Method: Provided via Baidu Cloud with password protection.
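
A short sketch of reading the corpus in the format described above (one sentence per line, blank lines between documents); the directory name clue_corpus_small is a hypothetical placeholder for wherever the downloaded files are extracted:

    from pathlib import Path
    from typing import Iterator, List

    def iter_documents(path: Path) -> Iterator[List[str]]:
        """Yield each document as a list of sentences.

        Assumes one sentence per line, with a blank line marking a
        document boundary, as described for CLUECorpusSmall.
        """
        sentences: List[str] = []
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    sentences.append(line)
                elif sentences:        # blank line ends the current document
                    yield sentences
                    sentences = []
        if sentences:                  # final document may lack a trailing blank line
            yield sentences

    corpus_dir = Path("clue_corpus_small")  # hypothetical extraction directory
    for txt_file in sorted(corpus_dir.glob("**/*.txt")):
        for doc in iter_documents(txt_file):
            pass  # feed `doc` into a tokenizer / pre-training data pipeline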