CLUECorpus2020

CLUECorpus2020 is a high‑quality 100 GB Chinese pre‑training corpus obtained by cleaning the Chinese portion of the Common Crawl corpus. It can be used directly for pre‑training, language modeling, or language generation tasks. A small vocabulary designed for Simplified Chinese NLP tasks has also been released.

Updated 12/2/2020

Description

Dataset Overview

CLUECorpus2020

  • Source & Processing: The Chinese segment of Common Crawl was cleaned to produce a 100 GB high‑quality Chinese pre‑training corpus.
  • Characteristics:
    • Directly usable for pre‑training, language‑modeling, or generative tasks.
    • A small vocabulary dedicated to Simplified‑Chinese NLP tasks is released.
  • Vocabulary Statistics:
    Token Type            Google    CLUE
    Simplified Chinese     11378    5689
    English                 3529    1320
    Numbers                 1179     140
    Special Tokens           106     106
    Other Tokens             959     766
    Total                  21128    8021
  • Experimental Results:
    • BERT‑base models trained on the small dataset are compared, with detailed metrics reported on AFQMC, TNEWS, IFLYTEK, CMNLI, and other tasks.
  • Data Access:
    • Application Process: Applicants must email their research purpose, planned usage, institution, and a brief introduction, and must commit not to redistribute the corpus to third parties.
    • Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 100G Corpus.
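
To put the Vocabulary Statistics above in perspective, the following sketch computes how large the CLUE vocabulary is relative to the Google vocabulary for each listed token type. The counts are copied from the statistics above; the names `VOCAB_STATS` and `clue_to_google_ratio` are illustrative and do not come from the CLUE project.

```python
# Token counts per type as (Google, CLUE), copied from the statistics above.
VOCAB_STATS = {
    "Simplified Chinese": (11378, 5689),
    "English": (3529, 1320),
    "Numbers": (1179, 140),
    "Special Tokens": (106, 106),
    "Other Tokens": (959, 766),
    "Total": (21128, 8021),
}


def clue_to_google_ratio(token_type: str) -> float:
    """Return the CLUE count divided by the Google count for one token type."""
    google, clue = VOCAB_STATS[token_type]
    return clue / google


if __name__ == "__main__":
    # Print one line per token type with the ratio as a percentage.
    for token_type in VOCAB_STATS:
        print(f"{token_type:<20} {clue_to_google_ratio(token_type):.1%}")
```

By these figures, the CLUE vocabulary keeps exactly half of the Simplified Chinese tokens (5689 of 11378) and is roughly 38% of the Google vocabulary overall (8021 of 21128).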

CLUECorpusSmall (14 GB)

  • Use Cases: Suitable for language modeling, pre‑training, or generative tasks.
  • Size: Over 14 GB, nearly 4,000 well‑defined txt files, containing 5 billion characters.
  • Source: Mainly derived from the nlp_chinese_corpus project.
  • Format: One sentence per line, with blank lines separating documents.
  • Sub‑corpora:
    • News Corpus news2016zh_corpus: 8 GB, 2,000 small files.
    • Web Interaction Corpus webText2019zh_corpus: 3 GB, over 900 small files.
    • Wikipedia Corpus wiki2019zh_corpus: ~1.1 GB, about 300 small files.
    • Comments Corpus comments2019zh_corpus: ~2.3 GB, 784 small files.
  • Download Method: Provided via Baidu Cloud with password protection.
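
The format described above (one sentence per line, blank lines separating documents) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: `iter_documents` is a hypothetical helper, not part of any CLUE tooling, and it treats one or more consecutive blank lines as a single document boundary.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of sentences.

    Assumes one sentence per line and blank lines between documents.
    """
    sentences: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            # Non-empty line: one more sentence of the current document.
            sentences.append(sentence)
        elif sentences:
            # Blank line after at least one sentence: the document ends here.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by a blank line.
        yield sentences


if __name__ == "__main__":
    sample = ["第一句。\n", "第二句。\n", "\n", "另一篇文档。\n"]
    for document in iter_documents(sample):
        print(document)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `open(path, encoding="utf-8")`. UTF-8 is an assumption here, as the listing does not state the file encoding. Streaming line by line avoids loading a multi‑gigabyte sub‑corpus into memory.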

Topics

Natural Language Processing
Pretraining Corpus

Source

Organization: github

Created: 11/25/2020
