C4

C4 is a large unlabeled text corpus created by the China Information Processing Laboratory and other institutions and widely used for pre‑training large language models. It contains approximately 400 million cleaned text passages drawn from high‑quality unstructured text sources. During construction, heuristic rules were applied to retain well‑structured, valuable content, and data quality was further improved by generating instructions and rewriting responses. The dataset is used mainly for instruction tuning of large language models, with the aim of improving zero‑shot performance on NLP tasks.

Source

arXiv

Created

Aug 20, 2024

Updated

Aug 20, 2024

Overview

Dataset description and usage context

C4 Dataset Overview

C4 is a large-scale, unlabeled text corpus constructed by the China Information Processing Laboratory and collaborators. It is intended as a resource for pre‑training large language models.

Dataset Contents

Scale: Approximately 400 million cleaned text passages.
Sources: Diverse, high‑quality unstructured text resources.
Cleaning Process: Heuristic rules retain well‑structured, valuable content; instruction generation and response rewriting then further improve quality.
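
The heuristic cleaning step can be illustrated with a minimal Python sketch. The description above does not list the actual rules, so every rule, threshold, and function name below (`MIN_CHARS`, `MAX_SYMBOL_RATIO`, `keep_passage`, `clean_corpus`) is a hypothetical stand-in chosen for illustration, not the dataset authors' pipeline.

```python
# Hypothetical thresholds: the dataset description does not state the real values.
MIN_CHARS = 20            # drop very short fragments
MAX_SYMBOL_RATIO = 0.3    # drop passages dominated by punctuation or markup
TERMINAL_PUNCT = ("。", "！", "？", ".", "!", "?")


def symbol_ratio(text: str) -> float:
    """Share of non-whitespace characters that are not letters, digits, or CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    symbols = sum(1 for c in chars if not c.isalnum())
    return symbols / len(chars)


def keep_passage(text: str) -> bool:
    """Return True if a passage passes all of the illustrative heuristics."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False
    if not text.endswith(TERMINAL_PUNCT):
        return False  # treat a missing sentence ending as a sign of truncation
    if symbol_ratio(text) > MAX_SYMBOL_RATIO:
        return False
    return True


def clean_corpus(passages):
    """Keep only the passages that pass the heuristic filter, in their original order."""
    return [p.strip() for p in passages if keep_passage(p)]
```

Calling `clean_corpus` on a list of raw strings returns the subset that looks like complete, text-dominated sentences. A real pipeline would likely combine many more signals; this sketch only shows the general shape of rule-based filtering.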

Primary Use Cases

Instruction Tuning: Enhances zero‑shot and few‑shot performance of large language models.
Research: Supports various NLP tasks requiring extensive textual knowledge.

Annotation Quality

Expert Involvement: Annotations and quality checks are performed by domain experts to ensure reliability.

Limitations & Risks

Potential Biases: Biases inherited from the source documents may be present; users should account for them when applying the data.

Citation

@article{c4_dataset,
  title   = {C4: A Large-Scale Chinese Unlabeled Text Corpus},
  author  = {China Information Processing Laboratory et al.},
  year    = {2023},
  note    = {Dataset release}
}

Contact

For questions, please contact the China Information Processing Laboratory.
