C4
The C4 dataset, created by the China Information Processing Laboratory and other institutions, is a large unlabeled text corpus widely used for pre‑training large language models. It contains approximately 400 million cleaned text passages drawn from high‑quality unstructured text sources. During construction, heuristic rules were applied to select well‑structured, valuable content, and data quality was further improved by generating instructions and rewriting responses. The dataset is mainly used for instruction tuning of large language models, with the aim of improving zero‑shot performance on NLP tasks.
C4 Dataset Overview
C4 is a large-scale, unlabeled text corpus constructed by the China Information Processing Laboratory and collaborators. It serves as a primary resource for pre‑training large language models.
Dataset Contents
- Scale: Approximately 400 million cleaned text passages.
- Sources: Diverse high‑quality unstructured text resources.
- Cleaning Process: Heuristic rules keep only well‑structured, valuable content; instruction generation and response rewriting are then applied to further improve quality.
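The heuristic filtering step listed above can be illustrated with a minimal sketch. The card does not state which rules were actually used, so the three checks below (minimum length, terminal punctuation, share of word-like characters) and their thresholds are assumptions chosen only for illustration, and the function names are hypothetical.

```python
# Illustrative only: the real C4 cleaning rules are not given in this card.
# All thresholds and rule choices below are assumptions.
MIN_CHARS = 20                      # assumed minimum passage length
MIN_WORDLIKE_RATIO = 0.6            # assumed share of letters/digits/ideographs
TERMINAL_PUNCTUATION = ("。", "！", "？", ".", "!", "?")


def looks_well_structured(passage: str) -> bool:
    """Return True if a passage passes three simple, hypothetical heuristics."""
    text = passage.strip()
    # Rule 1: drop very short fragments such as button labels.
    if len(text) < MIN_CHARS:
        return False
    # Rule 2: keep only passages that end like a complete sentence.
    if not text.endswith(TERMINAL_PUNCTUATION):
        return False
    # Rule 3: drop passages dominated by symbols or whitespace.
    # str.isalnum() is True for letters, digits, and CJK ideographs.
    wordlike = sum(1 for ch in text if ch.isalnum())
    return wordlike / len(text) >= MIN_WORDLIKE_RATIO


def filter_passages(passages):
    """Keep only the passages that pass every heuristic."""
    return [p for p in passages if looks_well_structured(p)]
```

Rules of this kind are cheap to run over hundreds of millions of passages, which is one reason heuristic filters are a common first pass before any more expensive quality step.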
Primary Use Cases
- Instruction Tuning: Enhances zero‑shot and few‑shot performance of large language models.
- Research: Supports various NLP tasks requiring extensive textual knowledge.
Annotation Quality
- Expert Involvement: Annotations and quality checks are performed by domain experts to ensure reliability.
Limitations & Risks
- Potential Biases: Biases inherent in the source documents may carry over into the corpus; users should account for them when applying the data.
Citation
@article{c4_dataset,
title = {C4: A Large-Scale Chinese Unlabeled Text Corpus},
author = {China Information Processing Laboratory et al.},
year = {2023},
note = {Dataset release}
}
Contact
For questions, please contact the China Information Processing Laboratory.