
C4

The C4 dataset, created by the China Information Processing Laboratory and other institutions, is a large unlabeled text corpus widely used for pre‑training large language models. It contains approximately 400 million cleaned text passages drawn from a range of high‑quality unstructured text sources. During creation, heuristic rules were applied to select well‑structured, valuable content, and data quality was further improved by generating instructions and rewriting responses. The C4 dataset is mainly used for instruction tuning of large language models, with the aim of improving zero‑shot performance on NLP tasks.

Updated 8/20/2024
arXiv

Description

C4 Dataset Overview

C4 is a large-scale, unlabeled text corpus constructed by the China Information Processing Laboratory and collaborators. It serves as a primary resource for pre‑training large language models.

Dataset Contents

  • Scale: Approximately 400 million cleaned text passages.
  • Sources: Diverse high‑quality unstructured text resources.
  • Cleaning Process: Heuristic rules filter well‑structured, valuable content; additional instruction generation and response rewriting improve quality.
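
The heuristic filtering step can be pictured with a minimal sketch. The rules and thresholds below (a minimum line length, a terminal-punctuation check, a minimum number of surviving lines) are assumptions chosen for illustration; this page does not state the rules actually used to build the dataset, and the names `keep_line` and `clean_passage` are hypothetical.

```python
# Illustrative only: these rules and thresholds are assumptions,
# not the documented cleaning rules of the dataset.
MIN_CHARS = 20  # hypothetical minimum line length
TERMINAL_PUNCT = (".", "!", "?", "。", "！", "？")


def keep_line(line: str) -> bool:
    """Return True if a line looks like a complete sentence."""
    stripped = line.strip()
    if len(stripped) < MIN_CHARS:
        return False
    return stripped.endswith(TERMINAL_PUNCT)


def clean_passage(text: str, min_lines: int = 2):
    """Keep well-formed lines; return None if too little text survives."""
    kept = [ln.strip() for ln in text.splitlines() if keep_line(ln)]
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


raw = (
    "Home | Login\n"
    "This is a complete sentence about data.\n"
    "Another full sentence follows here."
)
cleaned = clean_passage(raw)
```

Here the navigation line is dropped because it is short and lacks terminal punctuation, while the two full sentences are kept. A passage made up only of such fragments would return `None` and be discarded.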

Primary Use Cases

  • Instruction Tuning: Enhances zero‑shot and few‑shot performance of large language models.
  • Research: Supports various NLP tasks requiring extensive textual knowledge.

Annotation Quality

  • Expert Involvement: Annotations and quality checks are performed by domain experts to ensure reliability.

Limitations & Risks

  • Potential Biases: Biases inherited from the source documents may be present; users should account for them when applying the data.

Citation

@article{c4_dataset,
  title   = {C4: A Large-Scale Chinese Unlabeled Text Corpus},
  author  = {China Information Processing Laboratory et al.},
  year    = {2023},
  note    = {Dataset release}
}

Contact

For questions, please contact the China Information Processing Laboratory.

Topics

Natural Language Processing
Pre‑training Models

Source

Organization: arXiv

Created: 8/20/2024
