C4
The C4 dataset, created by the China Information Processing Laboratory and other institutions, is a large unlabeled text corpus widely used for pre‑training large language models. It contains approximately 400 million cleaned text passages drawn from a variety of high‑quality unstructured text sources. During construction, heuristic rules were applied to select well‑structured, valuable content, and data quality was further improved by generating instructions and rewriting responses. The C4 dataset is mainly used for instruction tuning of large language models, with the aim of improving zero‑shot learning and performance on other NLP tasks.
Description
C4 Dataset Overview
C4 is a large-scale, unlabeled text corpus constructed by the China Information Processing Laboratory and collaborators. It serves as a primary resource for pre‑training massive language models.
Dataset Contents
- Scale: Approximately 400 million cleaned text passages.
- Sources: Diverse high‑quality unstructured text resources.
- Cleaning Process: Heuristic rules filter well‑structured, valuable content; additional instruction generation and response rewriting improve quality.
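The heuristic filtering step described above can be illustrated with a minimal sketch. The dataset card does not state the actual rules, so the three checks below (minimum length, terminal punctuation, symbol ratio), the thresholds, and the function names are all hypothetical and chosen only to show the general shape of such a filter.

```python
# Hypothetical thresholds; the dataset card does not publish the real rules.
MIN_CHARS = 20
MAX_SYMBOL_RATIO = 0.3
TERMINAL_PUNCTUATION = (".", "!", "?", "。", "！", "？")


def looks_well_structured(passage: str) -> bool:
    """Return True if a passage passes three illustrative heuristic checks."""
    text = passage.strip()

    # Check 1: drop passages that are too short to carry useful content.
    if len(text) < MIN_CHARS:
        return False

    # Check 2: keep only passages that end like a complete sentence.
    if not text.endswith(TERMINAL_PUNCTUATION):
        return False

    # Check 3: drop passages dominated by symbols (markup debris, separators).
    # str.isalnum() is True for letters and digits, including CJK characters.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text) <= MAX_SYMBOL_RATIO


def filter_passages(passages: list[str]) -> list[str]:
    """Keep only the passages that pass every heuristic check."""
    return [p for p in passages if looks_well_structured(p)]
```

A call such as `filter_passages(raw_passages)` would return the subset that passes every check. Real pipelines typically combine many more rules, such as deduplication and language identification, which are omitted here.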
Primary Use Cases
- Instruction Tuning: Enhances zero‑shot and few‑shot performance of large language models.
- Research: Supports various NLP tasks requiring extensive textual knowledge.
Annotation Quality
- Expert Involvement: Annotations and quality checks are performed by domain experts to ensure reliability.
Limitations & Risks
- Potential Biases: Biases inherited from the source documents may be present; users should account for this when applying the data.
Citation
@article{c4_dataset,
title = {C4: A Large-Scale Chinese Unlabeled Text Corpus},
author = {China Information Processing Laboratory et al.},
year = {2023},
note = {Dataset release}
}
Contact
For questions, please contact the China Information Processing Laboratory.
Source
Organization: arXiv
Created: 8/20/2024