The C4 dataset, created by the China Information Processing Laboratory and other institutions, is a large unlabeled text corpus widely used for pre-training large language models. It contains approximately 400 million cleaned text passages drawn from a variety of high-quality unstructured text sources. During its construction, heuristic rules were applied to select well-structured, valuable content, and data quality was further improved by generating instructions and rewriting responses. The dataset is mainly used for instruction tuning of large language models, with the aim of improving zero-shot learning and performance on other NLP tasks.