ChineseWebText2.0
ChineseWebText 2.0 is a large‑scale, high‑quality Chinese web‑text dataset containing 3.8 TB of data. Each text carries a quality score, single‑label and multi‑label domain tags, and a toxicity label and score, so LLM researchers can select data with their own quality thresholds. The dataset was constructed and filtered with the MDFG‑tool, which supplies this multidimensional, fine‑grained information for every text.
Description
ChineseWebText 2.0 Dataset Overview
Dataset Summary
- Size: 3.8 TB
- Data Type: Chinese web text
- Characteristics:
  - Includes quality scores
  - Single‑label and multi‑label domain tags
  - Toxicity classification and scores
Data Example
{
  "text": "近日,黑龙江省高校校报协会第十四届学术年会暨校报工作交流研讨会在东北农业大学举行。我校10件新闻作品喜获2项一等奖,2项二等奖,6项三等奖……",
  "domain": {
    "single_label": "news",
    "multi_label": ["news", "education"]
  },
  "toxicity": {
    "label": 0,
    "score": 1.0347155694034882e-05
  },
  "quality_score": 0.96044921875
}
Field Descriptions
- text: Text content
- single_label (inside domain): Highest‑probability domain label from the classification model
- multi_label (inside domain): All domain labels whose probabilities exceed the threshold
- label (inside toxicity): Toxicity label produced by the toxicity classification model
- score (inside toxicity): Toxicity score from the classification model
- quality_score: Quality score generated by the quality assessment model
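The fields above can be combined to select a subset of the data. The following is a minimal sketch, not part of the dataset's tooling: it assumes each record is available as one JSON object per line, and the function name keep_record and the threshold values are hypothetical choices for illustration.

```python
import json

def keep_record(record, min_quality=0.9, max_toxicity=0.01, domains=None):
    """Return True if a record passes the quality, toxicity, and domain filters.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    if record["quality_score"] < min_quality:
        return False
    if record["toxicity"]["score"] > max_toxicity:
        return False
    if domains is not None:
        # multi_label lists every domain whose probability exceeded the threshold
        if not set(record["domain"]["multi_label"]) & set(domains):
            return False
    return True

# One record in the layout of the data example (text shortened to a placeholder)
line = (
    '{"text": "sample text", '
    '"domain": {"single_label": "news", "multi_label": ["news", "education"]}, '
    '"toxicity": {"label": 0, "score": 1.0347155694034882e-05}, '
    '"quality_score": 0.96044921875}'
)
record = json.loads(line)
print(keep_record(record, domains={"education"}))
```

Filtering on multi_label rather than single_label keeps texts that belong to the requested domain even when it is not their top‑ranked one.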
Data Processing Tools
- MDFG‑tool: Toolchain for constructing large‑scale high‑quality Chinese datasets, encompassing coarse‑grained filtering, quality assessment, domain classification, and toxicity evaluation modules.
Data Analysis
Data Removal Rate
- Raw data size: 6.6 TB
- Post‑processing size: 3.8 TB
- Removal rate:
  - Preparation stage: 32.32%
  - Pre‑processing stage: 43.33%
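As a quick check on the two sizes above, the overall reduction by volume can be computed directly. This is only arithmetic on the reported figures; the per‑stage rates may be measured against a different base, which the page does not state.

```python
raw_tb = 6.6    # raw data size reported above
final_tb = 3.8  # post-processing size reported above

# Fraction of the raw volume that did not survive processing
overall_reduction = 1 - final_tb / raw_tb
print(f"{overall_reduction:.2%}")  # 42.42%
```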
Data Quality Distribution
- Quality score intervals:
  - [0.2, 0.4): 18%
  - [0.9, 1.0): 18%
  - [0.1, 0.2): small amount
- Human acceptability:
  - [0.5, 1.0): over 90%
  - [0.1, 0.2): 85%
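A distribution like the one above can be recomputed on any subset by bucketing quality_score values into half‑open intervals. The sketch below assumes ten equal buckets of width 0.1, which is finer than some of the intervals reported here; quality_bucket is a hypothetical helper name and the scores are made‑up inputs.

```python
from collections import Counter

def quality_bucket(score):
    """Map a score in [0, 1] to the label of its 0.1-wide half-open interval."""
    idx = min(int(score * 10), 9)  # a score of exactly 1.0 goes into the last bucket
    return f"[{idx / 10:.1f}, {(idx + 1) / 10:.1f})"

# Illustrative scores only; real values come from the quality_score field
scores = [0.96044921875, 0.35, 0.15, 0.92]
hist = Counter(quality_bucket(s) for s in scores)

for interval, count in sorted(hist.items()):
    print(interval, f"{count / len(scores):.0%}")
```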
Domain Distribution
- Overall distribution:
  - Encyclopedia: 33.43%
  - General: 32.63%
  - News: 28.01%
  - Mathematics: 0.55%
- Quality‑related distribution:
  - Domain distribution within each quality interval mirrors the overall distribution
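Domain shares for a sample of records can be tallied from the single_label field. This is a sketch under assumptions: it counts records, whereas the percentages above may be measured differently, and the label strings other than "news" are hypothetical spellings.

```python
from collections import Counter

def domain_shares(records):
    """Return the fraction of records per single_label domain."""
    counts = Counter(r["domain"]["single_label"] for r in records)
    total = sum(counts.values())
    return {domain: count / total for domain, count in counts.items()}

# Tiny made-up sample; "encyclopedia" and "general" are assumed label spellings
records = [
    {"domain": {"single_label": "news"}},
    {"domain": {"single_label": "news"}},
    {"domain": {"single_label": "encyclopedia"}},
    {"domain": {"single_label": "general"}},
]
shares = domain_shares(records)
print(shares)
```

Running the same tally separately for each quality interval is one way to check whether the per‑interval distribution follows the overall one.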
Data Toxicity Analysis
- Toxicity score distribution:
  - Non‑toxic data: 97.41%
  - Toxic data: 3.16 GB (1,632,620 samples)
Citation
- When using the data or code, please cite the relevant papers.
Source
Organization: huggingface
Created: 11/15/2024