Tags: Dataset asset, Open Source Community, Natural Language Processing, Language Models
ChineseWebText2.0
ChineseWebText 2.0 is a large‑scale, high‑quality Chinese web‑text dataset containing 3.8 TB of data. Each text carries a quality score, single‑label and multi‑label domain tags, and a toxicity classification with its score, so LLM researchers can select subsets by applying quality thresholds of their own. The dataset was constructed and filtered with the MDFG‑tool, which provides both the quality filtering and this multidimensional, fine‑grained annotation.
Source: Hugging Face
Created: Nov 15, 2024
Updated: Nov 27, 2024
ChineseWebText 2.0 Dataset Overview
Dataset Summary
- Size: 3.8 TB
- Data Type: Chinese web text
- Characteristics:
  - Includes quality scores
  - Single‑label and multi‑label domain tags
  - Toxicity classification and scores
Data Example
{
  "text": "近日,黑龙江省高校校报协会第十四届学术年会暨校报工作交流研讨会在东北农业大学举行。我校10件新闻作品喜获2项一等奖,2项二等奖,6项三等奖……",
  "domain": {
    "single_label": "news",
    "multi_label": ["news", "education"]
  },
  "toxicity": {
    "label": 0,
    "score": 1.0347155694034882e-05
  },
  "quality_score": 0.96044921875
}
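A record of this shape can be read with Python's standard `json` module. The sketch below assumes each record is available as a JSON string (the on-disk format is not stated here) and shortens the `text` value to a placeholder; the variable names are illustrative only.

```python
import json

# One record in the shape shown above. The text is shortened for this sketch.
raw = """
{
  "text": "近日,黑龙江省高校校报协会……",
  "domain": {"single_label": "news", "multi_label": ["news", "education"]},
  "toxicity": {"label": 0, "score": 1.0347155694034882e-05},
  "quality_score": 0.96044921875
}
"""

record = json.loads(raw)

# Domain and toxicity values are nested under their parent objects.
primary_domain = record["domain"]["single_label"]
all_domains = record["domain"]["multi_label"]
toxicity_label = record["toxicity"]["label"]
toxicity_score = record["toxicity"]["score"]
quality = record["quality_score"]

print(primary_domain, all_domains, toxicity_label, quality)
```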
Field Descriptions
- text: Text content
- single_label (under domain): Highest‑probability domain label from the classification model
- multi_label (under domain): All domain labels whose probabilities exceed the threshold
- label (under toxicity): Toxicity label produced by the toxicity classification model
- score (under toxicity): Toxicity score from the classification model
- quality_score: Quality score generated by the quality assessment model
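These fields can be combined to select a subset. The following is a minimal sketch, not the MDFG‑tool's own logic: the function name `select_records`, the default thresholds, and the toy records are all hypothetical, and it assumes that a higher toxicity score means more toxic content.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

Record = Dict[str, Any]


def select_records(
    records: Iterable[Record],
    min_quality: float = 0.9,      # hypothetical threshold, choose your own
    max_toxicity: float = 0.01,    # hypothetical threshold, choose your own
    domain: Optional[str] = None,  # keep only records tagged with this domain
) -> Iterator[Record]:
    """Yield records that pass the quality, toxicity, and domain filters."""
    for rec in records:
        if rec["quality_score"] < min_quality:
            continue
        # Assumption: larger toxicity scores indicate more toxic text.
        if rec["toxicity"]["score"] > max_toxicity:
            continue
        # multi_label holds every domain above the classifier's threshold.
        if domain is not None and domain not in rec["domain"]["multi_label"]:
            continue
        yield rec


# Toy records (invented for illustration, not taken from the dataset).
toy = [
    {"text": "a", "domain": {"single_label": "news", "multi_label": ["news", "education"]},
     "toxicity": {"label": 0, "score": 1e-5}, "quality_score": 0.96},
    {"text": "b", "domain": {"single_label": "news", "multi_label": ["news"]},
     "toxicity": {"label": 0, "score": 2e-5}, "quality_score": 0.35},
    {"text": "c", "domain": {"single_label": "general", "multi_label": ["general"]},
     "toxicity": {"label": 1, "score": 0.99}, "quality_score": 0.95},
]

kept = list(select_records(toy, domain="education"))
print([rec["text"] for rec in kept])
```

Because the function is a generator, it can be applied to a stream of records without holding the full dataset in memory, which matters at the 3.8 TB scale.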
Data Processing Tools
- MDFG‑tool: Toolchain for constructing large‑scale high‑quality Chinese datasets, encompassing coarse‑grained filtering, quality assessment, domain classification, and toxicity evaluation modules.
Data Analysis
Data Removal Rate
- Raw data size: 6.6 TB
- Post‑processing size: 3.8 TB
- Removal rate:
  - Preparation stage: 32.32%
  - Pre‑processing stage: 43.33%
Data Quality Distribution
- Quality score intervals:
  - [0.2, 0.4): 18%
  - [0.9, 1.0): 18%
  - [0.1, 0.2): small amount
- Human acceptability:
  - [0.5, 1.0): over 90%
  - [0.1, 0.2): 85%
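Interval shares like those above can be recomputed on any sample of records by bucketing `quality_score` into half‑open intervals. This is a sketch under assumptions: the helper `interval_shares`, the bin edges, and the toy scores are hypothetical and do not reproduce the published figures.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def interval_shares(
    scores: Iterable[float], edges: List[float]
) -> Dict[Tuple[float, float], float]:
    """Return the share of scores falling in each half-open [lo, hi) interval.

    `edges` must be sorted ascending. Scores outside [edges[0], edges[-1])
    count toward the total but are not assigned to any interval.
    """
    counts: Counter = Counter()
    total = 0
    for score in scores:
        total += 1
        # Index of the interval whose lower edge is the last one <= score.
        i = bisect_right(edges, score) - 1
        if 0 <= i < len(edges) - 1:
            counts[(edges[i], edges[i + 1])] += 1
    if total == 0:
        return {}
    return {(lo, hi): counts[(lo, hi)] / total for lo, hi in zip(edges, edges[1:])}


# Hypothetical bin edges and toy scores, for illustration only.
edges = [0.0, 0.1, 0.2, 0.4, 0.5, 0.9, 1.0]
toy_scores = [0.05, 0.15, 0.25, 0.35, 0.95, 0.96, 0.5, 0.7, 0.45, 0.92]

shares = interval_shares(toy_scores, edges)
for (lo, hi), share in shares.items():
    print(f"[{lo}, {hi}): {share:.0%}")
```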
Domain Distribution
- Overall distribution:
  - Encyclopedia: 33.43%
  - General: 32.63%
  - News: 28.01%
  - Mathematics: 0.55%
- Quality‑related distribution:
  - Domain distribution within each quality interval mirrors the overall distribution
Data Toxicity Analysis
- Toxicity score distribution:
  - Non‑toxic data: 97.41%
  - Toxic data: 3.16 GB (1,632,620 samples)
Citation
- When using the data or code, please cite the relevant papers.