ChineseWebText2.0

ChineseWebText 2.0 is a large‑scale, high‑quality Chinese web‑text dataset containing 3.8 TB of data. Each text is accompanied by a quality score, single‑label and multi‑label domain tags, and a toxicity label and score, so LLM researchers can select data using quality thresholds of their own choosing. The dataset was constructed and filtered with MDFG‑tool, which produces these multidimensional, fine‑grained annotations.

Updated 11/27/2024

Description

ChineseWebText 2.0 Dataset Overview

Dataset Summary

  • Size: 3.8 TB
  • Data Type: Chinese web text
  • Characteristics:
    • Includes quality scores
    • Single‑label and multi‑label domain tags
    • Toxicity classification and scores

Data Example

{
  "text": "近日,黑龙江省高校校报协会第十四届学术年会暨校报工作交流研讨会在东北农业大学举行。我校10件新闻作品喜获2项一等奖,2项二等奖,6项三等奖……",
  "domain": {
    "single_label": "news",
    "multi_label": ["news", "education"]
  },
  "toxicity": {
    "label": 0,
    "score": 1.0347155694034882e-05
  },
  "quality_score": 0.96044921875
}

Field Descriptions

  • text: Text content
  • single_label (under domain): Highest‑probability domain label from the domain classification model
  • multi_label (under domain): All domain labels whose probabilities exceed the threshold
  • label (under toxicity): Toxicity label produced by the toxicity classification model
  • score (under toxicity): Toxicity score from the toxicity classification model
  • quality_score: Quality score generated by the quality assessment model
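
The fields above can be combined into a simple selection rule. The following is a minimal sketch, not the dataset authors' tooling: it assumes records are stored one JSON object per line, that toxicity label 0 marks non‑toxic text (as the data example suggests), and it uses hypothetical names (`keep_record`, `select_lines`) and a hypothetical quality threshold of 0.9.

```python
import json


def keep_record(record, min_quality=0.9, domain=None):
    """Return True if a record passes the (hypothetical) selection rules."""
    # Assumption: toxicity label 0 marks non-toxic text, as in the data example.
    if record["toxicity"]["label"] != 0:
        return False
    # Drop records below the chosen quality threshold.
    if record["quality_score"] < min_quality:
        return False
    # Optionally require the domain to appear among the multi-label tags.
    if domain is not None and domain not in record["domain"]["multi_label"]:
        return False
    return True


def select_lines(lines, **rules):
    """Parse JSON lines and yield only the records that pass keep_record."""
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            if keep_record(record, **rules):
                yield record


# One record shaped like the data example (text replaced by a placeholder).
sample = json.dumps({
    "text": "placeholder text",
    "domain": {"single_label": "news", "multi_label": ["news", "education"]},
    "toxicity": {"label": 0, "score": 1.0347155694034882e-05},
    "quality_score": 0.96044921875,
})

selected = list(select_lines([sample], min_quality=0.9, domain="education"))
print(len(selected))
```

Matching on multi_label rather than single_label keeps texts that touch a domain without being dominated by it; switching the check to single_label would give a stricter, single‑domain subset.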

Data Processing Tools

  • MDFG‑tool: Toolchain for constructing large‑scale high‑quality Chinese datasets, encompassing coarse‑grained filtering, quality assessment, domain classification, and toxicity evaluation modules.

Data Analysis

Data Removal Rate

  • Raw data size: 6.6 TB
  • Post‑processing size: 3.8 TB
  • Removal rate:
    • Preparation stage: 32.32%
    • Pre‑processing stage: 43.33%
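
The overall retention implied by the two sizes above can be checked with a short calculation. This sketch uses only the 6.6 TB and 3.8 TB figures; the base against which each per‑stage percentage is measured is not stated here, so the stage rates are deliberately not combined.

```python
raw_tb = 6.6    # raw data size from the list above
final_tb = 3.8  # post-processing size from the list above

retained = final_tb / raw_tb  # fraction of raw data that survives
removed = 1 - retained        # fraction removed across all stages

print(f"retained: {retained:.2%}, removed: {removed:.2%}")
# prints: retained: 57.58%, removed: 42.42%
```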

Data Quality Distribution

  • Quality score intervals:
    • [0.2, 0.4): 18%
    • [0.9, 1.0): 18%
    • [0.1, 0.2): small amount
  • Human acceptability:
    • [0.5, 1.0): over 90%
    • [0.1, 0.2): 85%

Domain Distribution

  • Overall distribution:
    • Encyclopedia: 33.43%
    • General: 32.63%
    • News: 28.01%
    • Mathematics: 0.55%
  • Quality‑related distribution:
    • Domain distribution within each quality interval mirrors the overall distribution

Data Toxicity Analysis

  • Toxicity score distribution:
    • Non‑toxic data: 97.41%
    • Toxic data: 3.16 GB (1,632,620 samples)

Citation

  • When using the data or code, please cite the relevant papers.

Topics

Natural Language Processing
Language Models

Source

Organization: huggingface

Created: 11/15/2024
