IndustryCorpus2
This dataset is a high‑quality corpus for industry model training, covering 31 industry categories with both Chinese and English data: 1 TB of Chinese data and 2.2 TB of English data. The dataset has undergone source upgrades, industry taxonomy updates, semantic quality filtering, and tiered quality processing, resulting in three quality levels (high, medium, low) to suit different model training needs. Its primary aim is to improve industry model performance, facilitating intelligent transformation and innovative development in industry applications.
Description
IndustryCorpus2 Dataset Overview
Basic Information
- License: Apache 2.0
- Languages: Chinese, English
- Data Scale:
- Chinese: 1 TB
- English: 2.2 TB
Updates & Iterations
- Data Sources: Added high‑quality mathematics and code sources such as the Pile, BigCode, and Open‑Web‑Math.
- Industry Taxonomy: Combined the National Bureau of Statistics’ national economic industry classification (20 classes) with the World Knowledge System to redesign industry categories, establishing 31 categories covering most mainstream industries.
- Semantic Quality Filtering: Employed rule‑based + model‑based filtering to substantially raise overall data quality.
- Quality Tiering: Organized data into high, medium, and low tiers based on quality assessment scores to match various model training requirements.
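The tiering step above can be sketched as a simple score-to-tier mapping. This is an illustrative sketch only; the cutoff values and the 0–5 scoring scale are assumptions, not the dataset's actual thresholds.

```python
# Hypothetical sketch of the tiered quality processing described above.
# The cutoffs below are illustrative assumptions, not the real thresholds.
def assign_quality_tier(score: float, high_cutoff: float = 4.0, low_cutoff: float = 2.0) -> str:
    """Map a quality-assessment score (assumed 0-5 scale) to a tier label."""
    if score >= high_cutoff:
        return "high"
    if score >= low_cutoff:
        return "medium"
    return "low"

samples = [{"text": "...", "score": s} for s in (4.5, 3.1, 1.2)]
tiers = [assign_quality_tier(d["score"]) for d in samples]
print(tiers)  # ['high', 'medium', 'low']
```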
Industry Data Distribution
- Total Size: 3,276 GB
- Major Industry Distribution:
- Academic Education: 340.9 GB
- Sports: 262.5 GB
- Politics‑Government‑Administration: 271.5 GB
- Law‑Judiciary: 238.5 GB
- Medicine‑Health‑Psychology‑TCM: 271.7 GB
- Film‑Entertainment: 209.4 GB
Quality Tier Distribution
- Trend: Chinese and English data show similar quality distributions — medium‑quality data is most abundant, followed by high, with low quality minimal. English data has a higher proportion of high‑quality samples.
Category Classification
- Number of Categories: 31
- Data Construction:
- Sources: Pre‑training corpus sampling (90 %) and open‑source text classification data (10 %).
- Labeling: LLMs performed multiple rounds of classification; only samples with consistent judgments across rounds were kept.
- Scale: 36 K entries.
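The consistency filter described above can be illustrated as follows. This is a hypothetical sketch, not the authors' pipeline: it keeps a sample only when every classification round produced the same industry label.

```python
# Illustrative sketch (not the dataset's actual code): keep only samples
# whose predicted industry label is identical across all LLM rounds.
def keep_consistent(samples):
    kept = []
    for text, labels in samples:  # labels: one predicted label per round
        if len(set(labels)) == 1:  # all rounds agree
            kept.append((text, labels[0]))
    return kept

data = [
    ("steel production report", ["manufacturing", "manufacturing", "manufacturing"]),
    ("court ruling summary", ["law", "law", "finance"]),  # inconsistent -> dropped
]
print(keep_consistent(data))  # [('steel production report', 'manufacturing')]
```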
Quality Evaluation
- Low‑Quality Filtering: Extremely low‑quality data were removed, leaving three independent groups (low, medium, high) for targeted model training.
- Construction Details:
- Sources: Random sampling from pre‑training corpora.
- Labeling: Designed scoring rules, multiple LLM rating rounds, selecting samples with rating variance < 2.
- Scale: 20 K rated samples, Chinese‑English ratio 1:1.
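The variance filter above can be sketched like so. The threshold semantics (rating variance < 2) come from the card; everything else — the rating values and the use of population variance — is an illustrative assumption.

```python
# Sketch of the rating-variance filter described above. Assumes population
# variance over per-round LLM ratings; ratings shown are made up.
from statistics import pvariance

def keep_low_variance(samples, max_var=2.0):
    """Keep samples whose LLM ratings vary less than max_var across rounds."""
    return [(text, sum(r) / len(r)) for text, r in samples if pvariance(r) < max_var]

rated = [
    ("doc A", [4, 4, 5]),  # variance ~0.22 -> kept
    ("doc B", [1, 5, 3]),  # variance ~2.67 -> dropped
]
print(keep_low_variance(rated))
```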
Model Training
- Model Choice: 0.5 B‑scale models; bge‑m3 and qwen‑0.5b were compared, and experiments showed bge‑m3 performed best overall.
- Hyper‑parameters: base model bge‑m3, full‑parameter training, lr = 1e‑5, batch size = 64, max_length = 2048.
- Evaluation: On validation set, model and GPT‑4 agreed on sample quality judgments 90 % of the time.
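The agreement figure above can be computed as the fraction of validation samples on which the trained classifier and GPT‑4 assign the same quality judgment. A minimal sketch, with made-up labels:

```python
# Minimal sketch of the agreement metric reported above: share of samples
# where the quality classifier matches GPT-4's judgment. Labels are illustrative.
def agreement_rate(model_labels, reference_labels):
    assert len(model_labels) == len(reference_labels)
    matches = sum(m == r for m, r in zip(model_labels, reference_labels))
    return matches / len(model_labels)

model = ["high", "medium", "low", "high", "medium"]
gpt4  = ["high", "medium", "low", "medium", "medium"]
print(f"{agreement_rate(model, gpt4):.0%}")  # 80%
```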
Benefits of High‑Quality Data Training
- Efficiency: Models trained on high‑quality data reached the performance of 50 B‑token models after only 14 B tokens.
- Effectiveness: Adding filtered high‑quality and instruction data during the annealing phase noticeably improved model performance.
Source
Organization: huggingface
Created: 9/15/2024