
IndustryCorpus2

This dataset is a high-quality corpus for industry model training, covering 31 industry categories in both Chinese and English: 1 TB of Chinese data and 2.2 TB of English data. The dataset has undergone source upgrades, an industry taxonomy update, semantic quality filtering, and tiered quality processing, yielding three quality levels (high, medium, low) to suit different model training needs. Its primary aim is to improve industry model performance and support intelligent transformation and innovation in industry applications.

Source
huggingface
Created
Sep 15, 2024
Updated
Sep 23, 2024
Overview

Dataset description and usage context

IndustryCorpus2 Dataset Overview

Basic Information

  • License: Apache 2.0
  • Languages: Chinese, English
  • Data Scale (a streaming-load sketch follows this list):
    • Chinese: 1 TB
    • English: 2.2 TB
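
Because the corpus totals more than 3 TB, streaming access is usually more practical than a full download. A minimal loading sketch, assuming the dataset is hosted under the Hugging Face repo id BAAI/IndustryCorpus2 (check the dataset card for the actual id and whether a per-industry or per-language config is required):

```python
from datasets import load_dataset

# Repo id "BAAI/IndustryCorpus2" is an assumption; a config name or data_dir
# for a specific industry/language subset may be required -- see the card.
ds = load_dataset("BAAI/IndustryCorpus2", split="train", streaming=True)

# Peek at a few records to learn the schema (text, industry label, quality tier, ...).
for i, sample in enumerate(ds):
    print(sample)
    if i >= 2:
        break
```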

Updates & Iterations

  • Data Sources: Added high-quality mathematical and code sources such as Pile, BigCode, and Open-Web-Math.
  • Industry Taxonomy: Combined the National Bureau of Statistics’ national economic industry classification (20 classes) with the World Knowledge System to redesign industry categories, establishing 31 categories covering most mainstream industries.
  • Semantic Quality Filtering: Employed rule-based plus model-based filtering to substantially raise overall data quality (see the sketch after this list).
  • Quality Tiering: Organized data into high, medium, and low tiers based on quality assessment scores to match various model training requirements.
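
A minimal sketch of how a rules-plus-model filter of this kind is typically structured; the thresholds and the quality_model interface are illustrative assumptions, not the actual IndustryCorpus2 pipeline:

```python
def passes_rules(text: str) -> bool:
    """Cheap heuristic filters applied before any model-based scoring."""
    if len(text) < 200:  # too short to carry useful content
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.6:  # mostly symbols, markup, or boilerplate
        return False
    lines = text.splitlines()
    if len(set(lines)) < 0.5 * len(lines):  # heavy line-level duplication
        return False
    return True

def keep(text: str, quality_model) -> bool:
    # Run the (expensive) model-based scorer only on rule survivors.
    # quality_model.predict is a hypothetical scorer: 0/1/2 = low/medium/high.
    return passes_rules(text) and quality_model.predict(text) >= 1
```

Running the cheap rules first keeps the costly model pass off the bulk of obviously bad documents.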

Industry Data Distribution

  • Total Size: 3,276 GB
  • Major Industry Distribution:
    • Academic Education: 340.9 GB
    • Sports: 262.5 GB
    • Politics‑Government‑Administration: 271.5 GB
    • Law‑Judiciary: 238.5 GB
    • Medicine‑Health‑Psychology‑TCM: 271.7 GB
    • Film‑Entertainment: 209.4 GB

Quality Tier Distribution

  • Trend: Chinese and English data show similar quality distributions, with medium quality most abundant, high quality next, and low quality minimal; English data has a higher proportion of high-quality samples.

Category Classification

  • Number of Categories: 31
  • Data Construction:
    • Sources: Pre-training corpus sampling (90%) and open-source text classification data (10%).
    • Labeling: LLMs classified each sample over multiple rounds, and only samples with consistent judgments across rounds were kept (see the sketch after this list).
    • Scale: 36K entries.
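
A minimal sketch of the consistency filter behind that labeling step: classify each sample several times and keep it only when every round returns the same label. classify_industry is a hypothetical wrapper around an LLM call, and the round count is an assumption:

```python
from collections import Counter

ROUNDS = 3  # independent LLM classification passes per sample (assumption)

def label_with_consensus(text: str, classify_industry) -> str | None:
    """Return the industry label only if every round agrees; None drops the sample."""
    labels = [classify_industry(text) for _ in range(ROUNDS)]
    top, count = Counter(labels).most_common(1)[0]
    return top if count == ROUNDS else None
```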

Quality Evaluation

  • Low-Quality Filtering: Extremely low-quality data were removed, and the remainder was split into three independent tiers (low, medium, high) for targeted model training.
  • Construction Details:
    • Sources: Random sampling from pre‑training corpora.
    • Labeling: Scoring rules were designed, each sample was rated by an LLM over multiple rounds, and only samples with rating variance < 2 were kept (see the sketch after this list).
    • Scale: 20K rated samples at a 1:1 Chinese-English ratio.
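
A minimal sketch of the rating-consistency filter: score each sample over several rounds and keep it only when the ratings are stable. rate_quality is a hypothetical LLM scoring call, and the round count is an assumption:

```python
from statistics import mean, pvariance

ROUNDS = 3  # LLM scoring passes per sample (assumption)

def stable_score(text: str, rate_quality) -> float | None:
    """Average the ratings, but only when the raters agree (variance < 2)."""
    scores = [rate_quality(text) for _ in range(ROUNDS)]
    if pvariance(scores) >= 2:
        return None  # ratings too noisy; discard the sample
    return mean(scores)
```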

Model Training

  • Model Choice: 0.5B-scale models were compared (bge-m3 vs. qwen-0.5b); experiments showed bge-m3 performed best overall.
  • Hyper-parameters: base bge-m3, full-parameter training, lr = 1e-5, batch size = 64, max_length = 2048 (see the fine-tuning sketch after this list).
  • Evaluation: On the validation set, the model agreed with GPT-4's sample-quality judgments 90% of the time.
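
A minimal sketch of such a run with the Hugging Face Trainer, treating quality rating as three-class sequence classification on top of the bge-m3 encoder. Only the model id and the listed hyper-parameters come from the card; the placeholder dataset, epoch count, and newly initialized classification head are assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# Three labels: 0 = low, 1 = medium, 2 = high quality. The classification head
# is newly initialized on top of the bge-m3 encoder; all parameters are trained.
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-m3", num_labels=3)

# Placeholder standing in for the 20K rated samples (text + quality label).
labeled = Dataset.from_dict({"text": ["example document ..."], "label": [2]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

args = TrainingArguments(
    output_dir="quality-rater",
    learning_rate=1e-5,              # lr from the card
    per_device_train_batch_size=64,  # batch size from the card
    num_train_epochs=1,              # epoch count not stated in the card
)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=labeled.map(tokenize, batched=True))
trainer.train()
```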

Benefits of High‑Quality Data Training

  • Efficiency: Models trained on high-quality data matched the performance of models trained on 50B tokens after only 14B tokens.
  • Effectiveness: Adding filtered high‑quality and instruction data during the annealing phase noticeably improved model performance.