
IndustryCorpus2

This dataset is a high-quality corpus for industry model training, covering 31 industry categories in both Chinese and English: 1 TB of Chinese data and 2.2 TB of English data. The dataset has undergone source upgrades, an industry taxonomy update, semantic quality filtering, and tiered quality processing, yielding three quality levels (high, medium, low) to suit different model training needs. Its primary aim is to improve industry model performance and support intelligent transformation and innovation in industry applications.

Source
huggingface
Created
Sep 15, 2024
Updated
Sep 23, 2024
Overview

Dataset description and usage context

IndustryCorpus2 Dataset Overview

Basic Information

  • License: Apache 2.0
  • Languages: Chinese, English
  • Data Scale (a streaming-load sketch follows this list):
    • Chinese: 1 TB
    • English: 2.2 TB
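
Because the corpus totals more than 3 TB, streaming access is usually more practical than a full download. A minimal loading sketch, assuming the dataset is hosted under the Hugging Face repo id BAAI/IndustryCorpus2 (check the dataset card for the actual id and whether a per-industry or per-language config is required):

```python
from datasets import load_dataset

# Repo id "BAAI/IndustryCorpus2" is an assumption; a config name or data_dir
# for a specific industry/language subset may be required -- see the card.
ds = load_dataset("BAAI/IndustryCorpus2", split="train", streaming=True)

# Peek at a few records to learn the schema (text, industry label, quality tier, ...).
for i, sample in enumerate(ds):
    print(sample)
    if i >= 2:
        break
```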

Updates & Iterations

  • Data Sources: Added high-quality mathematical and code sources such as Pile, BigCode, and Open-Web-Math.
  • Industry Taxonomy: Combined the National Bureau of Statistics’ national economic industry classification (20 classes) with the World Knowledge System to redesign industry categories, establishing 31 categories covering most mainstream industries.
  • Semantic Quality Filtering: Employed rule-based plus model-based filtering to substantially raise overall data quality (see the sketch after this list).
  • Quality Tiering: Organized data into high, medium, and low tiers based on quality assessment scores to match various model training requirements.
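
A minimal sketch of how a rules-plus-model filter of this kind is typically structured; the thresholds and the quality_model interface are illustrative assumptions, not the actual IndustryCorpus2 pipeline:

```python
def passes_rules(text: str) -> bool:
    """Cheap heuristic filters applied before any model-based scoring."""
    if len(text) < 200:  # too short to carry useful content
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.6:  # mostly symbols, markup, or boilerplate
        return False
    lines = text.splitlines()
    if len(set(lines)) < 0.5 * len(lines):  # heavy line-level duplication
        return False
    return True

def keep(text: str, quality_model) -> bool:
    # Run the (expensive) model-based scorer only on rule survivors.
    # quality_model.predict is a hypothetical scorer: 0/1/2 = low/medium/high.
    return passes_rules(text) and quality_model.predict(text) >= 1
```

Running the cheap rules first keeps the costly model pass off the bulk of obviously bad documents.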

Industry Data Distribution

  • Total Size: 3,276 GB
  • Major Industry Distribution:
    • Academic Education: 340.9 GB
    • Sports: 262.5 GB
    • Politics‑Government‑Administration: 271.5 GB
    • Law‑Judiciary: 238.5 GB
    • Medicine‑Health‑Psychology‑TCM: 271.7 GB
    • Film‑Entertainment: 209.4 GB

Quality Tier Distribution

  • Trend: Chinese and English data show similar quality distributions, with medium quality most abundant, high quality next, and low quality minimal; English data has a higher proportion of high-quality samples.

Category Classification

  • Number of Categories: 31
  • Data Construction:
    • Sources: Pre-training corpus sampling (90%) and open-source text classification data (10%).
    • Labeling: LLMs classified each sample over multiple rounds, and only samples with consistent judgments across rounds were kept (see the sketch after this list).
    • Scale: 36K entries.
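
A minimal sketch of the consistency filter behind that labeling step: classify each sample several times and keep it only when every round returns the same label. classify_industry is a hypothetical wrapper around an LLM call, and the round count is an assumption:

```python
from collections import Counter

ROUNDS = 3  # independent LLM classification passes per sample (assumption)

def label_with_consensus(text: str, classify_industry) -> str | None:
    """Return the industry label only if every round agrees; None drops the sample."""
    labels = [classify_industry(text) for _ in range(ROUNDS)]
    top, count = Counter(labels).most_common(1)[0]
    return top if count == ROUNDS else None
```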

Quality Evaluation

  • Low-Quality Filtering: Extremely low-quality data were removed, and the remainder was split into three independent tiers (low, medium, high) for targeted model training.
  • Construction Details:
    • Sources: Random sampling from pre‑training corpora.
    • Labeling: Scoring rules were designed, each sample was rated by an LLM over multiple rounds, and only samples with rating variance < 2 were kept (see the sketch after this list).
    • Scale: 20K rated samples at a 1:1 Chinese-English ratio.
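
A minimal sketch of the rating-consistency filter: score each sample over several rounds and keep it only when the ratings are stable. rate_quality is a hypothetical LLM scoring call, and the round count is an assumption:

```python
from statistics import mean, pvariance

ROUNDS = 3  # LLM scoring passes per sample (assumption)

def stable_score(text: str, rate_quality) -> float | None:
    """Average the ratings, but only when the raters agree (variance < 2)."""
    scores = [rate_quality(text) for _ in range(ROUNDS)]
    if pvariance(scores) >= 2:
        return None  # ratings too noisy; discard the sample
    return mean(scores)
```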

Model Training

  • Model Choice: 0.5B-scale models were compared (bge-m3 vs. qwen-0.5b); experiments showed bge-m3 performed best overall.
  • Hyper-parameters: base bge-m3, full-parameter training, lr = 1e-5, batch size = 64, max_length = 2048 (see the fine-tuning sketch after this list).
  • Evaluation: On the validation set, the model agreed with GPT-4's sample-quality judgments 90% of the time.
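
A minimal sketch of such a run with the Hugging Face Trainer, treating quality rating as three-class sequence classification on top of the bge-m3 encoder. Only the model id and the listed hyper-parameters come from the card; the placeholder dataset, epoch count, and newly initialized classification head are assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# Three labels: 0 = low, 1 = medium, 2 = high quality. The classification head
# is newly initialized on top of the bge-m3 encoder; all parameters are trained.
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-m3", num_labels=3)

# Placeholder standing in for the 20K rated samples (text + quality label).
labeled = Dataset.from_dict({"text": ["example document ..."], "label": [2]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

args = TrainingArguments(
    output_dir="quality-rater",
    learning_rate=1e-5,              # lr from the card
    per_device_train_batch_size=64,  # batch size from the card
    num_train_epochs=1,              # epoch count not stated in the card
)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=labeled.map(tokenize, batched=True))
trainer.train()
```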

Benefits of High‑Quality Data Training

  • Efficiency: Models trained on high-quality data matched the performance of models trained on 50B tokens after only 14B tokens.
  • Effectiveness: Adding filtered high‑quality and instruction data during the annealing phase noticeably improved model performance.