IndustryCorpus2
Data Processing, Industry Model Training
This dataset is a high-quality corpus for training industry-specific models. It covers 31 industry categories in Chinese and English: 1 TB of Chinese data and 2.2 TB of English data. The data has undergone source upgrades, industry taxonomy updates, semantic quality filtering, and tiered quality processing, which yields three quality levels (high, medium, low) to suit different model training needs. Its primary aim is to improve the performance of industry models and to support intelligent transformation and innovation in industry applications.
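As a rough illustration of how the three quality levels might be used when assembling a training mix, the sketch below filters a handful of in-memory records by minimum quality tier, industry, and language. The field names (`text`, `language`, `industry`, `quality`), the industry labels, and the `select` helper are all hypothetical and chosen for illustration only; they are not taken from the dataset's actual schema, which should be checked before adapting this.

```python
from collections import Counter

# Hypothetical sample records. The field names and industry labels are
# assumptions for illustration, not the dataset's real schema.
records = [
    {"text": "sample a", "language": "zh", "industry": "finance", "quality": "high"},
    {"text": "sample b", "language": "en", "industry": "finance", "quality": "medium"},
    {"text": "sample c", "language": "en", "industry": "medical", "quality": "low"},
    {"text": "sample d", "language": "zh", "industry": "medical", "quality": "high"},
    {"text": "sample e", "language": "en", "industry": "law", "quality": "medium"},
]

# Order the three tiers so that a minimum tier can be used as a threshold.
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def select(records, min_quality="medium", industries=None, languages=None):
    """Keep records at or above min_quality, optionally restricted
    to a set of industries and/or a set of languages."""
    threshold = QUALITY_RANK[min_quality]
    return [
        r
        for r in records
        if QUALITY_RANK[r["quality"]] >= threshold
        and (industries is None or r["industry"] in industries)
        and (languages is None or r["language"] in languages)
    ]


# Count how many records fall into each quality tier.
tier_counts = Counter(r["quality"] for r in records)

# Example selections: a broad mix versus a strict Chinese-only subset.
broad_mix = select(records, min_quality="medium")
strict_zh = select(records, min_quality="high", languages={"zh"})
```

A lower threshold trades quality for volume, which may suit large-scale pretraining, while a high-only subset may be a better fit for smaller fine-tuning runs.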
Source: Hugging Face. Updated Sep 23, 2024.