IndustryCorpus_automobile
This dataset was built to address three common problems with industry‑specific training data: insufficient volume, low quality, and a lack of domain expertise. Applying 22 industry data processing operators to more than 100 TB of open‑source data yielded a high‑quality, 3.4 TB multi‑industry Chinese‑English pre‑training dataset. The filtered data consists of 1 TB of Chinese text and 2.4 TB of English text, and the Chinese portion is annotated with 12 label types. The dataset covers 18 industry categories (e.g., medical, education, literature, finance) and was produced with rule‑based and model‑based filtering followed by document‑level deduplication. It is partitioned into 18 industry‑specific subsets; the description below pertains to the automotive subset.
Dataset Overview
Dataset Description
- Language: Chinese and English
- Data Size: 1 TB Chinese data, 2.4 TB English data
- Task Category: Text Generation
- Industry Categories: 18 industry categories, including Medical, Education, Literature, Finance, Tourism, Law, Sports, Automotive, News, etc.
Data Processing
- Data Sources: Filtered from over 100 TB of open‑source corpora, including WuDaoCorpora, BAAI‑CCI, RedPajama, and SkyPile‑150B
- Data Processing Operations: Applied 22 industry‑specific operators for cleaning and filtering
- Rule‑based Filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.
- Model‑based Filtering: Industry classification language models with a reported accuracy of 80%
- Deduplication: Document‑level deduplication using MinHash
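
The rule‑based cleaning and MinHash deduplication steps listed above can be sketched in a few lines of standard‑library Python. This is a minimal illustration, not the dataset's actual operators: the regular expressions, the 5‑character shingle size, the 64 hash functions, the 0.8 similarity threshold, and every function name below are assumptions made for the example.

```python
import hashlib
import re

# Hypothetical patterns: the real operators used for the corpus are not shown in this card.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean(text: str) -> str:
    """Strip links, emails, and IPv4 addresses, then collapse whitespace."""
    for pattern in (URL_RE, EMAIL_RE, IPV4_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams, which work for both Chinese and English text."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash(text: str, num_perm: int = 64) -> tuple[int, ...]:
    """MinHash signature: the minimum of each seeded hash over the shingle set."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if no previously kept document is a near duplicate."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(clean(doc))
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents, so it only suits small samples. At corpus scale, MinHash signatures are normally bucketed with locality‑sensitive hashing so that only candidate pairs are compared.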
Data Annotation
- Chinese Data Labels: 12 label types such as alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.
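
Three of the labels named above can be computed directly from the text; the others (language confidence, perplexity, toxic character ratio) need a model or a word list and are left out. The sketch below is illustrative only: the field names and the exact definitions, in particular counting ASCII letters and digits over non‑whitespace characters for the alphanumeric ratio, are assumptions rather than the dataset's documented formulas.

```python
def quality_labels(text: str) -> dict[str, float]:
    """Compute simple per-document statistics (assumed definitions)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # ignore blank lines
    lengths = [len(ln) for ln in lines]
    visible = [ch for ch in text if not ch.isspace()]
    # Assumption: only ASCII letters and digits count as "alphanumeric",
    # so Chinese characters lower the ratio.
    alnum = sum(ch.isascii() and ch.isalnum() for ch in visible)
    return {
        "alnum_ratio": alnum / len(visible) if visible else 0.0,
        "avg_line_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_line_length": max(lengths, default=0),
    }
```

Labels like these are typically used as filter thresholds, for example dropping documents whose alphanumeric ratio is unusually high for Chinese text.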
Dataset Performance Validation
- Model Training: Continual pre‑training, supervised fine‑tuning (SFT), and direct preference optimization (DPO)
- Performance Gains: Target performance improved by 20%, with a subjective win rate of 82%
Industry Data Size
| Industry Category | Data Size (GB) | Industry Category | Data Size (GB) |
|---|---|---|---|
| Programming | 4.1 | Politics | 326.4 |
| Law | 274.6 | Mathematics | 5.9 |
| Education | 458.1 | Sports | 442.0 |
| Finance | 197.8 | Literature | 179.3 |
| Computer Science | 46.9 | News | 564.1 |
| Technology | 333.6 | Film & TV | 162.1 |
| Tourism | 82.5 | Medicine | 189.4 |
| Agriculture | 41.6 | Automotive | 40.8 |
| Sentiment | 31.7 | Artificial Intelligence | 5.6 |
| Total (GB) | 3386.5 | | |
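
As a quick sanity check, the per‑industry sizes in the table can be summed to confirm the stated total and to see how small the automotive subset is relative to the whole. The values are copied from the table; the variable names are only for illustration.

```python
# Sizes in GB, copied from the table above.
sizes_gb = {
    "Programming": 4.1, "Politics": 326.4, "Law": 274.6, "Mathematics": 5.9,
    "Education": 458.1, "Sports": 442.0, "Finance": 197.8, "Literature": 179.3,
    "Computer Science": 46.9, "News": 564.1, "Technology": 333.6,
    "Film & TV": 162.1, "Tourism": 82.5, "Medicine": 189.4,
    "Agriculture": 41.6, "Automotive": 40.8, "Sentiment": 31.7,
    "Artificial Intelligence": 5.6,
}

total_gb = round(sum(sizes_gb.values()), 1)  # rounding absorbs float noise
automotive_share = sizes_gb["Automotive"] / total_gb * 100

print(f"{len(sizes_gb)} categories, {total_gb} GB in total")
print(f"Automotive share: {automotive_share:.1f}%")
```

The sum matches the 3386.5 GB total row, and the automotive subset accounts for roughly 1.2% of the corpus.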
Dataset Split
- Splitting Method: The large dataset is divided into 18 industry‑specific subsets; the current description refers to the automotive subset.