IndustryCorpus_automobile
This dataset was constructed to address shortcomings in industry‑specific training data: insufficient volume, low quality, and a lack of domain expertise. Applying 22 industry data processing operators to over 100 TB of open‑source data yielded a high‑quality 3.4 TB multi‑industry Chinese‑English pre‑training dataset. The filtered data consists of 1 TB of Chinese text and 2.4 TB of English text, with the Chinese portion annotated with 12 label types. The dataset covers 18 industry categories (e.g., medical, education, literature, finance) and has undergone rule‑based and model‑based filtering as well as document‑level deduplication. It is partitioned into 18 industry‑specific subsets; the description below pertains to the automotive subset.
Description
Dataset Overview
Dataset Description
- Language: Chinese and English
- Data Size: 1 TB Chinese data, 2.4 TB English data
- Task Category: Text Generation
- Industry Categories: 18 industry categories, including Medical, Education, Literature, Finance, Tourism, Law, Sports, Automotive, News, etc.
Data Processing
- Data Sources: Filtered from over 100 TB of open‑source corpora, including WuDaoCorpora, BAAI‑CCI, RedPajama, and SkyPile‑150B
- Data Processing Operations: Applied 22 industry‑specific operators for cleaning and filtering
- Rule‑based Filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.
- Model‑based Filtering: Industry classification language models with an accuracy of 80%
- Deduplication: Document‑level deduplication using MinHash
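
The rule‑based cleaning and MinHash deduplication steps listed above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual operators: the regular expressions, the character 5‑gram shingles, the 64 hash functions, the 0.8 similarity threshold, and every function name are assumptions chosen for the example.

```python
import hashlib
import re

# Illustrative patterns only (assumed); the dataset's real operators may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean_text(text: str) -> str:
    """Remove links, emails, and IPv4 addresses, then collapse repeated spaces."""
    for pattern in (URL_RE, EMAIL_RE, IPV4_RE):
        text = pattern.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams, which work for both Chinese and English text."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> tuple[int, ...]:
    """Keep the minimum hash value of the shingle set under each seeded hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first document of each near-duplicate group (O(n^2), for clarity)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

At corpus scale the pairwise comparison would normally be replaced by locality‑sensitive hashing over bands of the signature; the quadratic loop here only keeps the idea readable.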
Data Annotation
- Chinese Data Labels: 12 label types such as alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.
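
Three of the labels named above can be computed from the raw text alone. The sketch below is one possible reading of them, not the dataset's published definitions: the function name, the dictionary keys, and the choice to define the alphanumeric ratio as ASCII letters and digits over non‑whitespace characters are all assumptions. Labels such as perplexity and language confidence require a trained model and are left out.

```python
def quality_labels(text: str) -> dict[str, float]:
    """Compute simple per-document quality labels (assumed definitions)."""
    lines = text.splitlines() or [""]
    line_lengths = [len(line) for line in lines]

    non_space = [ch for ch in text if not ch.isspace()]
    # str.isalnum() is also True for Chinese characters, so restrict to ASCII.
    alnum = sum(ch.isascii() and ch.isalnum() for ch in non_space)

    return {
        "alnum_ratio": alnum / len(non_space) if non_space else 0.0,
        "avg_line_length": sum(line_lengths) / len(line_lengths),
        "max_line_length": max(line_lengths),
    }


labels = quality_labels("abc 123\n汽车")
```

Labels like these are typically stored alongside each document so that downstream users can apply their own thresholds instead of relying on a single fixed filter.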
Dataset Performance Validation
- Model Training: Conducted continual pre‑training, SFT, and DPO training
- Performance Gains: Target performance improved by 20%; subjective win rate reached 82%
Industry Data Size
| Industry Category | Data Size (GB) | Industry Category | Data Size (GB) |
|---|---|---|---|
| Programming | 4.1 | Politics | 326.4 |
| Law | 274.6 | Mathematics | 5.9 |
| Education | 458.1 | Sports | 442 |
| Finance | 197.8 | Literature | 179.3 |
| Computer Science | 46.9 | News | 564.1 |
| Technology | 333.6 | Film & TV | 162.1 |
| Tourism | 82.5 | Medicine | 189.4 |
| Agriculture | 41.6 | Automotive | 40.8 |
| Sentiment | 31.7 | Artificial Intelligence | 5.6 |
| Total (GB) | 3386.5 | | |
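
As a quick consistency check, the per‑industry sizes in the table can be summed and the automotive share derived from them. The values are copied from the table; the variable names are illustrative only.

```python
# Sizes in GB, copied from the table above.
sizes_gb = {
    "Programming": 4.1, "Politics": 326.4,
    "Law": 274.6, "Mathematics": 5.9,
    "Education": 458.1, "Sports": 442.0,
    "Finance": 197.8, "Literature": 179.3,
    "Computer Science": 46.9, "News": 564.1,
    "Technology": 333.6, "Film & TV": 162.1,
    "Tourism": 82.5, "Medicine": 189.4,
    "Agriculture": 41.6, "Automotive": 40.8,
    "Sentiment": 31.7, "Artificial Intelligence": 5.6,
}

total_gb = round(sum(sizes_gb.values()), 1)            # matches the table's total row
automotive_share = sizes_gb["Automotive"] / total_gb   # fraction of the full corpus

print(f"{len(sizes_gb)} categories, {total_gb} GB total")
print(f"Automotive share: {automotive_share:.2%}")
```

The automotive subset is therefore a little over 1% of the full corpus by size.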
Dataset Split
- Splitting Method: The large dataset is divided into 18 industry‑specific subsets; the current description refers to the automotive subset.
Source
Organization: huggingface
Created: 7/25/2024