IndustryCorpus_automobile

This dataset was built to address three shortcomings of industry-specific training data: insufficient volume, low quality, and a lack of domain expertise. Applying 22 industry data processing operators to more than 100 TB of open-source data yielded a high-quality, 3.4 TB multi-industry Chinese-English pre-training corpus: 1 TB of Chinese text and 2.4 TB of English text, with the Chinese portion annotated with 12 label types. The corpus covers 18 industry categories (e.g., medical, education, literature, finance) and was processed with rule-based filtering, model-based filtering, and document-level deduplication. It is partitioned into 18 industry-specific subsets; this description covers the automotive subset.

Updated 7/26/2024

Description

Dataset Overview

Dataset Description

Language: Chinese and English
Data Size: 1 TB Chinese data, 2.4 TB English data
Task Category: Text Generation
Industry Categories: 18 industry categories, including Medical, Education, Literature, Finance, Tourism, Law, Sports, Automotive, News, etc.

Data Processing

Data Sources: Filtered from over 100 TB of open‑source corpora, including WuDaoCorpora, BAAI‑CCI, redpajama, SkyPile‑150B
Data Processing Operations: Applied 22 industry‑specific operators for cleaning and filtering
Rule‑based Filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.
Model-based Filtering: Industry classification language models with 80% accuracy
Deduplication: Document‑level deduplication using MinHash
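
The rule-based cleanup and MinHash deduplication steps listed above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual operators: the regular expressions, the function names (`clean_text`, `minhash_signature`, `deduplicate`), the character 5-gram shingles, the 64 hash functions, and the 0.8 similarity threshold are all assumptions chosen for the example.

```python
import hashlib
import re

# Illustrative patterns only; the real operators used for the dataset may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean_text(text: str) -> str:
    """Remove links, e-mail addresses and IPv4 addresses, then tidy spaces."""
    for pattern in (URL_RE, EMAIL_RE, IPV4_RE):
        text = pattern.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams, which work for both Chinese and English text."""
    text = re.sub(r"\s+", " ", text.strip())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> tuple[int, ...]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return tuple(signature)


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept


raw = [
    "Contact sales@example.com or visit https://example.com/cars from 192.168.0.1 today",
    "Contact sales@example.com or visit https://example.com/cars from 192.168.0.1 today",
    "Tire pressure should be checked monthly.",
]
cleaned = deduplicate([clean_text(doc) for doc in raw])
print(cleaned)
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents and is only meant to show the idea; at corpus scale, MinHash signatures are normally bucketed with locality-sensitive hashing so that only candidate pairs are compared.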

Data Annotation

Chinese Data Labels: 12 label types such as alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.
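
A few of these labels are simple text statistics. The sketch below shows one plausible way to compute three of them; the exact definitions used for the dataset are not given here, so the function name `quality_labels`, the key names, and the choice to count ASCII letters and digits over non-whitespace characters are assumptions.

```python
def quality_labels(text: str) -> dict[str, float]:
    """Compute illustrative per-document statistics (definitions are assumed)."""
    # Ignore blank lines when measuring line lengths.
    lines = [line for line in text.splitlines() if line.strip()]
    line_lengths = [len(line) for line in lines]

    # Alphanumeric ratio: ASCII letters and digits over all non-whitespace characters.
    non_space = [ch for ch in text if not ch.isspace()]
    alnum = sum(ch.isascii() and ch.isalnum() for ch in non_space)

    return {
        "alnum_ratio": alnum / len(non_space) if non_space else 0.0,
        "avg_line_length": sum(line_lengths) / len(line_lengths) if line_lengths else 0.0,
        "max_line_length": max(line_lengths, default=0),
    }


labels = quality_labels("ABC 123\n汽车\n")
print(labels)
```

Labels such as language confidence score and perplexity require a language-identification model and a language model respectively, so they are left out of this sketch.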

Dataset Performance Validation

Model Training: Continual pre-training, SFT, and DPO training
Performance Gains: Objective performance improved by 20%; subjective win rate of 82%

Industry Data Size

Industry Category	Data Size (GB)	Industry Category	Data Size (GB)
Programming	4.1	Politics	326.4
Law	274.6	Mathematics	5.9
Education	458.1	Sports	442
Finance	197.8	Literature	179.3
Computer Science	46.9	News	564.1
Technology	333.6	Film & TV	162.1
Tourism	82.5	Medicine	189.4
Agriculture	41.6	Automotive	40.8
Sentiment	31.7	Artificial Intelligence	5.6
Total (GB)	3386.5
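
As a quick consistency check, the per-industry sizes in the table can be summed and compared with the stated total. The snippet below only re-enters the figures from the table; the variable names are illustrative.

```python
# Sizes in GB, copied from the table above.
SIZES_GB = {
    "Programming": 4.1, "Politics": 326.4,
    "Law": 274.6, "Mathematics": 5.9,
    "Education": 458.1, "Sports": 442.0,
    "Finance": 197.8, "Literature": 179.3,
    "Computer Science": 46.9, "News": 564.1,
    "Technology": 333.6, "Film & TV": 162.1,
    "Tourism": 82.5, "Medicine": 189.4,
    "Agriculture": 41.6, "Automotive": 40.8,
    "Sentiment": 31.7, "Artificial Intelligence": 5.6,
}

# Round to one decimal place to absorb floating-point noise in the sum.
total_gb = round(sum(SIZES_GB.values()), 1)
automotive_share = SIZES_GB["Automotive"] / total_gb

print(total_gb)                    # 3386.5
print(f"{automotive_share:.1%}")   # 1.2%
```

The sum matches the listed total of 3386.5 GB, which is consistent with the roughly 3.4 TB figure quoted for the full corpus; the automotive subset accounts for about 1.2% of it.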

Dataset Split

Splitting Method: The large dataset is divided into 18 industry‑specific subsets; the current description refers to the automotive subset.

Topics

Automotive Industry

Data Processing

Source

Organization: huggingface

Created: 7/25/2024
