IndustryCorpus_automobile

This dataset was built to address three shortcomings of industry-specific training data: insufficient volume, low quality, and a lack of domain expertise. Applying 22 industry data processing operators to over 100 TB of open-source data yielded a high-quality 3.4 TB multi-industry Chinese-English pre-training dataset. The filtered data consists of 1 TB of Chinese text and 2.4 TB of English text, and the Chinese portion is annotated with 12 label types. The dataset covers 18 industry categories (e.g., medical, education, literature, finance) and has undergone rule-based and model-based filtering as well as document-level deduplication. It is partitioned into 18 industry-specific subsets; the description below pertains to the automotive subset.

Updated 7/26/2024
huggingface

Description

Dataset Overview

Dataset Description

  • Language: Chinese and English
  • Data Size: 1 TB Chinese data, 2.4 TB English data
  • Task Category: Text Generation
  • Industry Categories: 18 industry categories, including Medical, Education, Literature, Finance, Tourism, Law, Sports, Automotive, News, etc.

Data Processing

  • Data Sources: Filtered from over 100 TB of open-source corpora, including WuDaoCorpora, BAAI-CCI, RedPajama, and SkyPile-150B
  • Data Processing Operations: Applied 22 industry‑specific operators for cleaning and filtering
  • Rule‑based Filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.
  • Model-based Filtering: Industry classification language models with 80% accuracy
  • Deduplication: Document‑level deduplication using MinHash
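
The rule-based cleaning and MinHash deduplication steps listed above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual operators: the regular expressions, the use of NFC normalization as a stand-in for "Unicode repair", the shingle size, the number of hash functions, the 0.8 similarity threshold, and all function names are assumptions made for the example.

```python
import hashlib
import re
import unicodedata

# Illustrative patterns only (assumed); the real operators may differ.
URL_RE = re.compile(r"(?:https?://|www\.)[\x21-\x7e]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
IPV4_RE = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def clean_text(text: str) -> str:
    """Apply a few rule-based filters: links, emails, IPv4 addresses."""
    text = unicodedata.normalize("NFC", text)  # assumed stand-in for Unicode repair
    text = URL_RE.sub("", text)      # links first, since they may contain '@' or IPs
    text = EMAIL_RE.sub("", text)
    text = IPV4_RE.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)  # collapse gaps left by the removals
    return text.strip()


def shingles(text: str, k: int = 5) -> set:
    """Character k-grams of the text (at least one shingle, even if short)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> list:
    """One minimum hash value per salted hash function."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimate_jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents, so it only suits small samples; at terabyte scale, MinHash signatures are normally bucketed with locality-sensitive hashing instead.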

Data Annotation

  • Chinese Data Labels: 12 label types such as alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.
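
As a rough illustration of how a few of these labels could be computed per document, consider the sketch below. The exact definitions used for the dataset are not given here, so the formulas are assumptions: the alphanumeric ratio is taken as ASCII letters and digits over all non-whitespace characters, and line lengths are counted in characters. The function name and dictionary keys are hypothetical.

```python
def quality_labels(text: str) -> dict:
    """Compute three simple per-document quality labels (assumed definitions)."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]

    non_space = [ch for ch in text if not ch.isspace()]
    ascii_alnum = sum(ch.isascii() and ch.isalnum() for ch in non_space)

    return {
        # ASCII letters and digits as a share of non-whitespace characters
        "alnum_ratio": ascii_alnum / len(non_space) if non_space else 0.0,
        # mean and maximum line length, measured in characters
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
    }


labels = quality_labels("abc 123\n发动机")
print(labels["avg_line_length"], labels["max_line_length"])
```

Labels such as perplexity or language confidence require a trained model and are left out of this sketch.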

Dataset Performance Validation

  • Model Training: Conducted continual pre‑training, SFT, and DPO training
  • Performance Gains: 20% increase in target performance; 82% subjective win rate

Industry Data Size

Industry Category    Data Size (GB)    Industry Category          Data Size (GB)
Programming          4.1               Politics                   326.4
Law                  274.6             Mathematics                5.9
Education            458.1             Sports                     442
Finance              197.8             Literature                 179.3
Computer Science     46.9              News                       564.1
Technology           333.6             Film & TV                  162.1
Tourism              82.5              Medicine                   189.4
Agriculture          41.6              Automotive                 40.8
Sentiment            31.7              Artificial Intelligence    5.6
Total (GB)           3386.5
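
The figures in the table above can be cross-checked with a short script. The sketch below sums the 18 per-industry sizes and computes the share of the automotive subset; the variable names are illustrative, and the values are copied from the table.

```python
# Per-industry sizes in GB, copied from the table above.
sizes_gb = {
    "Programming": 4.1, "Politics": 326.4,
    "Law": 274.6, "Mathematics": 5.9,
    "Education": 458.1, "Sports": 442.0,
    "Finance": 197.8, "Literature": 179.3,
    "Computer Science": 46.9, "News": 564.1,
    "Technology": 333.6, "Film & TV": 162.1,
    "Tourism": 82.5, "Medicine": 189.4,
    "Agriculture": 41.6, "Automotive": 40.8,
    "Sentiment": 31.7, "Artificial Intelligence": 5.6,
}

total_gb = sum(sizes_gb.values())
automotive_share = sizes_gb["Automotive"] / total_gb

print(f"Total: {total_gb:.1f} GB")                   # Total: 3386.5 GB
print(f"Automotive share: {automotive_share:.2%}")   # Automotive share: 1.20%
```

The sum matches the stated total of 3386.5 GB, and the automotive subset accounts for roughly 1.2% of the full corpus.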

Dataset Split

  • Splitting Method: The large dataset is divided into 18 industry‑specific subsets; the current description refers to the automotive subset.

Topics

Automotive Industry
Data Processing

Source

Organization: huggingface

Created: 7/25/2024
