High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

roszcz/maestro-base-v2

The maestro‑base‑v2 dataset is intended for music analysis. It has three main features: `notes`, `control_changes`, and `source`. Each `notes` entry has numeric fields for pitch, velocity, start time, and end time. Each `control_changes` entry has numeric fields for control number, time, and value. `source` is a string that possibly indicates the origin of the music. The dataset is split into train (962 samples), validation (137 samples), and test (177 samples), for 1,276 samples in total. Total download size is 141,530,448 bytes; total size is 493,963,458 bytes.
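The structure described above can be sketched in Python. This is only an illustration: the class names, field names, and field types (`int` for pitch, velocity, and control values; `float` for times) are assumptions, not the dataset's actual schema. Only the split counts come from the listing.

```python
from dataclasses import dataclass


@dataclass
class Note:
    # Hypothetical field names; the listing only says these are numeric.
    pitch: int
    velocity: int
    start: float
    end: float

    @property
    def duration(self) -> float:
        """Length of the note, in the same time unit as start/end."""
        return self.end - self.start


@dataclass
class ControlChange:
    # Hypothetical field names for control number, time, and value.
    number: int
    time: float
    value: int


# Split sizes as stated in the dataset description.
SPLITS = {"train": 962, "validation": 137, "test": 177}


def split_fractions(splits: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total_records = sum(SPLITS.values())
fractions = split_fractions(SPLITS)

if __name__ == "__main__":
    print(f"total samples: {total_records}")
    for name, share in fractions.items():
        print(f"{name}: {share:.1%}")
```

By this count the train split holds roughly three quarters of the samples, with the rest divided between validation and test.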

huggingface

IndustryCorpus2

Industry Model Training

Data Processing

This dataset is a high‑quality corpus for training industry models. It covers 31 industry categories in Chinese and English, with 1 TB of Chinese data and 2.2 TB of English data. The data has undergone source upgrades, industry taxonomy updates, semantic quality filtering, and tiered quality processing, and is offered at three quality levels (high, medium, low) to suit different model training needs. Its primary aim is to improve the performance of industry models, in support of intelligent transformation and innovation in industry applications.

huggingface

IndustryCorpus_automobile

Automotive Industry

Data Processing

This dataset was constructed to address shortcomings in industry‑specific training data: insufficient volume, low quality, and a lack of domain expertise. Applying 22 industry data processing operators to over 100 TB of open‑source data yielded a high‑quality 3.4 TB multi‑industry Chinese‑English pre‑training dataset. The filtered data consists of 1 TB of Chinese text and 2.4 TB of English text, and the Chinese portion is annotated with 12 label types. The dataset covers 18 industry categories (e.g., medical, education, literature, finance) and has undergone rule‑based and model‑based filtering as well as document‑level deduplication. It is partitioned into 18 industry‑specific subsets; this entry is the automotive subset.
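As a rough illustration of what rule‑based filtering followed by document‑level deduplication can look like, here is a minimal Python sketch. It is not the pipeline used to build this dataset: the length rule, the whitespace and case normalization, and the exact‑match SHA‑256 hashing are all assumptions chosen for simplicity, and the function names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def passes_rules(text: str, min_chars: int = 20) -> bool:
    """Hypothetical rule-based filter: drop very short documents."""
    return len(text.strip()) >= min_chars


def dedupe_documents(docs: list[str]) -> list[str]:
    """Keep the first copy of each document that passes the rule filter.

    Duplicates are detected by exact match on the normalized text
    (assumed here; real pipelines may use fuzzy matching instead).
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not passes_rules(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "The engine uses a turbocharged inline-four layout.",
        "the  engine uses a turbocharged   inline-four layout.",
        "too short",
        "Brake pads should be inspected every 20,000 km.",
    ]
    for doc in dedupe_documents(sample):
        print(doc)
```

Exact hashing only catches identical documents after normalization; near‑duplicates would need a similarity‑based method.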

huggingface