High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Sort:

Browse by Category

my-speech-datasets

This dataset is used to train Mozilla's DeepSpeech model, containing public‑domain speech data, currently focusing on Spanish speech, with plans to add more languages. It includes 120 hours of clean Spanish speech and 100 hours of clean speech from a single speaker, in LJSpeech format.

github

View Details

curated_20k_spanish

Natural Language Processing

Spanish

This dataset includes a feature named 'messages', which is a list containing two sub‑features: 'content' (string) and 'role' (string). The dataset is divided into a training split (train) with 20,207 samples, totaling 48,020,454 bytes. The download size is 24,914,380 bytes, and it is licensed under Apache 2.0. The language is Spanish.

huggingface

View Details