Explore high-quality datasets for your AI and machine learning projects.
This dataset is used to train Mozilla's DeepSpeech model, containing public‑domain speech data, currently focusing on Spanish speech, with plans to add more languages. It includes 120 hours of clean Spanish speech and 100 hours of clean speech from a single speaker, in LJSpeech format.
This dataset includes a feature named 'messages', which is a list containing two sub‑features: 'content' (string) and 'role' (string). The dataset is divided into a training split (train) with 20,207 samples, totaling 48,020,454 bytes. The download size is 24,914,380 bytes, and it is licensed under Apache 2.0. The language is Spanish.