my-speech-datasets

This dataset contains public-domain speech data and is used to train Mozilla's DeepSpeech model. It currently focuses on Spanish, with plans to add more languages. It includes 120 hours of clean Spanish speech and 100 hours of clean speech from a single speaker; the single-speaker set is in LJSpeech format.

Updated 4/28/2024

github

Dataset Overview

Data Collection Method

The dataset was created by automatically aligning transcripts with the output of Windows speech recognition and then validating the result with Mozilla DeepSpeech models. Several models were used during validation: the first was trained on VoxForge Spanish data, and later passes used the model that achieved the highest confidence scores on the Windows speech-recognition results.
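The validation step can be pictured as a filter that keeps a clip only when the aligned transcript and a recognizer's output agree closely. The sketch below is illustrative only and is not the pipeline used to build this dataset: the `keep_clip` helper, the 0.1 word-error-rate threshold, and the normalization rules are all assumptions, and the recognizer output is passed in as a plain string rather than produced by DeepSpeech or Windows speech recognition.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words.

    Accented letters are kept, since they are meaningful in Spanish.
    """
    lowered = unicodedata.normalize("NFC", text.lower())
    cleaned = "".join(
        ch if (ch.isalnum() or ch.isspace()) else " " for ch in lowered
    )
    return cleaned.split()


def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


def keep_clip(transcript: str, asr_output: str, max_wer: float = 0.1) -> bool:
    """Hypothetical filter: accept a clip if the two texts nearly match.

    max_wer is an assumed threshold, not a value taken from the dataset.
    """
    wer = word_error_rate(normalize(transcript), normalize(asr_output))
    return wer <= max_wer
```

Under these assumptions, `keep_clip("Hola, ¿cómo estás?", "hola cómo estás")` accepts the clip because the two texts differ only in punctuation and case, while a transcript that shares no words with the recognizer output is rejected.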

Supported Languages

Spanish
- 120 hours of clean speech data (link: 120h of clean speech)
- 100 hours of clean speech from a single speaker, in LJSpeech format (link: 100h of clean speech from a single speaker)
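For the single-speaker set, a minimal loading sketch follows. It assumes the archive mirrors the layout of the original LJSpeech release: a `metadata.csv` file with pipe-separated fields (clip ID, transcription, normalized transcription) next to a `wavs/` directory holding one `.wav` file per clip ID. The actual layout of this dataset may differ, and the function name `read_ljspeech_metadata` is hypothetical.

```python
import csv
from pathlib import Path


def read_ljspeech_metadata(dataset_dir):
    """Return (wav_path, text) pairs from an LJSpeech-style metadata.csv.

    Assumes pipe-delimited rows of the form
    clip_id|transcription|normalized transcription
    and audio stored under <dataset_dir>/wavs/<clip_id>.wav.
    """
    root = Path(dataset_dir)
    entries = []
    with open(root / "metadata.csv", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            clip_id = row[0]
            text = row[-1]  # last field: normalized transcription if present
            entries.append((root / "wavs" / f"{clip_id}.wav", text))
    return entries
```

Calling `read_ljspeech_metadata("path/to/extracted/archive")` returns a list that can be fed to a training-manifest writer; the function does not check that the audio files exist, so a separate existence check is advisable before training.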

License

The dataset is released under a public‑domain license.
