Dataset asset | Open Source Community | Speech Recognition | Spanish

my-speech-datasets

This dataset contains public-domain speech data for training Mozilla's DeepSpeech model. It currently focuses on Spanish, with plans to add more languages. It includes 120 hours of clean Spanish speech and 100 hours of clean speech from a single speaker, in LJSpeech format.
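
Because the data is described as being in LJSpeech format, a loader can be sketched as follows. This is a minimal sketch, assuming the conventional LJSpeech layout: a pipe-delimited `metadata.csv` with one clip per line (`id|transcription|normalized transcription`) next to a `wavs/` directory. The listing does not confirm that this dataset matches that layout exactly, and the function name `read_ljspeech_metadata` is hypothetical.

```python
import csv
from pathlib import Path


def read_ljspeech_metadata(dataset_dir):
    """Yield (wav_path, text) pairs from an LJSpeech-style metadata.csv.

    Assumes the conventional layout: <dataset_dir>/metadata.csv and
    <dataset_dir>/wavs/<id>.wav. Adjust if the actual archive differs.
    """
    root = Path(dataset_dir)
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        # LJSpeech metadata is pipe-delimited and does not use quoting.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            clip_id = row[0]
            # Prefer the normalized transcription when the third column exists.
            text = row[2] if len(row) > 2 and row[2] else row[1]
            yield root / "wavs" / f"{clip_id}.wav", text
```

Each yielded pair can then be written to whatever manifest format the training pipeline expects.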

Source: GitHub
Created: Jun 4, 2019
Updated: Apr 28, 2024

Dataset Overview

Data Collection Method

The dataset was created by automatically aligning transcripts with the output of Windows speech recognition and then validating the alignments with Mozilla DeepSpeech models. Multiple language models were used during validation: the initial model was trained on VoxForge Spanish data, and it was followed by the model with the highest confidence scores on the Windows speech-recognition results.
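
The validation step can be illustrated with a simple filter that keeps a clip only when a recognizer's output is close to the aligned transcript. This is a sketch under stated assumptions, not the author's pipeline: the source does not say which metric or threshold was used, so word error rate (WER), the `max_wer` value, and the function names here are all hypothetical choices for illustration.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER of a hypothesis against a reference transcript (case-insensitive)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def keep_validated(clips, max_wer=0.1):
    """Return ids of clips whose recognizer output is close to the transcript.

    clips: iterable of (clip_id, transcript, recognizer_output) tuples.
    max_wer: hypothetical acceptance threshold, not taken from the source.
    """
    return [clip_id for clip_id, ref, hyp in clips
            if word_error_rate(ref, hyp) <= max_wer]
```

A stricter threshold discards more audio but leaves cleaner transcripts, which is the usual trade-off when building a "clean" speech corpus automatically.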

Supported Languages

Spanish, with more languages planned.

License

The dataset is released under a public‑domain license.
