my-speech-datasets

This dataset contains public‑domain speech data for training Mozilla's DeepSpeech model. It currently focuses on Spanish, with plans to add more languages. It includes 120 hours of clean Spanish speech and 100 hours of clean speech from a single speaker, both in LJSpeech format.
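Because the data is in LJSpeech format, it can be read with a few lines of Python. The sketch below assumes the usual LJSpeech layout: a pipe-delimited `metadata.csv` whose columns are clip ID, transcript, and an optional normalized transcript, with audio stored as `wavs/<clip ID>.wav`. The function name and the sample clip IDs are hypothetical, and the exact file names in this dataset may differ.

```python
import csv
import io
from pathlib import Path


def read_ljspeech_metadata(source, wav_dir="wavs"):
    """Yield (wav_path, transcript) pairs from LJSpeech-style metadata lines."""
    # QUOTE_NONE: LJSpeech metadata is plain pipe-delimited text, so quote
    # characters inside a transcript must not be treated as CSV quoting.
    reader = csv.reader(source, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[0]:
            continue  # skip blank or malformed lines
        clip_id = row[0]
        # Prefer the normalized transcript (third column) when it is present.
        text = row[2] if len(row) > 2 and row[2] else row[1]
        yield Path(wav_dir) / f"{clip_id}.wav", text


# In-memory sample standing in for a real metadata.csv (hypothetical clips).
sample = io.StringIO("clip_0001|Hola mundo|hola mundo\nclip_0002|Buenos días\n")
pairs = list(read_ljspeech_metadata(sample))
```

For a real copy of the dataset, pass an open file instead of the sample, for example `open("metadata.csv", encoding="utf-8", newline="")`.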

Updated 4/28/2024
github

Description

Dataset Overview

Data Collection Method

The dataset was created by automatically aligning transcripts with Windows speech‑recognition output and then validating the result with Mozilla DeepSpeech models. Multiple language models were used during validation: the first was trained on VoxForge Spanish data, and it was followed by the model with the highest confidence scores on the Windows speech‑recognition results.
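One way to picture the validation step is as a filter that keeps a clip only when a recognizer's output agrees closely with the aligned transcript and the recognizer is confident. The following is a minimal sketch of that idea, not the authors' pipeline: the word-level similarity measure, the `keep_clip` function, and both thresholds are assumptions chosen for illustration, since the source does not state how agreement or confidence was scored.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and split into words (accents are kept)."""
    text = unicodedata.normalize("NFC", text.lower())
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return cleaned.split()


def keep_clip(reference, hypothesis, confidence,
              min_similarity=0.9, min_confidence=0.8):
    """Return True when the recognizer output matches the transcript closely.

    Both thresholds are illustrative assumptions, not values from the dataset.
    """
    # Compare word sequences so a single wrong word lowers the score clearly.
    similarity = SequenceMatcher(
        None, normalize(reference), normalize(hypothesis)
    ).ratio()
    return similarity >= min_similarity and confidence >= min_confidence


accepted = keep_clip("¿Hola, mundo?", "hola mundo", confidence=0.95)
rejected = keep_clip("Hola mundo", "ola mundo", confidence=0.95)
```

Comparing words rather than characters makes the filter strict about substitutions, which suits a dataset that aims for clean transcripts. A character-level measure would be more forgiving of small spelling differences.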

Supported Languages

Spanish, with more languages planned.

License

The dataset is released under a public‑domain license.

Topics

Speech Recognition
Spanish

Source

Organization: github

Created: 6/4/2019
