M-AILABS Speech Dataset
The M‑AILABS Speech Dataset is the first large‑scale free dataset we provide for training both speech recognition and speech synthesis systems. Most of the data is derived from LibriVox and Project Gutenberg and comprises nearly a thousand hours of audio with aligned text files. The audio is split into segments of 1 to 20 seconds, each with a transcription. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, which come from the LibriVox project, are also in the public domain, except for the Ukrainian recordings, which were supplied by Nash Format or Gwara Media and are intended solely for machine‑learning use.
Description
Dataset Overview
Dataset Name
The M‑AILABS Speech Dataset
Intended Use
For training speech recognition and speech synthesis models.
Data Sources
- Text data: Project Gutenberg
- Audio data: LibriVox (Ukrainian audio provided by Nash Format or Gwara Media)
Scale
- Audio duration: close to 1 000 hours
- Text format: pre‑processed for easy use
- Audio segment length: 1–20 seconds
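The stated segment length can be checked on a local copy. The following is a minimal sketch, assuming the audio has been extracted as PCM WAV files somewhere under a single directory; the function names and the directory layout are illustrative and are not part of the dataset's documentation.

```python
import wave
from pathlib import Path

# Bounds taken from the "Audio segment length" entry above.
MIN_SECONDS = 1.0
MAX_SECONDS = 20.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_out_of_range(root, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Return (path, duration) pairs for WAV files outside [min_s, max_s].

    Assumption: every segment is a .wav file somewhere below `root`.
    """
    outliers = []
    for path in sorted(Path(root).rglob("*.wav")):
        duration = wav_duration_seconds(path)
        if not (min_s <= duration <= max_s):
            outliers.append((path, duration))
    return outliers
```

Calling `find_out_of_range("path/to/extracted/dataset")` (a placeholder path) returns the files whose duration falls outside the range, which may be useful for filtering before training.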
Copyright Information
- Text release dates: 1884–1964
- Copyright status: public domain, except the Ukrainian audio, which is for machine‑learning use only
Download
The dataset can be downloaded from: https://www.caito.de/?p=242
Source
Organization: github
Created: 3/21/2019