M-AILABS Speech Dataset
The M‑AILABS Speech Dataset is the first large‑scale dataset we provide free of charge for training speech recognition and speech synthesis models. Most of the data is derived from LibriVox and Project Gutenberg and comprises nearly a thousand hours of audio with aligned text files. The audio is split into transcribed segments of 1 to 20 seconds each. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, which come from the LibriVox project, are also in the public domain, except for the Ukrainian recordings, which are supplied by Nash Format or Gwara Media and are intended solely for machine‑learning use.
Dataset description and usage context
Dataset Overview
Dataset Name
The M‑AILABS Speech Dataset
Intended Use
For training speech recognition and speech synthesis models.
Data Sources
- Text data: Project Gutenberg
- Audio data: LibriVox (Ukrainian audio provided by Nash Format or Gwara Media)
Scale
- Audio duration: nearly 1,000 hours
- Text format: pre‑processed and aligned with the audio
- Audio segment length: 1–20 seconds
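As a quick sanity check after downloading, the stated 1–20 second segment range can be verified locally. The following is a minimal sketch, assuming the audio is available as uncompressed PCM WAV files somewhere under a single directory. The directory layout and the function names `wav_duration_seconds` and `split_by_duration` are hypothetical and not part of the dataset's documentation.

```python
import wave
from pathlib import Path

# Segment length range stated in the dataset description.
MIN_SECONDS = 1.0
MAX_SECONDS = 20.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(wav_dir, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Partition the WAV files under wav_dir by whether they fall in [min_s, max_s].

    Returns two lists of (path, duration) tuples: in range, out of range.
    Assumption: all segments are .wav files somewhere below wav_dir.
    """
    in_range, out_of_range = [], []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        duration = wav_duration_seconds(path)
        target = in_range if min_s <= duration <= max_s else out_of_range
        target.append((path, duration))
    return in_range, out_of_range
```

Calling `split_by_duration("path/to/extracted/dataset")` returns the segments inside and outside the range, which makes it easy to spot truncated or unusually long clips before training.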
Copyright Information
- Text publication dates: 1884–1964
- Copyright status: public domain, except for the Ukrainian audio, which is for machine‑learning use only
Download
The dataset can be downloaded from: https://www.caito.de/?p=242