
M-AILABS Speech Dataset

The M‑AILABS Speech Dataset is the first large‑scale free dataset we provide for training both speech recognition and speech synthesis models. The data are derived primarily from LibriVox and Project Gutenberg and comprise nearly a thousand hours of audio with aligned text files. The audio is split into segments of 1 to 20 seconds, each with a transcription. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, which come from the LibriVox project, are also in the public domain, except for the Ukrainian recordings, which are supplied by Nash Format or Gwara Media and are intended solely for machine‑learning use.

Source
github
Created
Mar 21, 2019
Updated
Mar 8, 2024

Dataset Overview

Dataset Name

The M‑AILABS Speech Dataset

Intended Use

For training speech recognition and speech synthesis models.

Data Sources

  • Text data: Project Gutenberg
  • Audio data: LibriVox (Ukrainian audio provided by Nash Format or Gwara Media)

Scale

  • Audio duration: nearly 1,000 hours
  • Text format: pre‑processed for easy use
  • Audio segment length: 1–20 seconds
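
The stated 1–20 second segment length can be checked after download. The following is a minimal sketch, assuming the segments are uncompressed PCM WAV files in a single directory; that layout and the helper names `wav_duration_seconds` and `segments_outside_range` are illustrative assumptions, not part of the dataset documentation.

```python
import wave
from pathlib import Path

MIN_SECONDS = 1.0   # lower bound stated in the dataset description
MAX_SECONDS = 20.0  # upper bound stated in the dataset description


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def segments_outside_range(wav_dir, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """List (filename, duration) pairs for WAV files outside [min_s, max_s].

    Assumes wav_dir directly contains the .wav segments (hypothetical layout).
    """
    outliers = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        duration = wav_duration_seconds(path)
        if not (min_s <= duration <= max_s):
            outliers.append((path.name, duration))
    return outliers
```

An empty result from `segments_outside_range` suggests that every segment in the directory falls within the documented range; any entries returned are worth inspecting before training.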

Copyright Information

  • Text release dates: 1884–1964
  • Copyright status: Public domain, except the Ukrainian audio, which is for machine‑learning use only

Download

The dataset can be downloaded from: https://www.caito.de/?p=242
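
Once downloaded, the aligned text has to be paired with the audio segments. The sketch below assumes each transcript line is pipe‑delimited, with the audio file identifier first and the text to use last; this format, the `Utterance` type, and the `read_metadata` helper are assumptions for illustration, so verify them against the files you actually download.

```python
import csv
from typing import NamedTuple


class Utterance(NamedTuple):
    file_id: str  # identifier of the audio segment (assumed first field)
    text: str     # transcription (assumed last field)


def read_metadata(lines, delimiter="|"):
    """Parse delimited transcript lines into Utterance records.

    The field layout is an assumption, not a documented guarantee.
    Rows with fewer than two fields (such as blank lines) are skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    utterances = []
    for row in reader:
        if len(row) < 2:
            continue
        utterances.append(Utterance(row[0].strip(), row[-1].strip()))
    return utterances
```

`read_metadata` accepts any iterable of strings, so an open file handle (for example, one opened with `encoding="utf-8"` and `newline=""`) can be passed directly.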
