M-AILABS Speech Dataset

The M-AILABS Speech Dataset is the first large-scale dataset we provide free of charge for training both speech recognition and speech synthesis models. Most of the data is derived from LibriVox and Project Gutenberg and comprises nearly a thousand hours of audio with aligned text files. The audio is split into segments of 1 to 20 seconds, each with a transcription. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, which come from the LibriVox project, are also in the public domain, except for the Ukrainian recordings, which are supplied by Nash Format or Gwara Media and are intended solely for machine-learning use.

Updated 3/8/2024


Description

Dataset Overview

Dataset Name

The M‑AILABS Speech Dataset

Intended Use

For training speech recognition and speech synthesis models.

Data Sources

Text data: Project Gutenberg
Audio data: LibriVox (Ukrainian audio provided by Nash Format or Gwara Media)

Scale

Audio duration: close to 1,000 hours
Text format: pre‑processed for easy use
Audio segment length: 1–20 seconds
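The stated segment length range can be checked on a local copy of the audio. The sketch below is a minimal example, assuming the recordings have been extracted as PCM WAV files somewhere under a single directory; that layout and the function names `wav_duration_seconds` and `find_out_of_range` are illustrative assumptions, not part of the dataset's documentation.

```python
import wave
from pathlib import Path

MIN_SECONDS = 1.0   # lower bound stated in the dataset description
MAX_SECONDS = 20.0  # upper bound stated in the dataset description


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_out_of_range(root, low=MIN_SECONDS, high=MAX_SECONDS):
    """Return (path, duration) pairs for WAV files outside [low, high] seconds.

    Assumes the audio is stored as .wav files anywhere below `root`;
    the actual directory layout of the download may differ.
    """
    flagged = []
    for path in sorted(Path(root).rglob("*.wav")):
        duration = wav_duration_seconds(path)
        if not low <= duration <= high:
            flagged.append((path, duration))
    return flagged
```

Calling `find_out_of_range("path/to/extracted/dataset")` (a placeholder path) would list any segments that fall outside the 1–20 second range, which can be useful before batching audio for training.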

Copyright Information

Text release dates: 1884–1964
Copyright status: Public domain, except for the Ukrainian audio, which is for machine-learning use only

Download

The dataset can be downloaded from: https://www.caito.de/?p=242

Topics

Speech Recognition

Speech Synthesis

Source

Organization: github

Created: 3/21/2019
