M-AILABS Speech Dataset

The M-AILABS Speech Dataset is the first large-scale free dataset we provide for training both speech recognition and speech synthesis models. Most of the data is derived from LibriVox and Project Gutenberg and consists of nearly a thousand hours of audio with aligned text files. Each audio segment is between 1 and 20 seconds long and has a transcription. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, which come from the LibriVox project, are also in the public domain, except for the Ukrainian recordings, which are supplied by Nash Format or Gwara Media and are intended solely for machine-learning use.

Updated 3/8/2024

Description

Dataset Overview

Dataset Name

The M‑AILABS Speech Dataset

Intended Use

For training speech recognition and speech synthesis models.

Data Sources

  • Text data: Project Gutenberg
  • Audio data: LibriVox (Ukrainian audio provided by Nash Format or Gwara Media)

Scale

  • Audio duration: close to 1,000 hours
  • Text format: pre‑processed for easy use
  • Audio segment length: 1–20 seconds
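
The scale figures above can be checked against a local copy of the data. The following is a minimal sketch, not part of the dataset's own tooling: it assumes the audio has been extracted as uncompressed PCM WAV files somewhere under a root directory, and the function names and the directory layout are hypothetical. It sums the clip durations and lists any clip outside the stated 1–20 second range.

```python
import wave
from pathlib import Path

MIN_SECONDS = 1.0   # lower bound stated in the listing
MAX_SECONDS = 20.0  # upper bound stated in the listing


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize_segments(root):
    """Return (total hours, clips outside the 1-20 s range) for WAVs under root.

    Assumption: clips are stored as .wav files; the listing does not
    describe the on-disk layout, so the search is simply recursive.
    """
    total_seconds = 0.0
    out_of_range = []
    for path in sorted(Path(root).rglob("*.wav")):
        duration = wav_duration_seconds(path)
        total_seconds += duration
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            out_of_range.append((path, duration))
    return total_seconds / 3600.0, out_of_range
```

Calling `summarize_segments("path/to/extracted/dataset")` (a placeholder path) returns the total audio length in hours together with the list of out-of-range clips, which can be compared with the figures quoted here.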

Copyright Information

  • Text release dates: 1884–1964
  • Copyright status: Public domain, except for the Ukrainian audio, which is for machine-learning use only

Download

The dataset can be downloaded from: https://www.caito.de/?p=242

Topics

Speech Recognition
Speech Synthesis

Source

Organization: github

Created: 3/21/2019
