
AudioDataset

A repository that collects audio datasets in three groups: speech datasets such as VCTK and LibriSpeech, music datasets such as StarNet, and audio mixture datasets such as Libri2Mix and Divide and Remaster (DnR).

Updated 5/14/2024

Description

Dataset Overview

Speech Datasets

  1. VCTK v0.92

    • Contains 110 English speakers with various accents, each reading approximately 400 sentences.
    • All recordings converted to 48 kHz.
    • Training set size: To be determined.
  2. VoiceBank-DEMAND

    • Single-channel human speech at 48 kHz.
    • Noisy speech, with the noise drawn from the DEMAND dataset.
    • 28‑speaker version:
      • Training set includes 11,572 utterances (≈9.4 hours).
      • Test set includes 824 utterances (≈0.6 hours).
      • When a validation set of 770 utterances (≈0.6 hours) is held out from the training set, 10,802 training utterances (≈8.8 hours) remain.
    • 56‑speaker version: To be determined.
  3. LibriSpeech

    • A corpus of approximately 1,000 hours of read English speech, sampled at 16 kHz.
    • Training set: 100h + 360h + 500h (total 960h).
    • Development and test set details: To be determined.
  4. DNS-Challenge 5

    • Contains multilingual clean speech and a variety of noise recordings.
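
The VoiceBank-DEMAND figures above describe a validation set carved out of the 28-speaker training set (11,572 = 10,802 + 770). The listing does not say how those 770 utterances were chosen, so the following is only a minimal sketch that assumes seeded random sampling; the `hold_out` helper and the `utt_…` ids are hypothetical placeholders, not names from the dataset.

```python
import random


def hold_out(utterance_ids, n_validation, seed=0):
    """Split ids into (train, validation) by seeded random sampling.

    Random sampling is an assumption made for illustration; the listing
    does not state how the validation utterances were selected.
    """
    ids = sorted(utterance_ids)  # sort first so the split is reproducible
    rng = random.Random(seed)
    validation = set(rng.sample(ids, n_validation))
    train = [u for u in ids if u not in validation]
    return train, sorted(validation)


# Placeholder ids standing in for the 11,572 training utterances.
ids = [f"utt_{i:05d}" for i in range(11_572)]
train, validation = hold_out(ids, 770)
print(len(train), len(validation))  # 10802 770
```

Fixing the seed keeps the split stable across runs, which matters when validation scores are compared between experiments.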

Music Datasets

  1. StarNet
    • Contains 104 classical music tracks at 48 kHz, sourced from corresponding free MIDI files.
    • Track details:
      • xxx.0.wav: Clarinet–vibraphone mix
      • xxx.1.wav: Clarinet track
      • xxx.2.wav: Vibraphone track
      • xxx.3.wav: String–piano mix
      • xxx.4.wav: String track
      • xxx.5.wav: Piano track
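
The suffix convention in the track list can be turned into a small lookup: suffixes 0 and 3 are mixes, 1 and 2 are the stems of mix 0, and 4 and 5 are the stems of mix 3. The sketch below assumes that `xxx` stands for a per-piece track id and that all six files of a piece sit in the same directory; `parse_track` and `stem_paths` are hypothetical helper names, not part of any StarNet tooling.

```python
from pathlib import Path

# Suffix layout taken from the track list above:
# mix 0 is built from stems 1 and 2, mix 3 from stems 4 and 5.
STEMS_OF_MIX = {0: (1, 2), 3: (4, 5)}


def parse_track(path):
    """Return (track_id, suffix) for a file named '<track_id>.<n>.wav'."""
    parts = Path(path).name.rsplit(".", 2)
    if len(parts) != 3 or parts[2] != "wav" or not parts[1].isdigit():
        raise ValueError(f"unexpected file name: {path}")
    suffix = int(parts[1])
    if suffix not in range(6):
        raise ValueError(f"suffix out of range 0-5: {path}")
    return parts[0], suffix


def stem_paths(mix_path):
    """Given a mix file (suffix 0 or 3), return the paths of its two stems."""
    mix_path = Path(mix_path)
    track_id, suffix = parse_track(mix_path)
    if suffix not in STEMS_OF_MIX:
        raise ValueError(f"not a mix file: {mix_path}")
    return [mix_path.with_name(f"{track_id}.{n}.wav") for n in STEMS_OF_MIX[suffix]]


print([p.name for p in stem_paths("data/song.3.wav")])  # ['song.4.wav', 'song.5.wav']
```

Pairing each mix with its stems this way gives the (input, targets) tuples a source-separation training loop would typically consume.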

Audio Mixture Datasets

  1. Libri2Mix

    • Details: To be determined.
  2. Divide and Remaster (DnR)

    • Single-channel mixtures at 44.1 kHz, each combining speech, music, and sound-effect/background tracks.
    • Audio sources: LibriSpeech (speech), Free Music Archive (music), FSD50K (sound effects).
    • Training set includes 3,406 mixes (≈57 hours).
    • Validation set includes 487 mixes (≈8 hours).
    • Test set includes 973 mixes (≈16 hours).
  3. MUSAN

    • A 16 kHz corpus of music, speech, and noise, suitable for voice activity detection (VAD) and music/speech discrimination.
    • Speech: 426 recordings (≈60 hours), sourced from LibriVox and US government recordings.
    • Music: 660 recordings (≈42 hours), sourced from Jamendo, Free Music Archive, Incompetech, and HD Classical Music.
    • Noise: 930 recordings (≈6 hours), sourced from Freesound and SoundBible.
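
The datasets listed here use different sample rates (44.1 kHz for DnR, 16 kHz for MUSAN, and 48 kHz or 16 kHz for the speech sets), so combining them usually means resampling to a common rate first. As a minimal sketch, the integer up/down factors needed by a polyphase resampler can be derived from the greatest common divisor of the two rates; `resample_factors` is a hypothetical helper name, and 16 kHz as the target is an assumption for illustration.

```python
from math import gcd


def resample_factors(src_hz, dst_hz):
    """Return the smallest integer (up, down) with src_hz * up / down == dst_hz."""
    if src_hz <= 0 or dst_hz <= 0:
        raise ValueError("sample rates must be positive integers")
    g = gcd(src_hz, dst_hz)
    return dst_hz // g, src_hz // g


# Bring 44.1 kHz and 48 kHz material down to an assumed 16 kHz target.
print(resample_factors(44_100, 16_000))  # (160, 441)
print(resample_factors(48_000, 16_000))  # (1, 3)
```

A pair such as `(160, 441)` can then be handed to a polyphase routine, for example `scipy.signal.resample_poly(x, up, down)`, which also applies the anti-aliasing filter that downsampling requires.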


Topics

Audio Data
Speech Recognition

Source

Organization: github

Created: 11/8/2023
