Dataset asset · Open Source Community · Speech Recognition · Audio Data
AudioDataset
A repository of audio datasets spanning speech, music, and audio mixtures: speech datasets such as VCTK and LibriSpeech, a music dataset (StarNet), and audio mixture datasets such as Libri2Mix and Divide and Remaster (DnR).
Source
github
Created
Nov 8, 2023
Updated
May 14, 2024
Signals
523 views
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
Speech Datasets
- VCTK v0.92
- Contains 110 English speakers with various accents, each reading approximately 400 sentences.
- All recordings converted to 48 kHz.
- Training set size: To be determined.
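VCTK recordings commonly follow a `p<speaker>_<utterance>` naming scheme (e.g. `p225_001_mic1.flac`); the exact suffix varies by release, so the parser below is a sketch based on that convention, not an official API.

```python
import re

def parse_vctk_name(filename: str):
    """Extract (speaker_id, utterance_id) from a VCTK-style filename.

    Assumes the common "p<speaker>_<utterance>" convention; raises if the
    name does not match.
    """
    m = re.match(r"(p\d+)_(\d+)", filename)
    if m is None:
        raise ValueError(f"unrecognized VCTK filename: {filename}")
    return m.group(1), m.group(2)
```

For example, `parse_vctk_name("p225_001_mic1.flac")` yields `("p225", "001")`.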
- VoiceBank-DEMAND
- Single-channel human speech, 48 kHz.
- Speech with noise, where noise is sourced from the DEMAND dataset.
- 28‑speaker version:
- Training set includes 11,572 utterances (≈9.4 hours).
- Test set includes 824 utterances (≈0.6 hours).
- A validation set of 770 utterances (≈0.6 hours) is carved out of the training set, leaving 10,802 training utterances (≈8.8 hours).
- 56‑speaker version: To be determined.
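The 28‑speaker split sizes above are internally consistent; a quick sanity check with the counts hard-coded from this page (the hour figures are approximate, so the per-utterance estimate is rough):

```python
# Utterance counts quoted above for the 28-speaker VoiceBank-DEMAND version.
TRAIN_FULL = 11_572       # original training set
VALID = 770               # validation utterances carved out of training
TRAIN_REMAINING = 10_802  # training utterances left after the split
TEST = 824

# The validation split plus the remaining training set should
# reconstitute the original training set.
assert VALID + TRAIN_REMAINING == TRAIN_FULL

# Rough average length, from the ~9.4 h quoted for 11,572 utterances.
seconds_per_utt = 9.4 * 3600 / TRAIN_FULL
print(f"~{seconds_per_utt:.1f} s per utterance")  # roughly 2.9 s on average
```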
- LibriSpeech
- Corpus of approximately 1,000 hours of read English speech at 16 kHz.
- Training set: 100h + 360h + 500h (total 960h).
- Development and test set details: To be determined.
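The 100 h + 360 h + 500 h training figure corresponds to the standard LibriSpeech training subsets; the names below are the usual archive names and the sizes are nominal hours:

```python
# Nominal sizes (hours) of the standard LibriSpeech training subsets.
TRAIN_SUBSETS = {
    "train-clean-100": 100,
    "train-clean-360": 360,
    "train-other-500": 500,
}

# The three subsets together give the 960 h total quoted above.
assert sum(TRAIN_SUBSETS.values()) == 960
```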
- DNS-Challenge 5
- Contains multilingual clean speech and various noises.
Music Datasets
- StarNet
- Contains 104 classical music tracks at 48 kHz, sourced from corresponding free MIDI files.
- Track details:
- xxx.0.wav: Clarinet–vibraslap mix
- xxx.1.wav: Clarinet track
- xxx.2.wav: Vibraslap track
- xxx.3.wav: String–piano mix
- xxx.4.wav: String track
- xxx.5.wav: Piano track
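Under the naming scheme listed above, the digit before the `.wav` extension identifies the stem. A small helper (hypothetical, not part of any official loader) to map it:

```python
# Mapping of the trailing digit in "xxx.N.wav" to the stem it contains,
# per the track list above.
STARNET_STEMS = {
    0: "clarinet-vibraslap mix",
    1: "clarinet track",
    2: "vibraslap track",
    3: "string-piano mix",
    4: "string track",
    5: "piano track",
}

def starnet_stem(filename: str) -> str:
    """Return the stem name for a StarNet file such as '042.3.wav'."""
    index = int(filename.rsplit(".", 2)[-2])  # the digit before ".wav"
    return STARNET_STEMS[index]
```

For example, `starnet_stem("042.3.wav")` returns `"string-piano mix"`.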
Audio Mixture Datasets
- Libri2Mix
- Details: To be determined.
- Divide and Remaster (DnR)
- Single-channel mixed audio, containing speech, music, and sound effects/background tracks, 44.1 kHz.
- Audio sources: LibriSpeech (speech), Free Music Archive (music), FSD50k (sound effects).
- Training set includes 3,406 mixes (≈57 hours).
- Validation set includes 487 mixes (≈8 hours).
- Test set includes 973 mixes (≈16 hours).
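The quoted split sizes imply mixes of roughly one minute each. A quick consistency check from the numbers above (the hour figures are approximate, so this is a rough estimate, not a specification):

```python
# (mix count, approximate total hours) for each DnR split, as quoted above.
DNR_SPLITS = {
    "train": (3406, 57),
    "valid": (487, 8),
    "test": (973, 16),
}

for name, (mixes, hours) in DNR_SPLITS.items():
    avg_seconds = hours * 3600 / mixes
    print(f"{name}: ~{avg_seconds:.0f} s per mix")
# Each split works out to roughly 60 s per mix.
```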
- MUSAN
- Corpus containing music, speech, and noise, 16 kHz, suitable for voice activity detection (VAD) and music/speech discrimination.
- Speech: 426 recordings (≈60 hours), sourced from LibriVox and the US government.
- Music: 660 recordings (≈42 hours), sourced from Jamendo, Free Music Archive, Incompetech, and HD Classical Music.
- Noise: 930 recordings (≈6 hours), sourced from Free Sound and Sound Bible.
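MUSAN distributions are commonly organized by category at the top level (`speech/`, `music/`, and `noise/` directories; this layout is an assumption about the standard release). A hedged sketch for routing files by category, useful when building VAD or music/speech-discrimination training lists:

```python
from pathlib import Path

# Top-level category directories in the assumed standard MUSAN layout.
MUSAN_CATEGORIES = {"speech", "music", "noise"}

def musan_category(path: str) -> str:
    """Infer the MUSAN category from a path such as
    'musan/noise/free-sound/noise-free-sound-0000.wav'."""
    for part in Path(path).parts:
        if part in MUSAN_CATEGORIES:
            return part
    raise ValueError(f"no MUSAN category found in path: {path}")
```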