Dataset asset · Open Source Community · Speech Recognition · Audio Data
AudioDataset
A repository of audio datasets spanning speech, music, and audio mixtures: speech datasets such as VCTK and LibriSpeech, a music dataset (StarNet), and audio mixture datasets such as Libri2Mix and Divide and Remaster (DnR).
Source
github
Created
Nov 8, 2023
Updated
May 14, 2024
Signals
523 views
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
Speech Datasets
- VCTK v0.92
- Contains 110 English speakers with various accents, each reading approximately 400 sentences.
- All recordings converted to 48 kHz.
- Training set size: To be determined.
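VCTK recordings commonly follow a `p<speaker>_<utterance>` naming scheme (e.g. `p225_001_mic1.flac`); the exact suffix varies by release, so the parser below is a sketch based on that convention, not an official API.

```python
import re

def parse_vctk_name(filename: str):
    """Extract (speaker_id, utterance_id) from a VCTK-style filename.

    Assumes the common "p<speaker>_<utterance>" convention; raises if the
    name does not match.
    """
    m = re.match(r"(p\d+)_(\d+)", filename)
    if m is None:
        raise ValueError(f"unrecognized VCTK filename: {filename}")
    return m.group(1), m.group(2)
```

For example, `parse_vctk_name("p225_001_mic1.flac")` yields `("p225", "001")`.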
- VoiceBank-DEMAND
- Single-channel human speech, 48 kHz.
- Speech with noise, where noise is sourced from the DEMAND dataset.
- 28‑speaker version:
- Training set includes 11,572 utterances (≈9.4 hours).
- Test set includes 824 utterances (≈0.6 hours).
- A validation set of 770 utterances (≈0.6 hours) is carved out of the training set, leaving 10,802 training utterances (≈8.8 hours).
- 56‑speaker version: To be determined.
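The 28‑speaker split sizes above are internally consistent; a quick sanity check with the counts hard-coded from this page (the hour figures are approximate, so the per-utterance estimate is rough):

```python
# Utterance counts quoted above for the 28-speaker VoiceBank-DEMAND version.
TRAIN_FULL = 11_572       # original training set
VALID = 770               # validation utterances carved out of training
TRAIN_REMAINING = 10_802  # training utterances left after the split
TEST = 824

# The validation split plus the remaining training set should
# reconstitute the original training set.
assert VALID + TRAIN_REMAINING == TRAIN_FULL

# Rough average length, from the ~9.4 h quoted for 11,572 utterances.
seconds_per_utt = 9.4 * 3600 / TRAIN_FULL
print(f"~{seconds_per_utt:.1f} s per utterance")  # roughly 2.9 s on average
```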
- LibriSpeech
- Corpus of approximately 1,000 hours of read English speech at 16 kHz.
- Training set: 100h + 360h + 500h (total 960h).
- Development and test set details: To be determined.
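The 100 h + 360 h + 500 h training figure corresponds to the standard LibriSpeech training subsets; the names below are the usual archive names and the sizes are nominal hours:

```python
# Nominal sizes (hours) of the standard LibriSpeech training subsets.
TRAIN_SUBSETS = {
    "train-clean-100": 100,
    "train-clean-360": 360,
    "train-other-500": 500,
}

# The three subsets together give the 960 h total quoted above.
assert sum(TRAIN_SUBSETS.values()) == 960
```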
- DNS-Challenge 5
- Contains multilingual clean speech and various noises.
Music Datasets
- StarNet
- Contains 104 classical music tracks at 48 kHz, sourced from corresponding free MIDI files.
- Track details:
- xxx.0.wav: Clarinet–vibraslap mix
- xxx.1.wav: Clarinet track
- xxx.2.wav: Vibraslap track
- xxx.3.wav: String–piano mix
- xxx.4.wav: String track
- xxx.5.wav: Piano track
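Under the naming scheme listed above, the digit before the `.wav` extension identifies the stem. A small helper (hypothetical, not part of any official loader) to map it:

```python
# Mapping of the trailing digit in "xxx.N.wav" to the stem it contains,
# per the track list above.
STARNET_STEMS = {
    0: "clarinet-vibraslap mix",
    1: "clarinet track",
    2: "vibraslap track",
    3: "string-piano mix",
    4: "string track",
    5: "piano track",
}

def starnet_stem(filename: str) -> str:
    """Return the stem name for a StarNet file such as '042.3.wav'."""
    index = int(filename.rsplit(".", 2)[-2])  # the digit before ".wav"
    return STARNET_STEMS[index]
```

For example, `starnet_stem("042.3.wav")` returns `"string-piano mix"`.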
Audio Mixture Datasets
- Libri2Mix
- Details: To be determined.
- Divide and Remaster (DnR)
- Single-channel mixed audio, containing speech, music, and sound effects/background tracks, 44.1 kHz.
- Audio sources: LibriSpeech (speech), Free Music Archive (music), FSD50k (sound effects).
- Training set includes 3,406 mixes (≈57 hours).
- Validation set includes 487 mixes (≈8 hours).
- Test set includes 973 mixes (≈16 hours).
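The quoted split sizes imply mixes of roughly one minute each. A quick consistency check from the numbers above (the hour figures are approximate, so this is a rough estimate, not a specification):

```python
# (mix count, approximate total hours) for each DnR split, as quoted above.
DNR_SPLITS = {
    "train": (3406, 57),
    "valid": (487, 8),
    "test": (973, 16),
}

for name, (mixes, hours) in DNR_SPLITS.items():
    avg_seconds = hours * 3600 / mixes
    print(f"{name}: ~{avg_seconds:.0f} s per mix")
# Each split works out to roughly 60 s per mix.
```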
- MUSAN
- Corpus containing music, speech, and noise, 16 kHz, suitable for voice activity detection (VAD) and music/speech discrimination.
- Speech: 426 recordings (≈60 hours), sourced from LibriVox and the US government.
- Music: 660 recordings (≈42 hours), sourced from Jamendo, Free Music Archive, Incompetech, and HD Classical Music.
- Noise: 930 recordings (≈6 hours), sourced from Free Sound and Sound Bible.
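MUSAN distributions are commonly organized by category at the top level (`speech/`, `music/`, and `noise/` directories; this layout is an assumption about the standard release). A hedged sketch for routing files by category, useful when building VAD or music/speech-discrimination training lists:

```python
from pathlib import Path

# Top-level category directories in the assumed standard MUSAN layout.
MUSAN_CATEGORIES = {"speech", "music", "noise"}

def musan_category(path: str) -> str:
    """Infer the MUSAN category from a path such as
    'musan/noise/free-sound/noise-free-sound-0000.wav'."""
    for part in Path(path).parts:
        if part in MUSAN_CATEGORIES:
            return part
    raise ValueError(f"no MUSAN category found in path: {path}")
```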