DATASET
Open Source Community
AudioDataset
A repository of audio datasets covering speech, music, and audio mixtures. It includes speech datasets such as VCTK and LibriSpeech, music datasets such as StarNet, and audio mixture datasets such as Libri2Mix and Divide and Remaster (DnR).
Updated 5/14/2024
github
Description
Dataset Overview
Speech Datasets
- VCTK v0.92
  - Contains 110 English speakers with various accents, each reading approximately 400 sentences.
  - All recordings converted to 48 kHz.
  - Training set size: To be determined.
- VoiceBank-DEMAND
  - Single-channel human speech, 48 kHz.
  - Noisy speech; the noise is sourced from the DEMAND dataset.
  - 28-speaker version:
    - Training set: 11,572 utterances (≈9.4 hours).
    - Test set: 824 utterances (≈0.6 hours).
    - Validation set (held out from the training set): 770 utterances (≈0.6 hours), leaving 10,802 training utterances (≈8.8 hours).
  - 56-speaker version: To be determined.
- LibriSpeech
  - Corpus of approximately 1,000 hours of read English speech at 16 kHz.
  - Training set: 100 h + 360 h + 500 h (960 h in total).
  - Development and test set details: To be determined.
- DNS-Challenge 5
  - Contains multilingual clean speech and various noises.
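The VoiceBank-DEMAND figures above imply a hold-out of 770 utterances from the 11,572 training utterances, leaving 10,802. The card does not say how the validation utterances were chosen, so the following is only a minimal sketch that assumes a seeded random hold-out; the function name `split_train_validation` and the placeholder utterance IDs are hypothetical.

```python
import random


def split_train_validation(utterance_ids, n_validation=770, seed=0):
    """Hold out n_validation utterances; return (train_ids, validation_ids).

    Assumption: a seeded random hold-out. The dataset card does not state
    how its validation utterances were actually selected.
    """
    ids = sorted(utterance_ids)  # sort first so the result is reproducible
    if n_validation > len(ids):
        raise ValueError("n_validation exceeds the number of utterances")
    rng = random.Random(seed)
    rng.shuffle(ids)
    validation = sorted(ids[:n_validation])
    train = sorted(ids[n_validation:])
    return train, validation


# Placeholder IDs standing in for the 11,572 training utterances.
all_ids = [f"utt_{i:05d}" for i in range(11572)]
train_ids, validation_ids = split_train_validation(all_ids)
print(len(train_ids), len(validation_ids))  # 10802 770
```

A speaker-level hold-out (keeping every utterance of a given speaker in one split) may be preferable when speaker leakage matters; this sketch does not attempt that.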
Music Datasets
- StarNet
  - Contains 104 classical music tracks at 48 kHz, sourced from corresponding free MIDI files.
  - Track details:
    - xxx.0.wav: Clarinet–vibraslap mix
    - xxx.1.wav: Clarinet track
    - xxx.2.wav: Vibraslap track
    - xxx.3.wav: String–piano mix
    - xxx.4.wav: String track
    - xxx.5.wav: Piano track
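The StarNet naming scheme above encodes the stem type in the number before `.wav`. As an illustration, that convention could be parsed as follows; this is a sketch assuming file names follow the `xxx.N.wav` pattern exactly, and the helper name `describe_starnet_file` is hypothetical rather than part of any StarNet tooling.

```python
from pathlib import Path

# Suffix-to-stem mapping, taken from the track details listed above.
STEM_BY_SUFFIX = {
    "0": "clarinet-vibraslap mix",
    "1": "clarinet track",
    "2": "vibraslap track",
    "3": "string-piano mix",
    "4": "string track",
    "5": "piano track",
}


def describe_starnet_file(filename):
    """Return (track_id, stem_description) for a name like 'xxx.3.wav'."""
    path = Path(filename)
    if path.suffix.lower() != ".wav":
        raise ValueError(f"not a .wav file: {filename}")
    # 'xxx.3' -> track_id 'xxx', stem suffix '3'
    track_id, sep, stem_suffix = path.stem.rpartition(".")
    if not sep or not track_id or stem_suffix not in STEM_BY_SUFFIX:
        raise ValueError(f"unexpected StarNet file name: {filename}")
    return track_id, STEM_BY_SUFFIX[stem_suffix]


print(describe_starnet_file("xxx.3.wav"))  # ('xxx', 'string-piano mix')
```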
Audio Mixture Datasets
- Libri2Mix
  - Details: To be determined.
- Divide and Remaster (DnR)
  - Single-channel mixed audio at 44.1 kHz, containing speech, music, and sound effects/background tracks.
  - Audio sources: LibriSpeech (speech), Free Music Archive (music), FSD50k (sound effects).
  - Training set: 3,406 mixes (≈57 hours).
  - Validation set: 487 mixes (≈8 hours).
  - Test set: 973 mixes (≈16 hours).
- MUSAN
  - Corpus of music, speech, and noise at 16 kHz, suitable for voice activity detection (VAD) and music/speech discrimination.
  - Speech: 426 recordings (≈60 hours), sourced from LibriVox and the US government.
  - Music: 660 recordings (≈42 hours), sourced from Jamendo, Free Music Archive, Incompetech, and HD Classical Music.
  - Noise: 930 recordings (≈6 hours), sourced from Freesound and Sound Bible.
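As a quick sanity check on the DnR split sizes listed above, the mix counts can be converted into shares of the whole dataset. This is a small sketch using only the counts stated in this card; the names `DNR_MIXES` and `split_shares` are introduced here for illustration.

```python
# Mix counts per split, copied from the DnR figures listed above.
DNR_MIXES = {"train": 3406, "validation": 487, "test": 973}


def split_shares(counts):
    """Return each split's share of the total as a percentage, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_mixes = sum(DNR_MIXES.values())
shares = split_shares(DNR_MIXES)
print(total_mixes)  # 4866
print(shares)       # {'train': 70.0, 'validation': 10.0, 'test': 20.0}
```

The counts work out to roughly a 70/10/20 train/validation/test split.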
Topics
Audio Data
Speech Recognition
Source
Organization: github
Created: 11/8/2023