cvssp/WavCaps

---
license: cc-by-4.0
language:
- en
size_categories:
- 100B<n<1T
---

# WavCaps

WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites ([FreeSound](https://freesound.org/), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/), and [SoundBible](https://soundbible.com/)) and a sound event detection dataset ([AudioSet Strongly-labelled Subset](https://research.google.com/audioset/download_strong.html)).

- **Paper:** https://arxiv.org/abs/2303.17395
- **Github:** https://github.com/XinhaoMei/WavCaps

## Statistics

| Data Source        | # audio  | avg. audio duration (s) | avg. text length |
|--------------------|----------|-------------------------|------------------|
| FreeSound          | 262300   | 85.98                   | 6.77             |
| BBC Sound Effects  | 31201    | 115.04                  | 9.67             |
| SoundBible         | 1232     | 13.12                   | 5.87             |
| AudioSet SL subset | 108317   | 10.00                   | 9.79             |
| WavCaps            | 403050   | 67.59                   | 7.80             |

## Download

We provide a JSON file for each data source. For audio clips sourced from websites, we provide the processed caption, the raw description, and other metadata. For audio clips from AudioSet, we use the version from PANNs, where each file name is prefixed with a 'Y'. For the start time, please refer to the original metadata of the AudioSet SL subset.

Waveforms in FLAC format can be downloaded from the [Zip_files](https://huggingface.co/datasets/cvssp/WavCaps/tree/main/Zip_files) directory.

Pretrained models can be downloaded [here](https://drive.google.com/drive/folders/1pFr8IRY3E1FAtc2zjYmeuSVY3M5a-Kdj?usp=share_link).

<font color='red'>If you get "error: invalid zip file with overlapped components (possible zip bomb)" when unzipping, please try the following commands: </font>

`zip -F AudioSet_SL.zip --out AS.zip`

`unzip AS.zip`

## License

The WavCaps dataset may be used for academic purposes only. By downloading audio clips through the links provided in the JSON files, you agree that you will use the audio for research purposes only. For credits for audio clips from FreeSound, please refer to its own page.

For detailed license information, please refer to: [FreeSound](https://freesound.org/help/faq/#licenses), [BBC Sound Effects](https://sound-effects.bbcrewind.co.uk/licensing), [SoundBible](https://soundbible.com/about.php)

The models we provide were created under a UK data copyright exemption for non-commercial research.

## Code for related tasks

We provide code and pre-trained models for audio-language retrieval, automated audio captioning, and zero-shot audio classification.

* [Retrieval](https://github.com/XinhaoMei/WavCaps/tree/master/retrieval)
* [Captioning](https://github.com/XinhaoMei/WavCaps/tree/master/captioning)
* [Zero-shot Audio Classification](https://github.com/XinhaoMei/WavCaps/blob/master/retrieval/zero_shot_classification.py)
* [Text-to-Sound Generation](https://github.com/haoheliu/AudioLDM)

## Citation

Please cite the following if you make use of the dataset.

```bibtex
@article{mei2023wavcaps,
  title={WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={arXiv preprint arXiv:2303.17395},
  year={2023}
}
```

Updated 7/6/2023

Description

WavCaps Dataset Overview

Dataset Description

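WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. Audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and one sound event detection dataset (the AudioSet Strongly-labelled Subset).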

Dataset Statistics

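| Data Source        | # Audio Clips | Avg. Duration (s) | Avg. Text Length |
|--------------------|---------------|-------------------|------------------|
| FreeSound          | 262,300       | 85.98             | 6.77             |
| BBC Sound Effects  | 31,201        | 115.04            | 9.67             |
| SoundBible         | 1,232         | 13.12             | 5.87             |
| AudioSet SL subset | 108,317       | 10.00             | 9.79             |
| WavCaps            | 403,050       | 67.59             | 7.80             |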
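
The WavCaps row can be read as a count-weighted combination of the four source rows. The sketch below checks that reading using only the figures from the table; the names `SOURCES` and `summarize` are illustrative and not part of any WavCaps tooling.

```python
# Per-source figures copied from the statistics table:
# (number of clips, average duration in seconds, average text length).
SOURCES = {
    "FreeSound": (262300, 85.98, 6.77),
    "BBC Sound Effects": (31201, 115.04, 9.67),
    "SoundBible": (1232, 13.12, 5.87),
    "AudioSet SL subset": (108317, 10.00, 9.79),
}


def summarize(sources):
    """Return (total clips, weighted mean duration, weighted mean text length)."""
    total = sum(n for n, _, _ in sources.values())
    mean_duration = sum(n * d for n, d, _ in sources.values()) / total
    mean_text_len = sum(n * t for n, _, t in sources.values()) / total
    return total, mean_duration, mean_text_len


total, mean_duration, mean_text_len = summarize(SOURCES)

# Rough total audio length, estimated from the rounded per-source averages.
total_hours = total * mean_duration / 3600

print(f"{total} {mean_duration:.2f} {mean_text_len:.2f}")
# prints: 403050 67.59 7.80
```

Because the per-source averages are already rounded to two decimals, the recomputed values agree with the WavCaps row only to that precision, and `total_hours` is an estimate rather than an exact figure.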

Download Information

The dataset provides a JSON file for each source. For audio clips obtained from websites, processed descriptions, original descriptions, and additional metadata are supplied. AudioSet clips use the PANNs version with a "Y" prefix on file names. Audio files are in FLAC format and can be downloaded via the Zip_files directory.
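
As a minimal sketch of how the per-source JSON files and the "Y" prefix might be used together, the example below builds a lookup from audio file name to caption. The JSON layout shown here (a top-level `data` list with `id` and `caption` keys), the sample entries, and the assumption that IDs are stored without the prefix or extension are all hypothetical; inspect the real files before relying on any of them.

```python
import json

# Hypothetical sample, NOT copied from the dataset: the real key names
# and ID format may differ.
SAMPLE_JSON = """
{"data": [
  {"id": "abc123XYZ_0", "caption": "A dog barks in the distance."},
  {"id": "def456UVW_1", "caption": "Rain falls on a metal roof."}
]}
"""


def panns_file_name(clip_id, extension=".flac"):
    """Prefix a clip ID with 'Y' (the PANNs naming convention) and add an extension."""
    return "Y" + clip_id + extension


def load_captions(json_text, id_key="id", caption_key="caption"):
    """Map each expected audio file name to its caption."""
    entries = json.loads(json_text)["data"]
    return {panns_file_name(e[id_key]): e[caption_key] for e in entries}


captions = load_captions(SAMPLE_JSON)
for file_name, caption in captions.items():
    print(file_name, "->", caption)
```

The prefix is added unconditionally because a clip ID can itself begin with "Y", so checking for an existing prefix would be ambiguous. If the IDs in the real files already carry the prefix or an extension, `panns_file_name` would need adjusting.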

License

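WavCaps is restricted to academic use. By downloading audio clips through the provided links, you agree to use them solely for research purposes. Detailed license information can be found on the licensing pages of each source website: FreeSound (https://freesound.org/help/faq/#licenses), BBC Sound Effects (https://sound-effects.bbcrewind.co.uk/licensing), and SoundBible (https://soundbible.com/about.php).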

Related Code

Code and pre-trained models for audio-language retrieval, automated audio captioning, and zero-shot audio classification are provided in the project's GitHub repository (https://github.com/XinhaoMei/WavCaps).

Citation

If you use this dataset, please cite the following paper:

@article{mei2023wavcaps,
  title={WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={arXiv preprint arXiv:2303.17395},
  year={2023}
}

Topics

Audio Captioning

Multimodal Research

Source

Organization: hugging_face

