google/MusicCaps
Dataset Card for MusicCaps
Dataset Description
Overview
MusicCaps contains 5,521 music examples, each annotated with an English aspect list and a free‑text caption authored by musicians. An example aspect list might be "pop, tinny wide hi hats, mellow piano melody, high‑pitched female vocal melody, sustained pulsating synth lead", while a caption may consist of several sentences describing the sound, e.g.:
"A low‑sounding male voice is rapping over fast‑paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody. This recording is of poor audio quality. In the background, laughter can be heard. This song may be playing in a bar."
These annotated examples are extracted from the AudioSet dataset’s 10‑second music clips (2,858 from the eval split, 2,663 from the train split).
Usage
The released data are provided as a .csv file containing YouTube video IDs and start/end timestamps. Using the dataset requires downloading the corresponding YouTube videos and clipping them according to the timestamps.
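The card does not prescribe any particular tooling for this step. As one possible sketch, assuming `yt-dlp` and `ffmpeg` are installed (neither is named by the dataset authors), the download-and-clip workflow could look like:

```python
import subprocess
from pathlib import Path

def clip_args(src: str, start_s: float, end_s: float, dst: str) -> list[str]:
    # Build an ffmpeg command that cuts [start_s, end_s) out of the
    # downloaded audio. `-c copy` avoids re-encoding; the cut lands on
    # the nearest keyframe, which is adequate for 10-second excerpts.
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),
        "-i", src,
        "-t", str(end_s - start_s),
        "-c", "copy",
        dst,
    ]

def fetch_clip(ytid: str, start_s: float, end_s: float,
               out_dir: str = "clips") -> Path:
    """Download the full audio track with yt-dlp, then cut the annotated segment."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    full = out / f"{ytid}_full.m4a"
    clip = out / f"{ytid}_{int(start_s)}.m4a"
    subprocess.run(
        ["yt-dlp", "-f", "bestaudio[ext=m4a]", "-o", str(full),
         f"https://www.youtube.com/watch?v={ytid}"],
        check=True,
    )
    subprocess.run(clip_args(str(full), start_s, end_s, str(clip)), check=True)
    return clip
```

Calling `fetch_clip(ytid, start_s, end_s)` for each CSV row yields one 10-second audio file per annotated segment; note that some videos may have become unavailable since the dataset was released.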
Supported Tasks and Leaderboards
[More information needed]
Language
[More information needed]
Dataset Structure
Data Instances
[More information needed]
Data Fields
- ytid: YouTube ID of the video containing the annotated music segment. The segment can be listened to at https://youtu.be/watch?v={ytid}&start={start_s}.
- start_s: Start time of the music segment within the YouTube video, in seconds.
- end_s: End time of the music segment, in seconds (all segments are 10 seconds long).
- audioset_positive_labels: Labels from the AudioSet dataset for this segment.
- aspect_list: List of aspects describing the music.
- caption: Multi‑sentence free‑text caption describing the music.
- author_id: Integer used to group samples by author.
- is_balanced_subset: If true, the row belongs to a subset of 1,000 samples that is balanced across genres.
- is_audioset_eval: If true, the segment comes from the AudioSet eval split; otherwise from the train split.
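As a sketch of how these fields can be parsed from the released CSV, the snippet below reads a single illustrative row. The ytid and field values here are hypothetical, and the assumption that aspect_list is stored as a stringified Python list should be verified against the actual file:

```python
import ast
import csv
import io

# One hypothetical row mirroring the documented fields (not real data).
SAMPLE_CSV = '''ytid,start_s,end_s,audioset_positive_labels,aspect_list,caption,author_id,is_balanced_subset,is_audioset_eval
abc123XYZ00,30,40,"/m/04rlf","['pop', 'mellow piano melody']",A mellow piano plays over a pop beat.,4,False,True
'''

row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# aspect_list arrives as a string; literal_eval recovers the list
# (assuming the column really is a stringified Python list).
aspects = ast.literal_eval(row["aspect_list"])
duration = float(row["end_s"]) - float(row["start_s"])
print(aspects, duration)
```

Each segment's duration should come out to 10 seconds, consistent with the end_s field description above.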
Data Splits
[More information needed]
Dataset Creation
Curation Rationale
[More information needed]
Source Data
Initial Data Collection and Normalization
[More information needed]
Who Are the Source Language Producers?
[More information needed]
Annotation
Annotation Process
[More information needed]
Who Were the Annotators?
[More information needed]
Personal and Sensitive Information
[More information needed]
Considerations for Using the Dataset
Societal Impact
[More information needed]
Discussion of Biases
[More information needed]
Other Known Limitations
[More information needed]
Additional Information
Dataset Curators
The dataset was shared by @googleai.
License
The dataset is released under CC‑BY‑SA‑4.0.
Citation
[More information needed]
Contributions
[More information needed]