google/MusicCaps
MusicCaps is a dataset of 5,521 music excerpts, each paired with an English aspect list and a free‑text caption written by musicians. Captions focus on acoustic characteristics rather than metadata such as artist name. The dataset is released as a CSV file containing YouTube video IDs and start/end timestamps; users must download the corresponding YouTube videos and clip them according to the timestamps.
Dataset Card for MusicCaps
Dataset Description
Overview
MusicCaps contains 5,521 music examples, each annotated with an English aspect list and a free‑text caption authored by musicians. An example aspect list might be "pop, tinny wide hi hats, mellow piano melody, high‑pitched female vocal melody, sustained pulsating synth lead", while a caption may consist of several sentences describing the sound, e.g.:
"A low‑sounding male voice is rapping over fast‑paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody. This recording is of poor audio quality. In the background, laughter can be heard. This song may be playing in a bar."
These annotated examples are extracted from the AudioSet dataset’s 10‑second music clips (2,858 from the eval split, 2,663 from the train split).
Usage
The released data are provided as a .csv file containing YouTube video IDs and start/end timestamps. Using the dataset requires downloading the corresponding YouTube videos and clipping them according to the timestamps.
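The download-and-clip step can be sketched as follows. This is a minimal sketch, not part of the release: it assumes the `yt-dlp` and `ffmpeg` command-line tools are installed, that the CSV columns are named `ytid`, `start_s`, and `end_s` as documented below, and the filename `musiccaps-public.csv` is a placeholder for wherever you saved the released CSV.

```python
import csv
import subprocess
from pathlib import Path

def clip_commands(ytid, start_s, end_s, out_dir="clips"):
    """Build the yt-dlp download and ffmpeg clipping commands for one segment."""
    src = Path(out_dir) / f"{ytid}.m4a"                  # full audio track
    dst = Path(out_dir) / f"{ytid}_{int(start_s)}.wav"   # 10-second clip
    download = [
        "yt-dlp", "-f", "bestaudio[ext=m4a]",
        "-o", str(src), f"https://www.youtube.com/watch?v={ytid}",
    ]
    clip = [
        "ffmpeg", "-y", "-ss", str(start_s),
        "-t", str(end_s - start_s), "-i", str(src), str(dst),
    ]
    return download, clip

def process(csv_path="musiccaps-public.csv"):
    # Hypothetical driver: fetch each video's audio, then cut out the segment.
    Path("clips").mkdir(exist_ok=True)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            download, clip = clip_commands(
                row["ytid"], float(row["start_s"]), float(row["end_s"])
            )
            subprocess.run(download, check=True)
            subprocess.run(clip, check=True)
```

Note that some videos may have been removed from YouTube since the dataset was published, so a robust pipeline should tolerate failed downloads rather than aborting on the first error.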
Supported Tasks and Leaderboards
[More information needed]
Language
[More information needed]
Dataset Structure
Data Instances
[More information needed]
Data Fields
- ytid: YouTube ID of the video containing the annotated music segment. The segment can be listened to at https://youtu.be/watch?v={ytid}&start={start_s}.
- start_s: Start time of the music segment within the YouTube video.
- end_s: End time of the music segment (all segments are 10 seconds long).
- audioset_positive_labels: Labels from the AudioSet dataset for this segment.
- aspect_list: List of aspects describing the music.
- caption: Multi‑sentence free‑text caption describing the music.
- author_id: Integer used to group samples by author.
- is_balanced_subset: If true, the row belongs to a genre‑balanced 1k subset.
- is_audioset_eval: If true, the segment comes from the AudioSet eval split; otherwise from the train split.
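To make the fields concrete, the following sketch loads a synthetic two-row CSV with the schema above and filters on the boolean columns. All values are illustrative, not taken from the actual release, and the aspect list here uses a simplified semicolon-separated form rather than the real formatting.

```python
import io

import pandas as pd

# Synthetic rows with the column names documented above; the values are
# made up for illustration only.
sample = io.StringIO(
    "ytid,start_s,end_s,audioset_positive_labels,aspect_list,caption,"
    "author_id,is_balanced_subset,is_audioset_eval\n"
    "abc123,30,40,/m/04rlf,pop;mellow piano,A mellow pop song.,4,True,True\n"
    "def456,0,10,/m/04rlf,rock,A loud rock riff.,7,False,False\n"
)

df = pd.read_csv(sample)

# Every segment is 10 seconds long.
durations = df["end_s"] - df["start_s"]

# The boolean columns select the genre-balanced subset and the AudioSet
# eval split, respectively.
balanced = df[df["is_balanced_subset"]]
eval_split = df[df["is_audioset_eval"]]
```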
Data Splits
[More information needed]
Dataset Creation
Curation Rationale
[More information needed]
Source Data
Initial Data Collection and Normalization
[More information needed]
Who Produced the Source Language?
[More information needed]
Annotation
Annotation Process
[More information needed]
Who Were the Annotators?
[More information needed]
Personal and Sensitive Information
[More information needed]
Considerations for Using the Dataset
Societal Impact
[More information needed]
Discussion of Biases
[More information needed]
Other Known Limitations
[More information needed]
Additional Information
Dataset Curators
The dataset was shared by @googleai.
License
The dataset is released under CC‑BY‑SA‑4.0.
Citation
[More information needed]
Contributions
[More information needed]