google/MusicCaps
Dataset Card for MusicCaps
Dataset Description
Overview
MusicCaps contains 5,521 music examples, each annotated with an English aspect list and a free‑text caption authored by musicians. An example aspect list might be "pop, tinny wide hi hats, mellow piano melody, high‑pitched female vocal melody, sustained pulsating synth lead", while a caption may consist of several sentences describing the sound, e.g.:
"A low‑sounding male voice is rapping over fast‑paced drums playing a reggaeton beat along with a bass. Something like a guitar is playing the melody. This recording is of poor audio quality. In the background, laughter can be heard. This song may be playing in a bar."
These annotated examples are extracted from the AudioSet dataset’s 10‑second music clips (2,858 from the eval split, 2,663 from the train split).
Usage
The released data are provided as a .csv file containing YouTube video IDs and start/end timestamps. Using the dataset requires downloading the corresponding YouTube videos and clipping them according to the timestamps.
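The card does not prescribe any particular tooling for this step. As one possible sketch, assuming `yt-dlp` and `ffmpeg` are installed (neither is named by the dataset authors), the download-and-clip workflow could look like:

```python
import subprocess
from pathlib import Path

def clip_args(src: str, start_s: float, end_s: float, dst: str) -> list[str]:
    # Build an ffmpeg command that cuts [start_s, end_s) out of the
    # downloaded audio. `-c copy` avoids re-encoding; the cut lands on
    # the nearest keyframe, which is adequate for 10-second excerpts.
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),
        "-i", src,
        "-t", str(end_s - start_s),
        "-c", "copy",
        dst,
    ]

def fetch_clip(ytid: str, start_s: float, end_s: float,
               out_dir: str = "clips") -> Path:
    """Download the full audio track with yt-dlp, then cut the annotated segment."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    full = out / f"{ytid}_full.m4a"
    clip = out / f"{ytid}_{int(start_s)}.m4a"
    subprocess.run(
        ["yt-dlp", "-f", "bestaudio[ext=m4a]", "-o", str(full),
         f"https://www.youtube.com/watch?v={ytid}"],
        check=True,
    )
    subprocess.run(clip_args(str(full), start_s, end_s, str(clip)), check=True)
    return clip
```

Calling `fetch_clip(ytid, start_s, end_s)` for each CSV row yields one 10-second audio file per annotated segment; note that some videos may have become unavailable since the dataset was released.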
Supported Tasks and Leaderboards
[More information needed]
Language
[More information needed]
Dataset Structure
Data Instances
[More information needed]
Data Fields
- ytid: YouTube ID of the video containing the annotated music segment. The segment can be listened to at https://youtu.be/watch?v={ytid}&start={start_s}.
- start_s: Start time of the music segment within the YouTube video, in seconds.
- end_s: End time of the music segment, in seconds (all segments are 10 seconds long).
- audioset_positive_labels: Labels from the AudioSet dataset for this segment.
- aspect_list: List of aspects describing the music.
- caption: Multi‑sentence free‑text caption describing the music.
- author_id: Integer used to group samples by author.
- is_balanced_subset: If true, the row belongs to a subset of 1,000 samples that is balanced across genres.
- is_audioset_eval: If true, the segment comes from the AudioSet eval split; otherwise from the train split.
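As a sketch of how these fields can be parsed from the released CSV, the snippet below reads a single illustrative row. The ytid and field values here are hypothetical, and the assumption that aspect_list is stored as a stringified Python list should be verified against the actual file:

```python
import ast
import csv
import io

# One hypothetical row mirroring the documented fields (not real data).
SAMPLE_CSV = '''ytid,start_s,end_s,audioset_positive_labels,aspect_list,caption,author_id,is_balanced_subset,is_audioset_eval
abc123XYZ00,30,40,"/m/04rlf","['pop', 'mellow piano melody']",A mellow piano plays over a pop beat.,4,False,True
'''

row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# aspect_list arrives as a string; literal_eval recovers the list
# (assuming the column really is a stringified Python list).
aspects = ast.literal_eval(row["aspect_list"])
duration = float(row["end_s"]) - float(row["start_s"])
print(aspects, duration)
```

Each segment's duration should come out to 10 seconds, consistent with the end_s field description above.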
Data Splits
[More information needed]
Dataset Creation
Curation Rationale
[More information needed]
Source Data
Initial Data Collection and Normalization
[More information needed]
Who Are the Source Language Producers?
[More information needed]
Annotation
Annotation Process
[More information needed]
Who Were the Annotators?
[More information needed]
Personal and Sensitive Information
[More information needed]
Considerations for Using the Dataset
Societal Impact
[More information needed]
Discussion of Biases
[More information needed]
Other Known Limitations
[More information needed]
Additional Information
Dataset Curators
The dataset was shared by @googleai.
License
The dataset is released under CC‑BY‑SA‑4.0.
Citation
[More information needed]
Contributions
[More information needed]