
Million Song Dataset

The dataset comprises nine relational tables covering artist‑ and track‑level features such as release year, geographic coordinates of the artist's location, track duration, artist popularity score, tempo, etc. It also includes user‑generated tags describing artists' nationality, genre, and descriptive terms.

Source

github

Created

Feb 25, 2018

Updated

Apr 12, 2024

Dataset Overview

Data Source

The dataset originates from the Million Song Dataset (MSD).
It was stored in a Postgres database by colleague Aaron Munoz and exported as CSV files.

Data Content

The dataset contains 9 relational tables covering artist‑level and track‑level features.
Tags and user‑generated tag terms are attached at the artist level, whereas the analysis is performed at the track level, so each track is described by its artist's tags.
Twelve genres were selected from the 7,643 unique genres; of these, jazz and rock were used for the analysis.
Primary features include release year, artist location coordinates, track duration, artist popularity score, track tempo, and 16 other descriptive numeric variables.
One‑hot encoding was applied to 2,314 tag terms, producing a sparse indicator matrix with one row per song (about 1 M rows) and one column per term.
Tag terms describe geography or nationality, genre, or the character of the songs.
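
The one‑hot step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `one_hot_sparse` and the toy tags are assumptions, and the matrix is kept in a CSR‑style layout (column indices plus row pointers) so that only the non‑zero entries are stored.

```python
from typing import Dict, List, Tuple


def one_hot_sparse(
    tags_per_song: List[List[str]],
) -> Tuple[Dict[str, int], List[int], List[int]]:
    """Encode each song's tag terms as one row of a CSR-style indicator matrix.

    Returns the term-to-column vocabulary plus the `indices` and `indptr`
    arrays. Every stored value is implicitly 1, so no data array is kept.
    """
    vocabulary: Dict[str, int] = {}
    indices: List[int] = []
    indptr: List[int] = [0]
    for tags in tags_per_song:
        # Assign a new column to unseen terms; duplicates collapse in the set.
        columns = {vocabulary.setdefault(term, len(vocabulary)) for term in tags}
        indices.extend(sorted(columns))
        indptr.append(len(indices))
    return vocabulary, indices, indptr


# Toy input (hypothetical tags, not taken from the dataset).
songs = [["rock", "british"], ["jazz"], ["rock", "jazz", "mellow"]]
vocabulary, indices, indptr = one_hot_sparse(songs)
print(vocabulary)  # {'rock': 0, 'british': 1, 'jazz': 2, 'mellow': 3}
print(indices)     # [0, 1, 2, 0, 2, 3]
print(indptr)      # [0, 2, 3, 6]
```

Row i of the matrix has ones in columns `indices[indptr[i]:indptr[i + 1]]`. With 2,314 terms and most songs carrying only a few of them, this layout avoids storing the zeros of a dense table.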

Data Processing

Dimensionality reduction shrank the final table from 18.5 GB to 2.6 GB.
99.5 % of the data were used for analysis and statistical modeling.

Analysis Methods

Because of overlap between genres, especially folk and country, a one‑vs‑rest approach was used for multi‑class classification.
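
A one‑vs‑rest setup turns the multi‑class problem into one binary target per genre. The sketch below assumes each track can carry a set of genre labels. The helper name `one_vs_rest_targets` and the toy tracks are hypothetical.

```python
from typing import Dict, List, Set


def one_vs_rest_targets(
    genres_per_track: List[Set[str]], target_genres: List[str]
) -> Dict[str, List[int]]:
    """Build one binary target vector per genre (1 = track carries the genre)."""
    return {
        genre: [1 if genre in genres else 0 for genres in genres_per_track]
        for genre in target_genres
    }


# Toy tracks (assumption); a track may carry several overlapping genres.
tracks = [{"rock"}, {"folk", "country"}, {"jazz"}, {"rock", "folk"}]
targets = one_vs_rest_targets(tracks, ["rock", "jazz"])
print(targets["rock"])  # [1, 0, 0, 1]
print(targets["jazz"])  # [0, 0, 1, 0]
```

Each target vector can then be fed to its own binary classifier, so overlapping genres do not have to compete for a single label.
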
Correlations among primary features were weak, all falling between −0.3 and 0.3.
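
A check of this kind can be sketched as below. The source does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the feature values are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(
    features: Dict[str, List[float]],
) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of feature columns."""
    return {
        (a, b): pearson(features[a], features[b])
        for a, b in combinations(features, 2)
    }


# Made-up values standing in for three primary features (assumption).
features = {
    "duration": [210.0, 185.0, 240.0, 200.0, 230.0],
    "tempo": [120.0, 128.0, 122.0, 90.0, 95.0],
    "year": [1999.0, 2001.0, 2003.0, 2005.0, 2007.0],
}
correlations = pairwise_correlations(features)
all_weak = all(abs(r) <= 0.3 for r in correlations.values())
```

On the real data, `all_weak` being true would correspond to the statement that every pairwise correlation lies within ±0.3.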

Model Setup

A 64/16/20 train‑validation‑test split was employed for statistical modeling.
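
One common way to arrive at a 64/16/20 split is to hold out 20 % for testing and then 20 % of the remaining 80 % for validation (0.8 × 0.2 = 0.16). The source does not state how the split was produced or whether it was stratified, so the sketch below is only one plausible reading. The function name `split_64_16_20` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_64_16_20(
    rows: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle, hold out 20 % for testing, then 20 % of the rest for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(0.20 * len(shuffled))
    n_val = round(0.20 * (len(shuffled) - n_test))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_64_16_20(range(1000))
print(len(train), len(val), len(test))  # 640 160 200
```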
Positive class weights were 76.7 % for rock and 47 % for jazz.

Statistical Modeling

A basic logistic regression model showed that the artist familiarity score had the largest effect on the odds of a song being rock, with an odds ratio between 25.23 and 27.96.
For jazz, song mode and time‑signature confidence were not significant variables.
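
In logistic regression, the odds ratio of a feature is the exponential of its coefficient, so the reported ratios can be mapped back to the log‑odds scale. The helper names below are hypothetical, and the only numbers used are the two odds‑ratio bounds reported above.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in a feature of a logistic model."""
    return math.exp(coefficient)


def coefficient_from_odds_ratio(ratio: float) -> float:
    """Inverse mapping: the log-odds coefficient implied by an odds ratio."""
    return math.log(ratio)


# Odds-ratio bounds reported for artist familiarity on rock.
low, high = 25.23, 27.96
coef_low = coefficient_from_odds_ratio(low)
coef_high = coefficient_from_odds_ratio(high)
print(round(coef_low, 2), round(coef_high, 2))
```

The implied coefficients are roughly 3.2 to 3.3 on the log‑odds scale. How large that is in practice depends on the units and scaling of the familiarity score, which the source does not specify.
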
Balanced class weights were selected because they gave the best AUC score.
GridSearchCV was used to tune hyper‑parameters for the logistic regression and decision tree models; the final choice was a logistic regression model with balanced class weights.
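
A tuning run of this kind might look like the sketch below, which uses scikit‑learn. The synthetic data, the parameter grid, and the five‑fold setting are assumptions for illustration; the source does not list the grid that was searched. With `class_weight="balanced"`, scikit‑learn weights each class by `n_samples / (n_classes * count_of_class)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real feature table (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",  # select the setting with the best cross-validated AUC
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to a decision tree by swapping the estimator and the grid, after which the best cross‑validated scores of the two searches can be compared.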

Model Performance

For the jazz genre, the model with term‑tag features achieved an F1 score of 0.602, ROC‑AUC of 0.6865, and accuracy of 0.6347, a 5.35 % improvement over the primary‑feature model.
For the rock genre, the model achieved an F1 score of 0.7616, ROC‑AUC of 0.7753, and accuracy of 0.6808, a 6.38 % improvement over the primary‑feature model.
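
For reference, the three reported metrics can be computed as sketched below. The labels and scores are toy values, not model output, and the pairwise ROC‑AUC is written for clarity rather than speed.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (assumption).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the 0.5 decision threshold used here, whereas ROC‑AUC is computed from the raw scores and is independent of any threshold.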

Conclusion

Statistical analysis indicates that user‑generated term‑tag features significantly improve classification performance for jazz and rock songs.
Integrating user‑generated tags from a social platform into existing song recommendation systems is therefore recommended and could raise classification accuracy by approximately 6 %.
