Million Song Dataset
The dataset comprises nine relational tables covering artist‑ and track‑level features such as release year, geographic coordinates of the artist's location, track duration, artist popularity score, tempo, etc. It also includes user‑generated tags describing artists' nationality, genre, and descriptive terms.
Updated 4/12/2024
Description
Dataset Overview
Data Source
- The dataset originates from the Million Song Dataset (MSD).
- It was stored in a Postgres database by colleague Aaron Munoz and exported as CSV files.
Data Content
- The dataset contains 9 relational tables covering artist‑level and track‑level features.
- Tags and user‑generated tag terms are defined at the artist level, whereas the analysis is performed at the track level.
- Twelve genres were selected from 7,643 unique genres, with jazz and rock used for analysis.
- Primary features include release year, artist location coordinates, track duration, artist popularity score, track tempo, and 16 other descriptive numeric variables.
- One‑hot encoding was applied to 2,314 tag terms, producing a sparse indicator matrix covering about 1 M songs.
- Tag terms include geography/nationality, genre, or song description.
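The encoding step described above can be sketched in plain Python. This is a minimal illustration, not the pipeline used for the dataset: the artist names, tags, and the `track_artist` mapping are hypothetical, and each row is stored as a set of active column indices to mimic a sparse matrix. Tracks inherit the tags of their artist, which reflects the artist-level tagging and track-level analysis noted in the list.

```python
from typing import Dict, List, Set, Tuple


def one_hot_sparse(
    artist_tags: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, Set[int]]]:
    """Return the sorted tag vocabulary and, per artist, the column indices set to 1."""
    vocab = sorted({tag for tags in artist_tags.values() for tag in tags})
    column = {tag: i for i, tag in enumerate(vocab)}
    rows = {artist: {column[t] for t in tags} for artist, tags in artist_tags.items()}
    return vocab, rows


def dense_row(active: Set[int], width: int) -> List[int]:
    """Expand a sparse row (set of active columns) into a dense 0/1 list."""
    return [1 if i in active else 0 for i in range(width)]


# Hypothetical artist-level tags (not taken from the dataset).
artist_tags = {
    "artist_a": ["rock", "british"],
    "artist_b": ["jazz", "instrumental"],
    "artist_c": ["rock", "instrumental"],
}
# Hypothetical track -> artist mapping; each track inherits its artist's tags.
track_artist = {"track_1": "artist_a", "track_2": "artist_a", "track_3": "artist_b"}

vocab, artist_rows = one_hot_sparse(artist_tags)
track_rows = {track: artist_rows[artist] for track, artist in track_artist.items()}
```

Storing only the active columns keeps memory proportional to the number of tags actually assigned, which matters when most of the 2,314 columns are zero for any given song.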
Data Processing
- Dimensionality reduction shrank the final table from 18.5 GB to 2.6 GB.
- 99.5 % of the data were used for analysis and statistical modeling.
Analysis Methods
- Because of overlap between genres, especially folk and country, a one‑vs‑rest approach was used for multi‑class classification.
- Correlations among primary features were weak, all falling between ‑0.3 and 0.3.
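As a rough illustration of the one-vs-rest setup, each genre gets its own binary target, so a track tagged with overlapping genres (for example folk and country) can be positive in more than one problem. The track identifiers and genre assignments below are hypothetical.

```python
from typing import Dict, List, Set


def one_vs_rest_labels(
    track_genres: Dict[str, Set[str]], genres: List[str]
) -> Dict[str, Dict[str, int]]:
    """For each genre, label a track 1 if it carries that genre, else 0."""
    return {
        genre: {track: int(genre in assigned) for track, assigned in track_genres.items()}
        for genre in genres
    }


# Hypothetical genre assignments; track_2 overlaps folk and country.
track_genres = {
    "track_1": {"rock"},
    "track_2": {"folk", "country"},
    "track_3": {"jazz"},
    "track_4": {"country"},
}
labels = one_vs_rest_labels(track_genres, ["rock", "jazz", "folk", "country"])
```

Each entry of `labels` can then be fed to a separate binary classifier, which is how the rock and jazz models in this description are framed.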
Model Setup
- A 64/16/20 train‑validation‑test split was employed for statistical modeling.
- Positive class weights were 76.7 % for rock and 47 % for jazz.
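A 64/16/20 split can be sketched by shuffling once and slicing, as below. The helper names, the seed, and the 1,000-item example are assumptions for illustration. The second helper shows the common "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`. The example labels assume that the 76.7 % figure is the share of positive (rock) examples, which the description does not state explicitly.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def train_val_test_split(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle once, then slice into 64 % train, 16 % validation, 20 % test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


indices = list(range(1000))
train, val, test = train_val_test_split(indices)

# Hypothetical label vector with 76.7 % positives.
example_labels = [1] * 767 + [0] * 233
weights = balanced_class_weights(example_labels)
```

With balanced weights, the total weight of each class is equal, so the minority class is not drowned out during fitting.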
Statistical Modeling
- A basic logistic regression model showed that artist familiarity score had the largest impact on rock songs, with an odds ratio between 25.23 and 27.96.
- For jazz, song mode and time‑signature confidence were not significant variables.
- Balanced class weights were selected to achieve the best AUC score.
- GridSearchCV was used to tune hyper‑parameters for logistic regression and decision tree models, finally selecting a balanced logistic regression model.
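The tuning step might look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification`, and the parameter grid, fold count, and feature count are assumptions rather than the settings behind the reported results. Only the logistic regression branch is shown. Odds ratios are obtained by exponentiating the fitted coefficients, which gives the multiplicative change in odds per one-unit increase in a feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features come from the dataset tables.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=4,
    weights=[0.25, 0.75],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid: regularisation strength and class weighting.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # select by AUC, as in the description
    cv=5,
)
search.fit(X_train, y_train)

best = search.best_estimator_
# exp(coefficient) = odds ratio for a one-unit increase in that feature.
odds_ratios = np.exp(best.coef_[0])
```

A decision tree could be tuned the same way by swapping the estimator and grid, then comparing `best_score_` across the two searches.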
Model Performance
- For the jazz genre, the model with term‑tag features achieved an F1 score of 0.602, ROC‑AUC of 0.6865, and accuracy of 0.6347, a 5.35 % improvement over the primary‑feature model.
- For the rock genre, the model achieved an F1 score of 0.7616, ROC‑AUC of 0.7753, and accuracy of 0.6808, a 6.38 % improvement over the primary‑feature model.
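For reference, the three reported metrics can be computed as in this dependency-free sketch. The labels and scores are made-up values, and the 0.5 decision threshold is an assumption. ROC‑AUC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.6, 0.4, 0.5, 0.2, 0.8]
y_pred = [int(s >= 0.5) for s in scores]
```

Accuracy and F1 depend on the chosen threshold, whereas ROC‑AUC depends only on how the scores rank the examples, which is why the three numbers can move independently.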
Conclusion
- Statistical analysis indicates that user‑generated term‑tag features significantly improve classification performance for jazz and rock songs.
- The recommendation is to integrate user‑generated tags from a social platform into existing song recommendation systems, which could increase classification accuracy by approximately 6 %.
Topics
Music Data Analysis
Artist Features
Source
Organization: github
Created: 2/25/2018