
Million Song Dataset

The dataset comprises nine relational tables covering artist‑ and track‑level features such as release year, geographic coordinates of the artist's location, track duration, artist popularity score, tempo, etc. It also includes user‑generated tags describing artists' nationality, genre, and descriptive terms.

Updated 4/12/2024
github

Description

Dataset Overview

Data Source

Data Content

  • The dataset contains 9 relational tables covering artist‑level and track‑level features.
  • Tags and user‑generated terms are recorded at the artist level, whereas the analysis is performed at the track level.
  • Twelve genres were selected from 7,643 unique genres, with jazz and rock used for analysis.
  • Primary features include release year, artist location coordinates, track duration, artist popularity score, track tempo, and 16 other descriptive numeric variables.
  • One‑hot encoding was applied to 2,314 tag terms, producing a sparse matrix covering 1 M songs.
  • Tag terms include geography/nationality, genre, or song description.
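
The one‑hot encoding of tag terms described above can be sketched as follows. This is a minimal illustration, not the dataset author's code: the function names `one_hot_sparse` and `density` and the toy term lists are hypothetical, and the sparse layout (song index mapped to the column indices that are set) is only one of several possible representations.

```python
def one_hot_sparse(song_terms):
    """Encode per-song tag terms as a sparse 0/1 matrix.

    Returns (vocab, rows): vocab maps each term to a column index,
    rows maps a song index to the sorted column indices set to 1.
    """
    # Assign column indices in sorted order so the layout is deterministic.
    all_terms = sorted({term for terms in song_terms for term in terms})
    vocab = {term: col for col, term in enumerate(all_terms)}

    rows = {}
    for song_idx, terms in enumerate(song_terms):
        cols = sorted({vocab[term] for term in terms})
        if cols:  # songs without any term need no storage at all
            rows[song_idx] = cols
    return vocab, rows


def density(vocab, rows, n_songs):
    """Fraction of matrix cells that are non-zero."""
    if not vocab or not n_songs:
        return 0.0
    nnz = sum(len(cols) for cols in rows.values())
    return nnz / (n_songs * len(vocab))


# Hypothetical toy data: four songs with artist-level terms attached.
songs = [["jazz", "american"], ["rock", "british", "guitar"], [], ["jazz", "piano"]]
vocab, rows = one_hot_sparse(songs)
fill = density(vocab, rows, len(songs))
```

Storing only the non‑zero positions is what keeps a matrix of this shape manageable: most songs carry a handful of the available terms, so nearly all cells are zero.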

Data Processing

  • Dimensionality reduction shrank the final table from 18.5 GB to 2.6 GB.
  • 99.5 % of the data were used for analysis and statistical modeling.

Analysis Methods

  • Because of overlap between genres, especially folk and country, a one‑vs‑rest approach was used for multi‑class classification.
  • Correlations among primary features were weak, all falling between −0.3 and 0.3.
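
A one‑vs‑rest setup turns each genre into its own binary target, so a song tagged with two overlapping genres (for example folk and country) counts as a positive for both. The sketch below shows only the label construction; the helper name `one_vs_rest_targets` and the toy genre sets are hypothetical.

```python
def one_vs_rest_targets(song_genres, genres):
    """Build one binary target vector per genre.

    song_genres: list of sets, the genres attached to each song.
    genres: the genres to model, one binary classifier each.
    Returns {genre: [0 or 1 per song]}.
    """
    return {
        genre: [1 if genre in tags else 0 for tags in song_genres]
        for genre in genres
    }


# Hypothetical toy data: the first and last songs carry two genres each.
songs = [{"folk", "country"}, {"rock"}, {"jazz"}, {"country"}, {"rock", "folk"}]
targets = one_vs_rest_targets(songs, ["folk", "country", "rock", "jazz"])
```

Each vector in `targets` can then be used to fit a separate binary classifier, which avoids forcing an overlapping song into a single class.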

Model Setup

  • A 64/16/20 train‑validation‑test split was employed for statistical modeling.
  • Positive class weights were 76.7 % for rock and 47 % for jazz.
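
A 64/16/20 split is what results from holding out 20 % for testing and then taking 20 % of the remaining 80 % for validation (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). Whether the original analysis split the data in exactly this way is an assumption. The sketch below shuffles indices and slices them; the function name `split_64_16_20` and the seed are hypothetical, and no stratification by class is applied.

```python
import random


def split_64_16_20(n_items, seed=0):
    """Return (train, val, test) index lists in a 64/16/20 ratio."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    # Integer arithmetic avoids floating-point rounding surprises.
    n_test = n_items * 20 // 100
    n_val = n_items * 16 // 100

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # remainder, about 64 %
    return train, val, test


train, val, test = split_64_16_20(1000)
```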

Statistical Modeling

  • A basic logistic regression model showed that artist familiarity score had the largest impact on rock songs, with an odds ratio between 25.23 and 27.96.
  • For jazz, song mode and time‑signature confidence were not significant variables.
  • Balanced class weights were selected to achieve the best AUC score.
  • GridSearchCV was used to tune hyper‑parameters for logistic regression and decision tree models; a balanced logistic regression model was ultimately selected.
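
Two quantities mentioned above can be made concrete with a little arithmetic. An odds ratio is the exponential of a logistic regression coefficient, and "balanced" class weights are commonly computed as n_samples / (n_classes × class_count), the heuristic scikit‑learn documents for `class_weight="balanced"`. That the original model used exactly this setting is an assumption; the helper names and the 75/25 toy labels below are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)


# Hypothetical toy labels: 75 positives and 25 negatives.
labels = [1] * 75 + [0] * 25
weights = balanced_class_weights(labels)

# The reported odds-ratio interval 25.23 to 27.96 corresponds to this
# interval on the coefficient (log-odds) scale.
coef_low, coef_high = math.log(25.23), math.log(27.96)
```

With balanced weights, every class contributes the same total weight to the loss, so the minority class is not drowned out by the majority.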

Model Performance

  • For the jazz genre, the model with term‑tag features achieved an F1 score of 0.602, ROC‑AUC of 0.6865, and accuracy of 0.6347, a 5.35 % improvement over the primary‑feature model.
  • For the rock genre, the model achieved an F1 score of 0.7616, ROC‑AUC of 0.7753, and accuracy of 0.6808, a 6.38 % improvement over the primary‑feature model.
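
For reference, the three metrics reported above can be computed from labels and scores as in the following sketch. It uses the standard definitions (F1 = 2·TP / (2·TP + FP + FN); ROC‑AUC as the probability that a random positive is scored above a random negative, ties counting half). The function names, the toy data, and the 0.5 decision threshold are hypothetical, and the pairwise AUC loop is quadratic, so it suits small illustrations only.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```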

Conclusion

  • Statistical analysis indicates that user‑generated term‑tag features significantly improve classification performance for jazz and rock songs.
  • The analysis recommends integrating user‑generated tag data from a social platform into existing song recommendation systems, which is expected to raise classification accuracy by approximately 6 %.


Topics

Music Data Analysis
Artist Features

Source

Organization: github

Created: 2/25/2018
