MovieLens

The MovieLens dataset includes user ratings for various movies, movie titles and genres, user‑generated movie tags, and identifiers from external movie data sources.

Updated 9/14/2024

Description

MovieLens Recommendation System Project

Overview

This project, developed as part of the HarvardX Data Science Capstone course, focuses on using the MovieLens dataset to build a movie recommendation system. The goal is to predict movie ratings based on users' past ratings and evaluate model performance using Root Mean Square Error (RMSE).
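As a point of reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. The snippet below is a minimal sketch using only the Python standard library; the function name `rmse` and the sample ratings are illustrative assumptions, not code taken from the project.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings, for illustration only.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predictions = [3.5, 3.0, 4.0, 2.5]
print(round(rmse(true_ratings, predictions), 4))
```

A lower value means the predictions sit closer to the true ratings, which is why the models later in this document are ranked by RMSE.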

The project is implemented in two programming environments:

  • Python with Jupyter Notebooks, constructing and visualizing multiple models.
  • R with R scripts and R Markdown to ensure reproducibility and detailed documentation.

The two implementations solve the same problem with different tools and produce comparable RMSE values.

Project Structure

The repository is organized as follows:

├── Capstone.ipynb      # Python Jupyter Notebook implementation
├── MovieReviews.R      # R script for the recommendation system
├── MovieReviews.Rmd    # R Markdown report for the R implementation
├── MovieReviews.pdf    # PDF report generated from R Markdown
├── README.md           # Project overview (this file)
├── links.csv           # MovieLens links dataset
├── movies.csv          # MovieLens movies dataset
├── ratings.csv         # MovieLens ratings dataset
└── tags.csv            # MovieLens tags dataset

Dataset

The project uses a subset of the MovieLens dataset, including:

  • ratings.csv: User ratings for various movies.
  • movies.csv: Movie titles and genres.
  • tags.csv: User‑generated movie tags.
  • links.csv: Identifiers for external movie data sources (IMDB, TMDb).
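The files listed above are typically joined on a shared movie identifier. The sketch below shows one way to do that with the standard library. The column names (`userId`, `movieId`, `rating`, `timestamp`, `title`, `genres`) follow the usual MovieLens CSV headers and are an assumption here, since this document does not list them; the inline sample rows are made up so the example runs without the real files.

```python
import csv
import io

# Tiny inline samples standing in for ratings.csv and movies.csv.
# The values are invented; the headers are assumed, not confirmed by this README.
RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.0,964982703
1,20,3.5,964981247
2,10,5.0,964982224
"""
MOVIES_CSV = """movieId,title,genres
10,Example Movie A (1995),Comedy|Romance
20,Example Movie B (1999),Action
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Index movies by id so each rating can be enriched with title and genres.
movies = {row["movieId"]: row for row in read_rows(MOVIES_CSV)}

ratings = []
for row in read_rows(RATINGS_CSV):
    movie = movies[row["movieId"]]
    ratings.append({
        "userId": int(row["userId"]),
        "movieId": int(row["movieId"]),
        "rating": float(row["rating"]),
        "title": movie["title"],
        "genres": movie["genres"].split("|"),  # assumed pipe-separated
    })

for r in ratings:
    print(r["userId"], r["title"], r["rating"], r["genres"])
```

With the real files, `io.StringIO(text)` would be replaced by an opened file handle; the notebook itself presumably uses pandas for this step.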

Environment Requirements

Python Environment

Running the Python implementation requires the following libraries:

  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • Jupyter Notebook

Install the required packages with:

pip install pandas numpy scikit-learn matplotlib jupyter

R Environment

For the R implementation, the following R packages are needed:

  • tidyverse
  • caret

Install them with:

install.packages("tidyverse")
install.packages("caret")

Model Development

The project involves the following steps:

  • Split the dataset into training (edx) and validation (final_holdout_test) sets.
  • Build and evaluate multiple models to predict movie ratings.
  • Visualize results and compare RMSE for each model.
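The split step can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: the 10% holdout fraction, the seed, and the function name `split_ratings` are hypothetical, and the rule that holdout rows must refer to users and movies already present in the training set is a common convention for this kind of evaluation that the README does not spell out.

```python
import random


def split_ratings(ratings, holdout_fraction=0.1, seed=1):
    """Split (user, movie, rating) tuples into training and holdout lists.

    Holdout rows whose user or movie never appears in the training set are
    moved back into training, so every holdout row can be scored.
    """
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    candidate, train = shuffled[:n_holdout], shuffled[n_holdout:]

    train_users = {u for u, _, _ in train}
    train_movies = {m for _, m, _ in train}
    holdout = []
    for row in candidate:
        user, movie, _ = row
        if user in train_users and movie in train_movies:
            holdout.append(row)
        else:
            train.append(row)  # keep the row, but use it for training instead
    return train, holdout


# Synthetic ratings: 20 users x 10 movies, values in 1.0..5.0.
ratings = [(u, m, float((u + m) % 5 + 1)) for u in range(20) for m in range(10)]
edx, final_holdout_test = split_ratings(ratings)
print(len(edx), len(final_holdout_test))
```

Fixing the seed keeps the split reproducible, which matters when RMSE values from different models are compared against the same holdout set.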

Implemented Models

Both Python and R implement the following models:

  1. Baseline Model: Predict using the average rating of all movies.

    • Python RMSE: 0.9665
    • R RMSE: 1.0425
  2. Movie Effect Model: Adjust for movie‑specific bias.

    • Python RMSE: 0.9665
    • R RMSE: 0.9617
  3. Movie + User Effect Model: Adds user‑specific bias to the movie effect.

    • Python RMSE: 0.9071
    • R RMSE: 0.8731
  4. Regularized Movie + User Effect Model: Applies regularization to prevent overfitting.

    • Python RMSE: 0.8742
    • R RMSE: 0.8527
  5. Hybrid Model: Combines multiple models (movie effect, user effect, regularization) and incorporates Matrix Factorization (SVD) and K‑Nearest Neighbors (KNN).

    • Python Hybrid Model RMSE: 0.8655
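Models 1 to 4 above can be illustrated with a small sketch. It assumes the standard formulation of these models (prediction = global mean + movie effect + user effect, with a penalty `lam` added to the rating counts for regularization); the README does not state the exact formulas, so the function names, the toy data, and the value `lam=1.0` are assumptions. The hybrid model's SVD and KNN components are not covered here.

```python
import math
from collections import defaultdict


def fit_bias_model(train, lam=0.0):
    """Fit the global mean mu, movie effects b_i and user effects b_u.

    lam = 0 gives the plain movie + user effect model; lam > 0 shrinks
    effects that are estimated from few ratings toward zero.
    """
    mu = sum(r for _, _, r in train) / len(train)

    movie_sum, movie_n = defaultdict(float), defaultdict(int)
    for _, movie, r in train:
        movie_sum[movie] += r - mu
        movie_n[movie] += 1
    b_i = {m: movie_sum[m] / (movie_n[m] + lam) for m in movie_sum}

    user_sum, user_n = defaultdict(float), defaultdict(int)
    for user, movie, r in train:
        user_sum[user] += r - mu - b_i[movie]
        user_n[user] += 1
    b_u = {u: user_sum[u] / (user_n[u] + lam) for u in user_sum}
    return mu, b_i, b_u


def predict(model, user, movie):
    """Unknown users or movies contribute a zero effect."""
    mu, b_i, b_u = model
    return mu + b_i.get(movie, 0.0) + b_u.get(user, 0.0)


def model_rmse(model, rows):
    squared = [(r - predict(model, u, m)) ** 2 for u, m, r in rows]
    return math.sqrt(sum(squared) / len(squared))


# Toy (user, movie, rating) data, invented for illustration.
train = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m2", 2.0),
    ("u3", "m1", 5.0), ("u3", "m3", 1.0),
]
test = [("u3", "m2", 3.0), ("u1", "m3", 2.0)]

mu, b_i, b_u = fit_bias_model(train)
models = {
    "baseline": (mu, {}, {}),
    "movie effect": (mu, b_i, {}),
    "movie + user effect": (mu, b_i, b_u),
    "regularized (lam=1)": fit_bias_model(train, lam=1.0),
}
for name, model in models.items():
    print(f"{name}: RMSE = {model_rmse(model, test):.4f}")
```

In practice the penalty would be chosen by trying several values and keeping the one with the lowest RMSE on data held out from the training set. Results on a toy set this small say nothing about the ordering of the real models.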

Visualizations (Python)

The Jupyter Notebook generates several visualizations to analyze data and model performance:

  • Rating Distribution: Shows the distribution of ratings, highlighting integer peaks.
  • Top 10 Most Rated Movies: Bar chart of movies receiving the most ratings.
  • Average Rating by Release Year: Scatter plot of average rating versus movie release year.
  • RMSE Comparison: Bar chart comparing RMSE across models.
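The aggregates behind the first three charts can be computed as sketched below. This uses only the standard library and made-up rows; it assumes that titles end with the release year in parentheses, which is the usual MovieLens convention but is not stated in this README. Plotting itself (for example with matplotlib, as listed in the requirements) is left out.

```python
import re
from collections import Counter, defaultdict

# Made-up (title, rating) pairs standing in for ratings joined with movies.
rows = [
    ("Example Movie A (1995)", 4.0), ("Example Movie A (1995)", 3.5),
    ("Example Movie A (1995)", 5.0), ("Example Movie B (1999)", 3.0),
    ("Example Movie B (1999)", 4.0), ("Example Movie C (1999)", 2.0),
]

# Rating distribution: how often each rating value occurs.
rating_counts = Counter(rating for _, rating in rows)

# Most rated movies: titles ordered by number of ratings received.
top_movies = Counter(title for title, _ in rows).most_common(10)

# Average rating by release year, taking the year from a trailing "(YYYY)".
by_year = defaultdict(list)
for title, rating in rows:
    match = re.search(r"\((\d{4})\)\s*$", title)
    if match:  # titles without a recognizable year are skipped
        by_year[int(match.group(1))].append(rating)
avg_by_year = {year: sum(v) / len(v) for year, v in sorted(by_year.items())}

print(rating_counts)
print(top_movies)
print(avg_by_year)
```

Each result maps directly onto one chart: `rating_counts` to the distribution plot, `top_movies` to the bar chart, and `avg_by_year` to the scatter plot.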

Results Comparison

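Results from Python and R follow the same trend: RMSE falls as user effects and regularization are added. Among the models implemented in both environments, the regularized movie + user effect model performs best. The hybrid model, implemented in Python only, reaches a lower RMSE (0.8655) than the Python regularized model (0.8742).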

Model                             Python RMSE   R RMSE
Baseline Model                    0.9665        1.0425
Movie Effect Model                0.9665        0.9617
Movie + User Effect Model         0.9071        0.8731
Regularized Movie + User Model    0.8742        0.8527
Hybrid Model                      0.8655        N/A

Conclusion

The project demonstrates the process of building a movie recommendation system using Python and R. The lowest RMSE overall, 0.8527, is achieved by the regularized movie + user effect model in R; in Python, the hybrid model (0.8655) improves on the regularized model (0.8742). Future improvements could include more advanced techniques, such as further work on matrix factorization or neural networks, to optimize the system further.

Topics

Movie Recommendation
User Behaviour Analysis

Source

Organization: github

Created: 9/14/2024