MovieLens Recommendation System Project

Overview

This project, developed as part of the HarvardX Data Science Capstone course, focuses on using the MovieLens dataset to build a movie recommendation system. The goal is to predict movie ratings based on users' past ratings and evaluate model performance using Root Mean Square Error (RMSE).
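For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower values mean better predictions. A minimal sketch of the metric follows; the function name `rmse` is illustrative and is not taken from the notebook or the R script.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

An RMSE of 1.0 means predictions are off by about one star on typical ratings.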

The project is implemented in two programming environments:

Python with Jupyter Notebooks, used to build and visualize multiple models.
R with R scripts and R Markdown, used for reproducibility and detailed documentation.

Both implementations solve the same problem with different tools and yield comparable results.

Project Structure

The repository is organized as follows:

├── Capstone.ipynb      # Python Jupyter Notebook implementation
├── MovieReviews.R      # R script for the recommendation system
├── MovieReviews.Rmd    # R Markdown report for the R implementation
├── MovieReviews.pdf    # PDF report generated from R Markdown
├── README.md           # Project overview (this file)
├── links.csv           # MovieLens links dataset
├── movies.csv          # MovieLens movies dataset
├── ratings.csv         # MovieLens ratings dataset
└── tags.csv            # MovieLens tags dataset

Dataset

The project uses a subset of the MovieLens dataset, including:

ratings.csv: User ratings for various movies.
movies.csv: Movie titles and genres.
tags.csv: User‑generated movie tags.
links.csv: Identifiers for external movie data sources (IMDb, TMDb).

Environment Requirements

Python Environment

Running the Python implementation requires the following libraries:

pandas
numpy
scikit-learn
matplotlib
Jupyter Notebook

Install the required packages with:

pip install pandas numpy scikit-learn matplotlib jupyter

R Environment

For the R implementation, the following R packages are needed:

tidyverse
caret

Install them with:

install.packages("tidyverse")
install.packages("caret")

Model Development

The project involves the following steps:

Split the dataset into training (edx) and validation (final_holdout_test) sets.
Build and evaluate multiple models to predict movie ratings.
Visualize results and compare RMSE for each model.
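The split step can be sketched as follows. This is only an illustration under stated assumptions: the 10% holdout fraction, the seed, and the function name `split_ratings` are hypothetical and are not taken from the repository. The sketch moves holdout rows back into training when their user or movie is absent from the training data, so that models based on per-user and per-movie effects can score every holdout row.

```python
import random

def split_ratings(ratings, holdout_fraction=0.1, seed=1):
    """Split (user_id, movie_id, rating) tuples into training and holdout lists.

    holdout_fraction and seed are illustrative defaults, not project settings.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    candidate, train = shuffled[:n_holdout], shuffled[n_holdout:]

    # Users and movies that the training data already covers.
    train_users = {user for user, _, _ in train}
    train_movies = {movie for _, movie, _ in train}

    holdout = []
    for row in candidate:
        user, movie, _ = row
        if user in train_users and movie in train_movies:
            holdout.append(row)
        else:
            train.append(row)  # unseen user or movie: keep the row in training
    return train, holdout
```

Here `train` plays the role of edx and `holdout` the role of final_holdout_test.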

Implemented Models

Both Python and R implement the first four models below; the hybrid model is implemented in Python only:

Baseline Model: Predict using the average rating of all movies.
- Python RMSE: 0.9665
- R RMSE: 1.0425
Movie Effect Model: Adjust for movie‑specific bias.
- Python RMSE: 0.9665
- R RMSE: 0.9617
Movie + User Effect Model: Adds user‑specific bias to the movie effect.
- Python RMSE: 0.9071
- R RMSE: 0.8731
Regularized Movie + User Effect Model: Applies regularization to prevent overfitting.
- Python RMSE: 0.8742
- R RMSE: 0.8527
Hybrid Model: Combines multiple models (movie effect, user effect, regularization) and incorporates Matrix Factorization (SVD) and K‑Nearest Neighbors (KNN).
- Python Hybrid Model RMSE: 0.8655
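
The first four models share one structure: a global mean, a per-movie offset, and a per-user offset, with an optional penalty that shrinks offsets estimated from few ratings. The sketch below shows that structure in plain Python. It is an assumption-laden illustration, not the repository's code: the names `fit_effects`, `predict`, and `lam` are hypothetical, and the SVD and KNN parts of the hybrid model are not covered.

```python
from collections import defaultdict

def fit_effects(train, lam=0.0):
    """Fit the global mean, movie effects b_i, and user effects b_u.

    train: iterable of (user_id, movie_id, rating) tuples.
    lam:   regularization strength; 0 gives the unregularized model.
    """
    train = list(train)
    mu = sum(rating for _, _, rating in train) / len(train)

    # Movie effect: penalized mean residual after removing the global mean.
    movie_sum, movie_n = defaultdict(float), defaultdict(int)
    for _, movie, rating in train:
        movie_sum[movie] += rating - mu
        movie_n[movie] += 1
    b_i = {m: movie_sum[m] / (movie_n[m] + lam) for m in movie_sum}

    # User effect: penalized mean residual after removing mean and movie effect.
    user_sum, user_n = defaultdict(float), defaultdict(int)
    for user, movie, rating in train:
        user_sum[user] += rating - mu - b_i[movie]
        user_n[user] += 1
    b_u = {u: user_sum[u] / (user_n[u] + lam) for u in user_sum}
    return mu, b_i, b_u

def predict(mu, b_i, b_u, user, movie):
    """Predict mu + b_i + b_u; unseen users or movies contribute zero."""
    return mu + b_i.get(movie, 0.0) + b_u.get(user, 0.0)
```

Predicting `mu` alone corresponds to the baseline model, `mu + b_i` to the movie effect model, and the full sum to the movie + user effect model. Setting `lam` above zero gives the regularized variant; in practice the value would be chosen by comparing RMSE on data held out from the training set.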

Visualizations (Python)

The Jupyter Notebook generates several visualizations to analyze data and model performance:

Rating Distribution: Shows the distribution of ratings, highlighting integer peaks.
Top 10 Most Rated Movies: Bar chart of movies receiving the most ratings.
Average Rating by Release Year: Scatter plot of average rating versus movie release year.
RMSE Comparison: Bar chart comparing RMSE across models.

Results Comparison

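Results from Python and R are broadly comparable. Among the models implemented in both environments, the regularized movie + user effect model performs best; the Python-only hybrid model achieves the lowest Python RMSE: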

Model	Python RMSE	R RMSE
Baseline Model	0.9665	1.0425
Movie Effect Model	0.9665	0.9617
Movie + User Effect Model	0.9071	0.8731
Regularized Movie + User Model	0.8742	0.8527
Hybrid Model	0.8655	N/A
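
To put these numbers on a common scale, each RMSE can be expressed as a reduction relative to the baseline in the same environment. The sketch below does that arithmetic with the values copied from the table; the helper name `relative_improvement` is illustrative.

```python
# RMSE values copied from the results table; None marks a missing R result.
results = {
    "Baseline Model": (0.9665, 1.0425),
    "Movie Effect Model": (0.9665, 0.9617),
    "Movie + User Effect Model": (0.9071, 0.8731),
    "Regularized Movie + User Model": (0.8742, 0.8527),
    "Hybrid Model": (0.8655, None),
}

def relative_improvement(baseline, value):
    """Fractional RMSE reduction relative to the baseline model."""
    return (baseline - value) / baseline

py_base, r_base = results["Baseline Model"]
for model, (py_rmse, r_rmse) in results.items():
    py_gain = relative_improvement(py_base, py_rmse)
    if r_rmse is None:
        r_text = "N/A"
    else:
        r_text = f"{relative_improvement(r_base, r_rmse):.1%}"
    print(f"{model:32s} Python {py_gain:6.1%}   R {r_text}")
```

By this measure the best model in each environment reduces RMSE by roughly 10% in Python and roughly 18% in R. Part of that gap comes from the higher R baseline.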

Conclusion

The project demonstrates the process of building a movie recommendation system using Python and R. The lowest RMSE overall, 0.8527, is achieved by the regularized movie + user effect model in R; in Python, the hybrid model performs best with an RMSE of 0.8655. Future improvements could include extending matrix factorization to the R implementation or exploring neural networks to further reduce RMSE.
