MovieLens
The MovieLens dataset includes user ratings for various movies, movie titles and genres, user‑generated movie tags, and identifiers from external movie data sources.
Description
MovieLens Recommendation System Project
Overview
This project, developed as part of the HarvardX Data Science Capstone course, focuses on using the MovieLens dataset to build a movie recommendation system. The goal is to predict movie ratings based on users' past ratings and evaluate model performance using Root Mean Square Error (RMSE).
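As a point of reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. The function below is a minimal, dependency-free sketch of that metric; the name `rmse` and its signature are illustrative and not taken from the notebook or the R script.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    # Average the squared errors, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A lower value means the predictions are, on average, closer to the observed ratings.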
The project is implemented in two programming environments:
- Python with Jupyter Notebooks, constructing and visualizing multiple models.
- R with R scripts and R Markdown to ensure reproducibility and detailed documentation.
The two implementations solve the same problem with different tools and yield comparable results.
Project Structure
The repository is organized as follows:
├── Capstone.ipynb      # Python Jupyter Notebook implementation
├── MovieReviews.R      # R script for the recommendation system
├── MovieReviews.Rmd    # R Markdown report for the R implementation
├── MovieReviews.pdf    # PDF report generated from R Markdown
├── README.md           # Project overview (this file)
├── links.csv           # MovieLens links dataset
├── movies.csv          # MovieLens movies dataset
├── ratings.csv         # MovieLens ratings dataset
├── tags.csv            # MovieLens tags dataset
Dataset
The project uses a subset of the MovieLens dataset, including:
- ratings.csv: User ratings for various movies.
- movies.csv: Movie titles and genres.
- tags.csv: User‑generated movie tags.
- links.csv: Identifiers for external movie data sources (IMDB, TMDb).
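A minimal sketch of reading two of these files follows. It assumes the column headers used by standard MovieLens releases (`userId,movieId,rating,timestamp` and `movieId,title,genres`, with genres separated by `|`), which this README does not list. It uses the standard-library `csv` module rather than pandas purely to stay dependency-free; the function names are illustrative.

```python
import csv
from pathlib import Path


def load_ratings(path):
    """Read ratings.csv into a list of (user_id, movie_id, rating) tuples.

    Assumption: the header is userId,movieId,rating,timestamp.
    """
    rows = []
    with Path(path).open(newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            rows.append(
                (int(record["userId"]), int(record["movieId"]), float(record["rating"]))
            )
    return rows


def load_movies(path):
    """Read movies.csv into {movie_id: (title, [genres])}.

    Assumption: the header is movieId,title,genres and genres are '|'-separated.
    """
    movies = {}
    with Path(path).open(newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            movies[int(record["movieId"])] = (
                record["title"],
                record["genres"].split("|"),
            )
    return movies
```

With the files in the working directory, `ratings = load_ratings("ratings.csv")` would give the rating triples that the sketches of later steps operate on.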
Environment Requirements
Python Environment
Running the Python implementation requires the following libraries:
- pandas
- numpy
- scikit-learn
- matplotlib
- Jupyter Notebook
Install the required packages with:
pip install pandas numpy scikit-learn matplotlib jupyter
R Environment
For the R implementation, the following R packages are needed:
- tidyverse
- caret
Install them with:
install.packages("tidyverse")
install.packages("caret")
Model Development
The project involves the following steps:
- Split the dataset into training (edx) and validation (final_holdout_test) sets.
- Build and evaluate multiple models to predict movie ratings.
- Visualize results and compare RMSE for each model.
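The splitting step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the 10% hold-out fraction, the seed, and the rule of moving hold-out rows with unseen users or movies back into training are all assumed here, and `split_ratings` is a hypothetical name.

```python
import random


def split_ratings(ratings, holdout_fraction=0.1, seed=1):
    """Split (user_id, movie_id, rating) rows into training and hold-out lists.

    Assumption: hold-out rows whose user or movie does not appear in the
    initial training portion are moved back into training, so every
    hold-out row can later be scored by a model fitted on training data.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)

    n_holdout = int(len(shuffled) * holdout_fraction)
    candidates, train = shuffled[:n_holdout], shuffled[n_holdout:]

    train_users = {user for user, _, _ in train}
    train_movies = {movie for _, movie, _ in train}

    holdout = []
    for row in candidates:
        user, movie, _ = row
        if user in train_users and movie in train_movies:
            holdout.append(row)
        else:
            train.append(row)  # keep rows with unseen users/movies in training
    return train, holdout
```

Under these assumptions, `edx, final_holdout_test = split_ratings(ratings)` would produce the two sets named in the first step.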
Implemented Models
Both Python and R implement the following models:
- Baseline Model: Predict using the average rating of all movies.
  - Python RMSE: 0.9665
  - R RMSE: 1.0425
- Movie Effect Model: Adjust for movie‑specific bias.
  - Python RMSE: 0.9665
  - R RMSE: 0.9617
- Movie + User Effect Model: Adds user‑specific bias to the movie effect.
  - Python RMSE: 0.9071
  - R RMSE: 0.8731
- Regularized Movie + User Effect Model: Applies regularization to prevent overfitting.
  - Python RMSE: 0.8742
  - R RMSE: 0.8527
- Hybrid Model: Combines multiple models (movie effect, user effect, regularization) and incorporates Matrix Factorization (SVD) and K‑Nearest Neighbors (KNN).
  - Python Hybrid Model RMSE: 0.8655
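The first four models can be read as successive terms of one bias model, with a predicted rating of roughly mu + b_movie + b_user. The sketch below shows one common way to estimate those terms, as penalized means of residuals. It is an assumption about the approach rather than the code in Capstone.ipynb or MovieReviews.R, and the names `fit_bias_model`, `predict`, and `lam` are illustrative.

```python
from collections import defaultdict


def fit_bias_model(train, lam=0.0):
    """Fit the global mean, movie effects and user effects.

    train: iterable of (user_id, movie_id, rating) tuples.
    lam:   regularization strength (assumed form: added to the rating count);
           0.0 gives the unregularized movie + user effect model.
    """
    train = list(train)
    mu = sum(rating for _, _, rating in train) / len(train)

    # Movie effect: (shrunken) mean residual of each movie around mu.
    movie_sum, movie_n = defaultdict(float), defaultdict(int)
    for _, movie, rating in train:
        movie_sum[movie] += rating - mu
        movie_n[movie] += 1
    b_movie = {m: movie_sum[m] / (movie_n[m] + lam) for m in movie_sum}

    # User effect: (shrunken) mean residual of each user after the movie effect.
    user_sum, user_n = defaultdict(float), defaultdict(int)
    for user, movie, rating in train:
        user_sum[user] += rating - mu - b_movie[movie]
        user_n[user] += 1
    b_user = {u: user_sum[u] / (user_n[u] + lam) for u in user_sum}

    return mu, b_movie, b_user


def predict(model, user, movie, use_movie=True, use_user=True):
    """Predict one rating; unseen users or movies contribute a zero effect."""
    mu, b_movie, b_user = model
    estimate = mu
    if use_movie:
        estimate += b_movie.get(movie, 0.0)
    if use_user:
        estimate += b_user.get(user, 0.0)
    return estimate
```

Calling `predict` with both flags off reproduces the baseline, with only `use_movie` on the movie effect model, and with both on the movie + user effect model. A positive `lam` shrinks the effects of movies and users with few ratings toward zero; its value would need to be tuned on the training data.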
Visualizations (Python)
The Jupyter Notebook generates several visualizations to analyze data and model performance:
- Rating Distribution: Shows the distribution of ratings, highlighting integer peaks.
- Top 10 Most Rated Movies: Bar chart of movies receiving the most ratings.
- Average Rating by Release Year: Scatter plot of average rating versus movie release year.
- RMSE Comparison: Bar chart comparing RMSE across models.
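The counts behind the first two charts can be computed as in the sketch below. It assumes ratings are held as (user_id, movie_id, rating) tuples, which is an illustrative layout rather than the notebook's actual data structure, and the function names are hypothetical.

```python
from collections import Counter


def rating_distribution(ratings):
    """Count how often each rating value occurs, sorted by rating value."""
    counts = Counter(rating for _, _, rating in ratings)
    return sorted(counts.items())


def most_rated_movies(ratings, top_n=10):
    """Return up to top_n (movie_id, number_of_ratings) pairs, most rated first."""
    counts = Counter(movie for _, movie, _ in ratings)
    return counts.most_common(top_n)
```

Either result can then be unpacked into x and y values and passed to a bar chart call such as matplotlib's `plt.bar`.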
Results Comparison
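Results from Python and R are broadly comparable. Among the models implemented in both environments, the regularized movie + user effect model performs best; in Python, the hybrid model achieves a slightly lower RMSE (0.8655 versus 0.8742):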
| Model | Python RMSE | R RMSE |
|---|---|---|
| Baseline Model | 0.9665 | 1.0425 |
| Movie Effect Model | 0.9665 | 0.9617 |
| Movie + User Effect Model | 0.9071 | 0.8731 |
| Regularized Movie + User Model | 0.8742 | 0.8527 |
| Hybrid Model | 0.8655 | N/A |
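To read the table programmatically, the short sketch below picks the lowest RMSE per environment. The values are copied from the table above; the `results` layout and the `best_model` helper are illustrative.

```python
# RMSE values copied from the results table as (Python, R);
# None marks a model that was not implemented in R.
results = {
    "Baseline Model": (0.9665, 1.0425),
    "Movie Effect Model": (0.9665, 0.9617),
    "Movie + User Effect Model": (0.9071, 0.8731),
    "Regularized Movie + User Model": (0.8742, 0.8527),
    "Hybrid Model": (0.8655, None),
}


def best_model(results, column):
    """Return (model_name, rmse) with the lowest RMSE in a column (0 = Python, 1 = R)."""
    scored = [
        (scores[column], name)
        for name, scores in results.items()
        if scores[column] is not None
    ]
    rmse, name = min(scored)
    return name, rmse


print(best_model(results, 0))  # ('Hybrid Model', 0.8655)
print(best_model(results, 1))  # ('Regularized Movie + User Model', 0.8527)
```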
Conclusion
The project demonstrates the process of building a movie recommendation system using Python and R. The lowest RMSE overall, 0.8527, is achieved by the regularized movie + user effect model in R; within Python, the hybrid model performs best with an RMSE of 0.8655. Future improvements could include more advanced matrix factorization methods or neural networks to further optimize the system.
Source
Organization: github
Created: 9/14/2024