Amazon
Amazon review data includes reviews (rating, text, helpful votes) and product metadata (description, category information, price, brand, and image features). It is available in a 2014 version and an updated 2018 version.
Description
Dataset Overview
Dataset Purpose
This dataset repository contains public data sources for recommender systems (RS). All these recommendation datasets can be converted into the atomic files defined by RecBole, which is a unified, comprehensive, and efficient recommendation library.
Dataset Conversion and Usage
To use a dataset with RecBole, the original data must first be converted into the atomic-file format that RecBole defines. There are two ways to do this:
- Download the original dataset and process it using the conversion tools provided in this repository.
- Directly download the processed atomic files.
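As a rough illustration of the first route, the sketch below turns a few Amazon-style review records (JSON lines) into the text of an interaction file with a tab-separated `field:type` header, which is the convention RecBole's atomic files follow. It is not the repository's conversion tool. The function name `reviews_to_inter`, the choice of columns, and the sample records are assumptions made for this example; the keys `reviewerID`, `asin`, `overall`, and `unixReviewTime` are assumed to be present in each review record.

```python
import json

# Tab-separated header in "field:type" form (assumed column selection).
INTER_HEADER = "user_id:token\titem_id:token\trating:float\ttimestamp:float"


def reviews_to_inter(json_lines):
    """Convert JSON-lines review records into the text of an .inter file.

    Hypothetical helper for illustration; each line is assumed to be one
    JSON object with reviewerID, asin, overall, and unixReviewTime keys.
    """
    rows = [INTER_HEADER]
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append("\t".join([
            str(record["reviewerID"]),
            str(record["asin"]),
            str(float(record["overall"])),
            str(int(record["unixReviewTime"])),
        ]))
    return "\n".join(rows) + "\n"


# Made-up sample records, not taken from the real dataset.
sample = [
    '{"reviewerID": "A1", "asin": "B001", "overall": 5.0, "unixReviewTime": 1400000000}',
    '{"reviewerID": "A2", "asin": "B001", "overall": 3.0, "unixReviewTime": 1400003600}',
]
inter_text = reviews_to_inter(sample)
print(inter_text)
```

In practice the returned text would be written to a file such as `<dataset>.inter`, with item metadata going to a separate file in the same style. The conversion tools in the repository remain the authoritative way to do this.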
Dataset Links and Introduction
Shopping
- Amazon
- The 2014 version includes product reviews and metadata covering 24 categories and 142.8 million reviews.
- The 2018 version includes 29 categories and 233.1 million reviews.
- Amazon_M2
- Contains anonymous customer sessions and product data from six different regions.
- Alibaba-iFashion
- Fashion matching dataset collected from Alibaba's online shopping system.
- Epinions
- Dataset containing user trust relationships collected from Epinions.com.
- Yelp
- Review data from the Yelp website, with multiple versions for 2018, 2020, 2021, and 2022.
- Tmall
- Provided by Ant Financial Services for the IJCAI16 competition.
- DIGINETICA
- User session data extracted from e‑commerce search engine logs.
- YOOCHOOSE
- Dataset built to support the RecSys Challenge 2015.
- Retailrocket
- Dataset collected from a real e‑commerce website.
- Ta Feng
- Transaction data from a Chinese grocery store from Nov 2000 to Feb 2001.
Advertising
- Criteo
- Contains partial traffic data from Criteo.
- Avazu
- Dataset used for the Avazu CTR prediction competition.
- iPinYou
- Training and test data from the iPinYou global RTB bidding competition.
- AliEC
- Dataset on click‑through rate prediction for display ads on Taobao.
Check‑in
- Foursquare
- About 10 months of check‑in data from New York and Tokyo.
- Gowalla
- Check‑in data from Feb 2009 to Oct 2010.
Movies
- MovieLens
- Movie rating dataset collected and provided by GroupLens.
- Netflix
- Official dataset for the Netflix prize competition.
- Douban
- Over 2 million short comments from the Douban movie site.
- Twitch
- Dataset of user consumption of streaming content on Twitch.
Music
- Last.FM
- Social network, tags, and artist listening information for about 2K users of the Last.fm online music system.
- LFM-1b
- Dataset containing over 1 billion music listening events.
- Yahoo Music
- Represents the Yahoo! Music community's preferences for various artists.
- KGRec
- Dataset with users, items, implicit feedback interactions, item tags, and textual descriptions.
- Music4All-Onion
- Extends the Music4All dataset with 26 additional audio, video, and metadata features.
Books
- Book-Crossing
- 278,858 users provided 1,149,780 ratings for 271,379 books.
- GoodReads
- Reviews and various item attributes from the Goodreads website.
Games
- Steam
- Includes Steam reviews and game information, with 7,793,069 reviews, 2,567,538 users, and 32,135 games.
Anime
- Anime
- Anime rating data from MyAnimeList.net users.
Images
- Pinterest
- Dataset for evaluating content‑based image recommendation in social networks.
Jokes
- Jester
- Anonymous joke ratings from the Jester joke recommendation system.
Exercises
- KDD2010
- Student practice submissions from the KDD Cup 2010 education data mining challenge.
- EndoMondo
- Exercise log data from EndoMondo users.
Websites
- Phishing Websites
- 11 features and a label indicating whether each of 11,055 sites is a phishing site.
- Behance
- Small anonymous version with likes and image data from the Behance community.
Adult
- Adult
- Dataset extracted from the 1994 Census database, containing personal attributes and a label indicating whether annual income exceeds $50K.
News
- MIND
- Large dataset collected for news recommendation research, containing about 160K English news articles and over 15 million impression logs generated by 1 million users.
Food
- DianPing
- User reviews and detailed business metadata crawled from the Chinese online review site DianPing.com.
- Food
- Cooking recipes and review text from Food.com.
Beverages
- BeerAdvocate
- Beer reviews with multiple rating dimensions.
- RateBeer
- Beer reviews with multiple rating dimensions.
Clothing
- ModCloth
- Clothing fit measurement data from ModCloth.
- RentTheRunway
- Clothing fit measurement data from RentTheRunway.
Dataset Statistics
| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MovieLens | - | - | - | - | Rating | √ | √ | √ | |
| 2 | Anime | 73,515 | 11,200 | 7,813,737 | 99.05% | Rating | √ | | | |
| 3 | Epinions | 116,260 | 41,269 | 188,478 | 99.99% | Rating | √ | √ | | |
| 4 | Yelp (5 versions) | - | - | - | - | Rating | √ | √ | √ | √ |
| 5 | Netflix | 480,189 | 17,770 | 100,480,507 | 98.82% | Rating | √ | | | |
| 6 | Book-Crossing | 105,284 | 340,557 | 1,149,780 | 99.99% | Rating | √ | √ | | |
| 7 | Jester | 73,421 | 101 | 4,136,360 | 44.22% | Rating | | | | |
| 8 | Douban | 738,701 | 28 | 2,125,056 | 89.73% | Rating | √ | √ | | |
| 9 | Yahoo Music | 1,948,882 | 98,211 | 11,557,943 | 99.99% | Rating | √ | | | |
| 10 | KDD2010 | - | - | - | - | Rating | √ | | | |
| 11 | Amazon (2014 & 2018) | - | - | - | - | Rating | √ | √ | | |
| 12 | | 55,187 | 9,911 | 1,445,622 | 99.74% | - | | | | |
| 13 | Gowalla | 107,092 | 1,280,969 | 6,442,892 | 99.99% | Check-in | √ | √ | | |
| 14 | Last.FM | 1,892 | 17,632 | 92,834 | 99.72% | Click | √ | | | |
| 15 | DIGINETICA | 204,789 | 184,047 | 993,483 | 99.99% | Click | √ | √ | | |
| 16 | Steam | 2,567,538 | 32,135 | 7,793,069 | 99.99% | Buy | √ | √ | √ | |
| 17 | Ta Feng | 32,266 | 23,812 | 817,741 | 99.89% | Click | √ | √ | √ | √ |
| 18 | Foursquare | - | - | - | - | Check-in | √ | √ | | |
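The Sparsity column appears consistent with the usual definition, one minus the fraction of user-item pairs that have an interaction: sparsity = 1 - #Interaction / (#User × #Item). The short sketch below checks this against three rows of the table. The function name `sparsity` is chosen for this example, and the assumption that the table was computed this way is inferred from the numbers rather than stated by the source.

```python
def sparsity(n_users, n_items, n_interactions):
    """Fraction of the user-item matrix with no recorded interaction."""
    return 1.0 - n_interactions / (n_users * n_items)


# (#User, #Item, #Interaction) copied from the statistics table.
rows = {
    "Netflix": (480_189, 17_770, 100_480_507),
    "Jester": (73_421, 101, 4_136_360),
    "Douban": (738_701, 28, 2_125_056),
}

for name, (users, items, interactions) in rows.items():
    print(f"{name}: {sparsity(users, items, interactions):.2%}")
```

For these three rows the computed values agree with the table to two decimal places. For extremely sparse datasets the table shows 99.99%, which suggests the published figures may be truncated rather than rounded.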
Source
Organization: github
Created: 9/22/2020