Amazon

Amazon review data includes reviews (rating, text, helpful votes) and product metadata (description, category information, price, brand, and image features). It is available in a 2014 version and an updated 2018 version.

Updated 5/17/2024

Description

Dataset Overview

Dataset Purpose

This dataset repository contains public data sources for recommender systems (RS). All these recommendation datasets can be converted into the atomic files defined by RecBole, which is a unified, comprehensive, and efficient recommendation library.

Dataset Conversion and Usage

To use a dataset with RecBole, the original data must first be converted into the atomic-file format that RecBole defines. There are two ways to obtain data in that format:

  1. Download the original dataset and process it using the conversion tools provided in this repository.
  2. Directly download the processed atomic files.
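As a rough illustration of the first option, the sketch below converts a small comma-separated ratings file into a tab-separated interaction file. It is not the conversion tool shipped in the repository. It assumes the RecBole atomic-file convention of a header row whose columns are written as `name:type`, and the raw column names (`user`, `item`, `rating`, `time`) and sample rows are hypothetical.

```python
import csv
import io

# Hypothetical raw column name -> typed header in the assumed "name:type" convention.
INTER_FIELDS = [
    ("user", "user_id:token"),
    ("item", "item_id:token"),
    ("rating", "rating:float"),
    ("time", "timestamp:float"),
]


def convert_to_inter(raw_csv, out):
    """Copy rows from a comma-separated source into a tab-separated .inter stream.

    Returns the number of interaction rows written (header excluded).
    """
    reader = csv.DictReader(raw_csv)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([typed for _, typed in INTER_FIELDS])
    count = 0
    for row in reader:
        writer.writerow([row[raw] for raw, _ in INTER_FIELDS])
        count += 1
    return count


# Made-up sample data standing in for a downloaded raw dataset.
raw = io.StringIO(
    "user,item,rating,time\n"
    "A1,B7,5.0,1400000000\n"
    "A2,B7,3.0,1400000500\n"
)
out = io.StringIO()
n_rows = convert_to_inter(raw, out)
inter_text = out.getvalue()
print(inter_text)
```

In practice the two `io.StringIO` objects would be replaced by file handles opened on the downloaded raw file and on the target `.inter` file.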

Dataset Links and Introduction

Shopping

  • Amazon
    • Includes product reviews and metadata in two versions, 2014 and 2018.
    • The 2014 version covers 24 categories and 142.8 million reviews.
    • The 2018 version covers 29 categories and 233.1 million reviews.
  • Amazon_M2
    • Contains anonymous customer sessions and product data from six different regions.
  • Alibaba-iFashion
    • Fashion matching dataset collected from Alibaba's online shopping system.
  • Epinions
    • Dataset containing user trust relationships collected from Epinions.com.
  • Yelp
    • Review data from the Yelp website, with multiple versions for 2018, 2020, 2021, and 2022.
  • Tmall
    • Provided by Ant Financial Services for the IJCAI16 competition.
  • DIGINETICA
    • User session data extracted from e‑commerce search engine logs.
  • YOOCHOOSE
    • Dataset built to support the RecSys Challenge 2015.
  • Retailrocket
    • Dataset collected from a real e‑commerce website.
  • Ta Feng
    • Transaction data from a Chinese grocery store, covering Nov 2000 to Feb 2001.

Advertising

  • Criteo
    • Contains partial traffic data from Criteo.
  • Avazu
    • Dataset used for the Avazu CTR prediction competition.
  • iPinYou
    • Training and test data from the iPinYou global RTB bidding competition.
  • AliEC
    • Dataset on click‑through rate prediction for display ads on Taobao.

Check‑in

  • Foursquare
    • About 10 months of check‑in data from New York and Tokyo.
  • Gowalla
    • Check‑in data from Feb 2009 to Oct 2010.

Movies

  • MovieLens
    • Movie rating dataset collected and provided by GroupLens.
  • Netflix
    • Official dataset for the Netflix prize competition.
  • Douban
    • Over 2 million short comments from the Douban movie site.
  • Twitch
    • Dataset of user consumption of streaming content on Twitch.

Music

  • Last.FM
    • Social network, tags, and artist listening information for 2K users of the Last.fm online music system.
  • LFM-1b
    • Dataset containing over 1 billion music listening events.
  • Yahoo Music
    • Represents the Yahoo! Music community's preferences for various artists.
  • KGRec
    • Dataset with users, items, implicit feedback interactions, item tags, and textual descriptions.
  • Music4All-Onion
    • Extends the Music4All dataset with 26 additional audio, video, and metadata features.

Books

  • Book-Crossing
    • 278,858 users provided 1,149,780 ratings for 271,379 books.
  • GoodReads
    • Reviews and various item attributes from the Goodreads website.

Games

  • Steam
    • Includes Steam reviews and game information, with 7,793,069 reviews, 2,567,538 users, and 32,135 games.

Anime

  • Anime
    • Anime rating data from MyAnimeList.net users.

Images

  • Pinterest
    • Dataset for evaluating content‑based image recommendation in social networks.

Jokes

  • Jester
    • Anonymous joke ratings from the Jester joke recommendation system.

Exercises

  • KDD2010
    • Student practice submissions from the KDD Cup 2010 education data mining challenge.
  • EndoMondo
    • Exercise log data from EndoMondo users.

Websites

  • Phishing Websites
    • 11 features and a label indicating whether each of 11,055 sites is a phishing site.
  • Behance
    • Small anonymous version with likes and image data from the Behance community.

Adult

  • Adult
    • Dataset extracted from the 1994 Census database, containing personal attributes and a label for whether annual income exceeds $50K.

News

  • MIND
    • Large dataset collected for news recommendation research, containing about 160K English news articles and over 15 million impression logs generated by 1M users.

Food

  • DianPing
    • User reviews and detailed business metadata crawled from the Chinese online review site DianPing.com.
  • Food
    • Cooking recipes and review text from Food.com.

Beverages

  • BeerAdvocate
    • Beer reviews with multiple rating dimensions.
  • RateBeer
    • Beer reviews with multiple rating dimensions.

Clothing

  • ModCloth
    • Clothing fit measurement data from ModCloth.
  • RentTheRunway
    • Clothing fit measurement data from RentTheRunway.

Dataset Statistics

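| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|----|---------|-------|-------|--------------|----------|------------------|-----------|--------------|--------------|---------------------|
| 1 | MovieLens | - | - | - | - | Rating | | | | |
| 2 | Anime | 73,515 | 11,200 | 7,813,737 | 99.05% | Rating | | | | |
| 3 | Epinions | 116,260 | 41,269 | 188,478 | 99.99% | Rating | | | | |
| 4 | Yelp (5 versions) | - | - | - | - | Rating | | | | |
| 5 | Netflix | 480,189 | 17,770 | 100,480,507 | 98.82% | Rating | | | | |
| 6 | Book-Crossing | 105,284 | 340,557 | 1,149,780 | 99.99% | Rating | | | | |
| 7 | Jester | 73,421 | 101 | 4,136,360 | 44.22% | Rating | | | | |
| 8 | Douban | 738,701 | 28 | 2,125,056 | 89.73% | Rating | | | | |
| 9 | Yahoo Music | 1,948,882 | 98,211 | 11,557,943 | 99.99% | Rating | | | | |
| 10 | KDD2010 | - | - | - | - | Rating | | | | |
| 11 | Amazon (2014 & 2018) | - | - | - | - | Rating | | | | |
| 12 | Pinterest | 55,187 | 9,911 | 1,445,622 | 99.74% | - | | | | |
| 13 | Gowalla | 107,092 | 1,280,969 | 6,442,892 | 99.99% | Check-in | | | | |
| 14 | Last.FM | 1,892 | 17,632 | 92,834 | 99.72% | Click | | | | |
| 15 | DIGINETICA | 204,789 | 184,047 | 993,483 | 99.99% | Click | | | | |
| 16 | Steam | 2,567,538 | 32,135 | 7,793,069 | 99.99% | Buy | | | | |
| 17 | Ta Feng | 32,266 | 23,812 | 817,741 | 99.89% | Click | | | | |
| 18 | Foursquare | - | - | - | - | Check-in | | | | |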
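The Sparsity figures in the statistics above appear consistent with the usual definition, one minus the number of interactions divided by the number of possible user-item pairs. The source does not state the formula, so the sketch below is an assumption, checked against three rows whose counts are given (Anime, Netflix, Jester).

```python
def sparsity(n_users, n_items, n_interactions):
    """Fraction of the user-item matrix with no recorded interaction.

    Assumed definition: 1 - #Interaction / (#User * #Item).
    """
    return 1.0 - n_interactions / (n_users * n_items)


# (#User, #Item, #Interaction) copied from the statistics for three datasets.
rows = {
    "Anime": (73_515, 11_200, 7_813_737),
    "Netflix": (480_189, 17_770, 100_480_507),
    "Jester": (73_421, 101, 4_136_360),
}

results = {name: sparsity(*counts) for name, counts in rows.items()}
for name, value in results.items():
    print(f"{name}: {value:.2%}")
```

For rows listed at 99.99%, the same formula can give a value that rounds to 100.00% at two decimals, which suggests the published figures were truncated rather than rounded.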

Topics

E-commerce
User Review Analysis

Source

Organization: github

Created: 9/22/2020
