Johnnyeee/Yelpdata_663
The Yelp Restaurant dataset primarily comprises user reviews, business ratings, and operational details from the Yelp platform, with a focus on the restaurant category. After processing, it is split into training and test sets, suitable for sentiment analysis, rating prediction, business analytics, and recommendation system tasks. The dataset includes multiple JSON files recording business information, user check‑ins, user reviews, user tips, and user data. Processed data contain business location, rating, review count, operating hours, and the textual content and ratings of user reviews.
Description
Dataset Card for Yelp Restaurant Dataset
Dataset Description
Original Dataset Overview
Yelp's original data contain rich information from the Yelp platform, detailing user reviews, business ratings, and operational details. Specifically, five distinct JSON datasets are provided:
- yelp_academic_dataset_business.json (118.9 MB): Business information such as name, address, city, state, zip code, latitude, longitude, stars (average rating), review count, categories (e.g., restaurants, shopping), and other attributes.
- yelp_academic_dataset_checkin.json (287 MB): User check-in data at businesses, including business ID and timestamps showing when users checked in.
- yelp_academic_dataset_review.json (5.34 GB): User reviews of businesses; each review includes user ID, business ID, star rating (1-5), useful/funny/cool votes, review text, and review date.
- yelp_academic_dataset_tip.json (180.6 MB): User tips for businesses, often containing suggestions, compliments, or advice for future customers.
- yelp_academic_dataset_user.json (3.36 GB): User information, including user ID, name, number of reviews, join date, friends list, useful/funny/cool vote counts, fans, and average stars.
Language
The Yelp dataset is primarily in English, covering review text, business information, and user interactions.
Data Processing
In this project we use only yelp_academic_dataset_business.json and yelp_academic_dataset_review.json, focusing on restaurant data. Processing steps:
- Load the two JSON files into pandas DataFrames.
- Perform an inner join on business_id and filter out non-restaurant entries (i.e., rows where categories does not contain "restaurants").
- Randomly shuffle the Yelp restaurant dataset and split it 80/20 into training and test sets.
- The final outputs are yelptrain_data.parquet (3,778,658 rows, 2.26 GB) and yelptest_data.parquet (943,408 rows, 591 MB).
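The processing steps above can be sketched with pandas roughly as follows. This is a minimal illustration on toy data, not the authors' actual script: the helper name build_restaurant_split, the seed, the case-insensitive match on "restaurants", and the rounding of the test size are all assumptions. One detail it does make visible is where the stars_x and stars_y column names come from: both input tables have a stars column, and pandas appends its default _x and _y suffixes when merging.

```python
import pandas as pd


def build_restaurant_split(business, review, test_frac=0.2, seed=0):
    """Hypothetical helper: join, keep restaurants, shuffle, split."""
    # Inner join on business_id. Both frames carry a "stars" column, so
    # pandas renames them stars_x (business side) and stars_y (review side).
    merged = business.merge(review, on="business_id", how="inner")

    # Keep rows whose categories mention restaurants (assumed to be
    # case-insensitive); rows with missing categories are dropped.
    is_restaurant = merged["categories"].str.contains(
        "restaurants", case=False, na=False
    )
    restaurants = merged[is_restaurant]

    # Shuffle all rows, then cut off the first test_frac as the test set.
    shuffled = restaurants.sample(frac=1.0, random_state=seed)
    shuffled = shuffled.reset_index(drop=True)
    n_test = int(round(len(shuffled) * test_frac))
    test = shuffled.iloc[:n_test].reset_index(drop=True)
    train = shuffled.iloc[n_test:].reset_index(drop=True)
    return train, test


# Toy stand-ins for the two Yelp files. The real files are JSON Lines and
# could be read with pd.read_json(path, lines=True).
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "stars": [3.0, 4.5],
    "categories": ["Restaurants, Food", "Shopping"],
})
review = pd.DataFrame({
    "review_id": ["r0", "r1", "r2", "r3", "r4", "r5"],
    "business_id": ["b1", "b1", "b1", "b1", "b1", "b2"],
    "stars": [5.0, 4.0, 3.0, 2.0, 1.0, 5.0],
})

train, test = build_restaurant_split(business, review)
# The splits could then be written with train.to_parquet(...) and
# test.to_parquet(...), which requires a parquet engine such as pyarrow.
```

At the scale of the real review file (5.34 GB), loading everything into memory at once may not be practical, so a chunked read is likely to be needed.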
Restaurant Dataset
Overview
- yelptrain_data.parquet: Detailed business information such as location, rating, and customer reviews. Contains 3,778,658 rows, size 2.26 GB.
- yelptest_data.parquet: Same schema as the training set, with 943,408 rows, size 591 MB.
Expected Tasks
- Sentiment Analysis: Analyze text reviews to gauge customer sentiment, classifying opinions as positive, negative, or neutral.
- Rating Prediction: Machine‑learning models can predict potential ratings based on user and business attributes, aiding understanding of factors influencing customer satisfaction.
- Business Analytics: Examine performance metrics like average rating, review count, and operational status to provide insights into market position and customer perception.
- Recommendation System: Data can feed recommendation algorithms that suggest businesses to users based on preferences, past ratings, and similar user behaviour.
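For the sentiment analysis task, the dataset has no sentiment column, so labels have to be derived. One common convention, assumed here and not prescribed by the dataset, is to map the per-review rating stars_y to three classes. The function name stars_to_sentiment and the thresholds are illustrative choices.

```python
def stars_to_sentiment(stars):
    """Map a review's star rating (stars_y) to a coarse sentiment label.

    Assumed thresholds: below 3 is negative, exactly 3 is neutral,
    above 3 is positive.
    """
    if stars < 3:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Ratings are stored as floats in the dataset (stars_y is float64).
labels = [stars_to_sentiment(s) for s in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

With a pandas DataFrame, the same mapping could be applied as df["stars_y"].map(stars_to_sentiment).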
Dataset Structure
Variables
- business_id: Unique identifier for each business (non-null, object).
- name: Business name (non-null, object).
- address: Street address (non-null, object).
- city: City (non-null, object).
- state: State or region (non-null, object).
- postal_code: Postal code (non-null, object).
- latitude: Latitude coordinate (non-null, float64).
- longitude: Longitude coordinate (non-null, float64).
- stars_x: Average star rating (non-null, float64).
- review_count: Number of reviews (non-null, int64).
- is_open: Binary flag indicating if the business is open (1 = open, 0 = closed) (non-null, int64).
- attributes: Dictionary of business attributes such as "accepts credit cards", "parking", "Wi-Fi", etc. (contains missing values, object).
- categories: Business categories (e.g., "restaurants", "food", "coffee & tea") (non-null, object).
- hours: Opening hours (contains missing values, object).
- review_id: Unique identifier for each review (non-null, object).
- user_id: Unique identifier for the reviewing user (non-null, object).
- stars_y: Star rating given by the user (non-null, float64).
- useful: Number of users who found the review useful (non-null, int64).
- funny: Number of users who found the review funny (non-null, int64).
- cool: Number of users who found the review cool (non-null, int64).
- text: Full review text (non-null, object).
- date: Review date (non-null, object).
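Because each row is one review joined to its business, the business-level variables repeat on every review of the same business. For business analytics it can help to recover a one-row-per-business table. The sketch below is one way to do that; the constant BUSINESS_COLS and the function business_table are hypothetical names, and the column list is taken from the variables above.

```python
import pandas as pd

# Business-level columns from the variable list (hypothetical constant).
BUSINESS_COLS = [
    "business_id", "name", "address", "city", "state", "postal_code",
    "latitude", "longitude", "stars_x", "review_count", "is_open",
    "attributes", "categories", "hours",
]


def business_table(df):
    """Return one row per business from the review-level frame."""
    cols = [c for c in BUSINESS_COLS if c in df.columns]
    # De-duplicate on business_id only, so unhashable dict-valued
    # columns such as attributes do not take part in the comparison.
    return df[cols].drop_duplicates(subset="business_id").reset_index(drop=True)


# Toy review-level frame: two reviews of b1 and one review of b2.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "name": ["Diner A", "Diner A", "Cafe B"],
    "stars_x": [3.0, 3.0, 4.5],
    "attributes": [{"WiFi": "free"}, {"WiFi": "free"}, None],
    "review_id": ["r1", "r2", "r3"],
    "stars_y": [2.0, 4.0, 5.0],
})

businesses = business_table(reviews)
# Mean of the per-review ratings, for comparison with stars_x.
mean_review_stars = reviews.groupby("business_id")["stars_y"].mean()
```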
Sample Record
{
business_id: XQfwVwDr-v0ZS3_CbbE5Xw,
name: Turning Point of North Wales,
address: 1460 Bethlehem Pike,
city: North Wales,
state: PA,
postal_code: 19454,
latitude: 40.21019744873047,
longitude: -75.22364044189453,
stars_x: 3.0,
review_count: 169.0,
is_open: 1.0,
categories: Restaurants, Breakfast & Brunch, Food, Juice Bars & Smoothies, American (New), Coffee & Tea, Sandwiches,
hours: {"Monday": "7:30-15:0", "Tuesday": "7:30-15:0", "Wednesday": "7:30-15:0", "Thursday": "7:30-15:0", "Friday": "7:30-15:0", "Saturday": "7:30-15:0", "Sunday": "7:30-15:0"},
review_id: KU_O5udG6zpxOg-VcAEodg,
user_id: mh_-eMZ6K5RLWhZyISBhwA,
stars_y: 3.0,
useful: 0.0,
funny: 0.0,
cool: 0.0,
text: "If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to its other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.",
date: 2018-07-07 22:09:11,
attributes: {"NoiseLevel": "uaverage", "HasTV": "False", "RestaurantsAttire": "casual", "BikeParking": "False", "Ambience": "{touristy: False, hipster: False, romantic: False, divey: False, intimate: False, trendy: False, upscale: False, classy: False, casual: True}", "WiFi": "free", "DogsAllowed": "False", "Alcohol": "none", "BusinessAcceptsCreditCards": "True", "RestaurantsGoodForGroups": "True", "RestaurantsPriceRange2": "2", "RestaurantsReservations": "False", "WheelchairAccessible": "True", "BusinessAcceptsBitcoin": "False", "RestaurantsTableService": "True", "GoodForKids": "True", "Caters": "False", "HappyHour": "False", "RestaurantsDelivery": "True", "GoodForMeal": "{dessert: False, latenight: False, lunch: True, dinner: False, brunch: True, breakfast: True}", "OutdoorSeating": "True", "RestaurantsTakeOut": "True", "BusinessParking": "{garage: False, street: False, validated: False, lot: True, valet: False}"}
}
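The hours field in the sample record uses unpadded times such as "7:30-15:0". A small parser can turn these into minutes since midnight for easier comparison. This is a sketch under assumptions: the function names are hypothetical, and the field is assumed to arrive either as a JSON string or as an already-decoded dict, which should be checked against the actual parquet files.

```python
import json


def _to_minutes(hhmm):
    """Convert an unpadded 'H:M' string such as '7:30' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_hours(hours):
    """Return {day: (open_minutes, close_minutes)} for one record's hours field.

    Accepts a JSON string or a dict (assumption); missing values give {}.
    """
    if not hours:
        return {}
    if isinstance(hours, str):
        hours = json.loads(hours)
    parsed = {}
    for day, span in hours.items():
        open_str, close_str = span.split("-")
        parsed[day] = (_to_minutes(open_str), _to_minutes(close_str))
    return parsed


# Monday entry taken from the sample record above.
monday = parse_hours('{"Monday": "7:30-15:0"}')
```

Here "7:30-15:0" becomes (450, 900), i.e. open from 07:30 to 15:00.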
Source
Organization: hugging_face
Created: Unknown