High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

new_york_citibike

This public dataset contains two BigQuery tables; the table used here is `citibike_trips`, which holds over 58 million records. The `tripduration` field gives the duration of each bike rental in seconds; the other fields serve as potential features.

github

bodyfat dataset

Health Monitoring

Data Analysis

The dataset records body-fat percentage, age, weight, height, and ten body-circumference measurements (e.g., waist) for 252 male subjects. Body-fat percentage, a health indicator, can be estimated accurately by underwater weighing. Multivariate regression on these measurements allows body fat to be predicted using only a scale and a measuring tape, a convenient way to estimate body fat in men.
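
As a rough illustration of the multivariate regression mentioned above, the following Python sketch fits an ordinary least-squares model by solving the normal equations. The two predictors (`weight_kg`, `waist_cm`), the coefficients, and every number below are made up for illustration and are not taken from the dataset; a real analysis would load the 252 records and would more likely rely on a numerical library.

```python
def fit_least_squares(rows, targets):
    """Fit y ~ b0 + b1*x1 + ... + bk*xk by solving (X^T X) b = X^T y."""
    # Prepend a constant 1.0 to every row so b0 acts as the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def predict(coefficients, row):
    """Evaluate the fitted linear model on one row of predictors."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Hypothetical (weight_kg, waist_cm) pairs; NOT values from the dataset.
measurements = [(70, 80), (80, 95), (90, 90), (65, 85), (100, 110)]
# Targets generated from a made-up linear rule so the fit can be checked.
true_coefficients = [-30.0, -0.1, 0.6]
body_fat = [predict(true_coefficients, m) for m in measurements]

coefficients = fit_least_squares(measurements, body_fat)
```

Because the synthetic targets follow an exact linear rule, the fit should recover `true_coefficients` up to floating-point error; with real measurements the residuals would be nonzero and should be inspected.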

github

Brazilian E-Commerce Public Dataset by Olist

E-commerce

Data Analysis

The Brazilian E‑Commerce Public Dataset by Olist contains order information from 2016‑2018 across multiple marketplaces in Brazil, with 100,000 orders. Features allow multi‑dimensional analysis of orders, including status, price, payment, shipping performance, customer location, product attributes, and customer reviews. A geographic dataset with latitude‑longitude coordinates linked to Brazilian postal codes is also provided.

github

foursquare-dataset

Location Recommendation

Data Analysis

This dataset was collected for the TREC 2016 Contextual Suggestion Track and contains 228,778 points of interest such as parks, restaurants, and museums. Stored in JSON format, the data were gathered between January 30, 2017 and February 22, 2017. It is intended for research use only and must be cited according to the associated papers.

github

torchgeo/skippd

Solar Energy Prediction

Data Analysis

A dataset of sky images and photovoltaic power generation from 2017-2019, intended for short-term solar forecasting (Stanford benchmark).

hugging_face

United States of America Gun violence Dataset

Gun Violence

Data Analysis

This report provides a detailed analysis of US gun violence data collected from 2013 to 2018, aiming to better understand the hazards of US gun culture. The analysis integrates information on age groups, gender, states, locations, as well as socioeconomic data such as population, per‑capita income, and unemployment rates to predict the most dangerous and safest states. Additionally, the report attempts to forecast which months and weekdays are more dangerous or safer for citizens, generating a risk score to predict the safest month, day, and state.

github

Global Terrorism Dataset

Terrorism

Data Analysis

The Global Terrorism Dataset contains data on over 181,000 terrorist attacks worldwide. In addition to city and country information, it provides the latitude and longitude of each incident, giving precise location data useful for visualization. Its features support analyses such as assessing attack intensity by year and region, detecting temporal or geographic trends, and examining relationships between attack characteristics and success or failure rates. Such analysis can help governmental organizations make informed decisions to enhance public safety and prepare for potential attacks.

github

Polyvore Dataset

Fashion Matching

Data Analysis

This dataset includes 21,889 outfits from polyvore.com for training, validation, and testing. Each outfit contains name, view count, clothing items, image URL, likes, upload date, and description. The dataset also provides question‑answer pairs for evaluating fill‑in‑the‑blank fashion recommendation tasks, as well as data for fashion compatibility prediction tasks.

github

HuggingFaceFW/fineweb-edu

Educational Content

Data Analysis

The FineWeb-Edu dataset is a selection of educational web pages from the FineWeb dataset, containing 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3-70B-Instruct, was used to select high-quality educational content. The dataset aims to provide high-quality educational data for language-model training.

hugging_face

LimYeri/LeetCode_with_Solutions

Programming Problem Repository

Data Analysis

The LeetCode with Solutions dataset contains solutions to LeetCode problems. Its features include problem ID, problem content, title slug, tags, difficulty level, problem hints, and content. It consists of a single training split with 34,903 examples (119,458,837 bytes). It is intended for text generation tasks, is primarily in English, and carries the tag 'code'.

hugging_face

subsplease_animes

Anime

Data Analysis

This is an integrated anime database combining data from subsplease, MyAnimeList, and Nyaa.si. Users can discover the most popular anime and those with reliable torrent magnet links. The database updates daily and includes 770 anime titles and a total of 11,137 episodes, each with detailed information such as ID, title, type, episode count, status, rating, Nyaa search link, magnet links, seed count, download count, and last update time.

hugging_face

Ford GoBike Trip dataset

Bike Sharing

Data Analysis

The Ford GoBike Trip dataset contains information on individual rides from a bike‑sharing system, covering the San Francisco Bay Area and surrounding regions. Each trip is anonymized and includes trip duration (seconds), start time and date, end time and date, start station ID, start station name, start station latitude, start station longitude, end station ID, end station name, end station latitude, end station longitude, bike ID, user type (subscriber or customer), member birth year, and member gender.
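
The start and end station coordinates listed above make it possible to derive per-trip features. As a minimal sketch, the great-circle (haversine) distance between two stations could be computed as below; the function names, the mean Earth radius constant, and the sample coordinates are illustrative assumptions and not part of the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def average_speed_kmh(distance_km, duration_seconds):
    """Straight-line average speed; trip duration is given in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return distance_km / (duration_seconds / 3600.0)


# Hypothetical start and end coordinates, chosen only for illustration.
distance = haversine_km(37.7749, -122.4194, 37.8044, -122.2712)
speed = average_speed_kmh(distance, 1800)
```

The straight-line distance underestimates the distance actually ridden, so any speed derived from it is a lower bound rather than a measurement.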

github

EATD-Corpus

Mental Health

Data Analysis

EATD-Corpus is a dataset of audio and text files from 162 volunteers who received counseling. The training set contains data from 83 volunteers (19 depressed and 64 non‑depressed), and the validation set contains data from 79 volunteers (11 depressed and 68 non‑depressed). Each folder contains a volunteer’s depression data, including raw audio, preprocessed audio, audio transcripts, and depression scores.
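
The split sizes above imply a noticeable class imbalance. The short Python sketch below only restates the counts from the description and derives the share of depressed volunteers per split; the dictionary layout and function name are assumptions made for illustration.

```python
# Volunteer counts as stated in the description above.
SPLITS = {
    "train": {"depressed": 19, "non_depressed": 64},
    "validation": {"depressed": 11, "non_depressed": 68},
}


def split_summary(splits):
    """Return {split: (total_volunteers, depressed_fraction)}."""
    summary = {}
    for name, counts in splits.items():
        total = sum(counts.values())
        summary[name] = (total, counts["depressed"] / total)
    return summary


summary = split_summary(SPLITS)
total_volunteers = sum(total for total, _ in summary.values())

for name, (total, fraction) in summary.items():
    print(f"{name}: {total} volunteers, {fraction:.1%} depressed")
```

The totals come to 83 and 79 volunteers (162 overall), with roughly 23% and 14% depressed respectively, so a model trained on this data may need class weighting or resampling.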

github

CS:GO Pro Matches Comprehensive Dataset

Esports

Data Analysis

The dataset comprises all professional CS:GO matches from 2012 to 2023, totaling 126,872 matches, each with 155 distinct data points.

github

Call-Center-Dataset

Call Center

Data Analysis

This dataset contains call‑center performance data analyzed with Power BI. It provides key performance indicators (KPIs), call volume trends, and agent performance insights to help stakeholders understand operational efficiency, identify improvement areas, and make data‑driven decisions.

github

Indian Patent Dataset

Patent Data

Data Analysis

The Indian Patent Dataset provides detailed information on all patent applications submitted in India in 2010, 2011, and 2019, including application number, title, filing date, inventor and applicant information, patent status, etc. This dataset aims to offer valuable insights into the Indian patent landscape for researchers, policy makers, businesses, and academia, supporting research and analysis, policy decision‑making, business intelligence, legal compliance, and scholarly research.

github

Tableau Data Visualization Projects

Data Visualization

Data Analysis

The dataset contains various data files for Tableau data‑visualization projects, providing insights generated through interactive dashboards and analytical reports.

github

FiveThirtyEight Food Frequency Questionnaire

Food Intake

Data Analysis

FiveThirtyEight readers' responses to the Food Frequency Questionnaire dataset, used for research and analysis of food intake frequency.

github

oo00spy00oo/twitter_dataset_1717762741

Social Media

Data Analysis

This dataset includes features such as tweet content, user name, and user ID, and is intended for model training. Its download size is 2,160 bytes, but the actual data size is 0 bytes.

hugging_face

video-ads-dataset

Video Advertising

Data Analysis

This dataset contains research data on YouTube video advertisement consumption, specifically including video‑ad impressions, ad‑API information, watch‑API information, and daily view‑count time‑series for ads.

github

CityBench-CityData

Urban Data

Data Analysis

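This is the CityData dataset covering 13 existing cities, used for the CityBench research. To use it, download and unzip the zip file, then place the extracted `citydata` folder under the `CityBench` directory.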

hugging_face

maomlab/AqSolDB

Chemistry

Data Analysis

AqSolDB, created by the Autonomous Materials Discovery (AMD) research group, contains aqueous solubility data for 9,982 unique compounds aggregated from nine publicly available solubility datasets. It is the largest publicly accessible dataset of its kind, serving both as a valuable reference for measured solubility and as improved, generalizable training data for data-driven models. The dataset provides 2D descriptors for compounds, with standardized and validated molecular representations and reliability labels.

hugging_face

Johnnyeee/Yelpdata_663

Restaurant Reviews

Data Analysis

The Yelp Restaurant dataset primarily comprises user reviews, business ratings, and operational details from the Yelp platform, with a focus on the restaurant category. After processing, it is split into training and test sets, suitable for sentiment analysis, rating prediction, business analytics, and recommendation system tasks. The dataset includes multiple JSON files recording business information, user check‑ins, user reviews, user tips, and user data. Processed data contain business location, rating, review count, operating hours, and the textual content and ratings of user reviews.

hugging_face

PIX Payment Transaction Dataset

Payment Systems

Data Analysis

The dataset consists of historical transaction records from a POS system after automation, focusing on PIX payment transactions. It is intended for analyzing transaction behavior, detecting anomalies and trends, and providing visual insights via heatmaps.

github

LAMBDA

Video Advertising

Data Analysis

This dataset is primarily used for analyzing and evaluating the effectiveness of video advertisements. It includes video identifiers (video_id), recall scores (recall_score), YouTube video IDs (youtube_id), and ad details (ad_details). The ad_details field is a structured feature containing sub‑features such as Audio, Brand, Duration, etc. The dataset is split into a training set (1,964 samples) and a test set (219 samples). Total size is 5,707,189 bytes, with a download size of 2,281,142 bytes.

hugging_face

Adult Data Set

Income Prediction

Data Analysis

This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.

github

Hiring Decision Analysis Dataset

Hiring Decision

Data Analysis

The dataset contains multiple variables related to recruitment decisions, such as age, gender, education level, work experience, number of previous employers, distance to the company, interview score, skill score, personality score, and recruitment strategy. The target variable is the recruitment decision, classified as hired or not hired.

github

fabul0us/football_odds_2023-24

Football Odds

Data Analysis

This dataset contains football betting odds for the 2023-24 season, covering the Champions League, Europa League, and the five major national leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1). For each match, odds information includes 1X2, double chance, over/under, 2.5-goal lines, etc., with corresponding vigorish values computed. Odds were collected multiple times in the days preceding each match, but real-time odds were not captured. Additionally, pre-match odds for each competition were collected repeatedly from the start to the end of the competition.
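
The description does not say how the vigorish values were computed. One common definition, sketched below in Python under the assumption that the odds are in decimal format, is the overround: the sum of the implied probabilities of all outcomes in a market, minus one. The function names and sample odds are hypothetical.

```python
def vigorish(decimal_odds):
    """Overround of one market: sum of implied probabilities minus 1.

    `decimal_odds` must list the odds of every mutually exclusive outcome
    (for a 1X2 market: home win, draw, away win).
    """
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must be greater than 1")
    return sum(1.0 / o for o in decimal_odds) - 1.0


def fair_probabilities(decimal_odds):
    """Implied probabilities rescaled to sum to 1 (proportional de-vigging)."""
    implied = [1.0 / o for o in decimal_odds]
    total = sum(implied)
    return [p / total for p in implied]


# Hypothetical 1X2 odds, not taken from the dataset.
example_odds = [2.10, 3.40, 3.60]
margin = vigorish(example_odds)
probabilities = fair_probabilities(example_odds)
```

Proportional rescaling is only one of several ways to remove the margin; the dataset's own vigorish column may follow a different convention, so the two should be compared before mixing them.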

hugging_face

ames_iowa_housing

Real Estate

Data Analysis

This dataset contains information on residential properties sold in Ames, Iowa, USA, from 2006 to 2010, compiled by Dean De Cock. It serves as an educational resource to replace the older Boston Housing dataset. Detailed documentation is available in `./originals/DataDocumentation.txt`; structured feature metadata are manually extracted into `./features.json`. The primary data file is `AmesHousing.csv`, a lightly pre‑processed version of the original data.

hugging_face

NEMSIS Dataset

Emergency Medical Services

Data Analysis

This project uses Emergency Medical Services (EMS) data from the National Emergency Medical Services Information System (NEMSIS) to evaluate data imputation methods, focusing on MICE and MissForest, in order to identify predictors of ICU cardiac arrest outcomes, particularly with respect to urban versus rural settings.

github
