Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category index
Showing 30 of 30 datasets
Category: Data Analysis

new_york_citibike

Bike Sharing | Data Analysis

This public dataset contains two BigQuery tables; the table used is `citibike_trips`, which holds over 58 million records. The `tripduration` field gives the duration of each bike rental in seconds; the other fields serve as potential features.

Source: github | Updated Jun 27, 2024 | 341 views | Linked

bodyfat dataset

Health Monitoring | Data Analysis

The dataset records body‑fat percentage, age, weight, height, and ten body‑circumference measurements (e.g., waist) for 252 male subjects. Body‑fat, a health indicator, is accurately estimated via underwater weighing. By applying multivariate regression, body‑fat can be predicted using only a scale and a measuring tape, providing a convenient method for estimating male body‑fat.

Source: github | Updated Nov 19, 2021 | 305 views | Linked

Brazilian E-Commerce Public Dataset by Olist

E-commerce | Data Analysis

The Brazilian E‑Commerce Public Dataset by Olist contains order information from 2016‑2018 across multiple marketplaces in Brazil, with 100,000 orders. Features allow multi‑dimensional analysis of orders, including status, price, payment, shipping performance, customer location, product attributes, and customer reviews. A geographic dataset with latitude‑longitude coordinates linked to Brazilian postal codes is also provided.

Source: github | Updated May 13, 2024 | 447 views | Linked

foursquare-dataset

Location Recommendation | Data Analysis

This dataset was collected for the TREC 2016 Contextual Suggestion Track and contains 228,778 points of interest such as parks, restaurants, and museums. Stored in JSON format, the data were gathered between January 30, 2017 and February 22, 2017. It is intended for research use only and must be cited according to the associated papers.

Source: github | Updated Apr 7, 2023 | 323 views | Linked

torchgeo/skippd

Solar Energy Prediction | Data Analysis

2017‑2019 sky images and photovoltaic power generation dataset for short‑term solar forecasting (Stanford benchmark).

Source: hugging_face | Updated Jun 18, 2024 | 172 views | Linked

United States of America Gun violence Dataset

Gun Violence | Data Analysis

This report provides a detailed analysis of US gun violence data collected from 2013 to 2018, aiming to better understand the hazards of US gun culture. The analysis integrates information on age groups, gender, states, locations, as well as socioeconomic data such as population, per‑capita income, and unemployment rates to predict the most dangerous and safest states. Additionally, the report attempts to forecast which months and weekdays are more dangerous or safer for citizens, generating a risk score to predict the safest month, day, and state.

Source: github | Updated Aug 7, 2020 | 159 views | Linked

Global Terrorism Dataset

Terrorism | Data Analysis

The Global Terrorism Dataset contains data on over 181,000 terrorist attacks worldwide. In addition to city and country information, the dataset provides latitude and longitude for each incident, offering precise location data useful for visualization. With these features, analysts can assess attack intensity by year and region, detect temporal or geographic trends, and examine relationships between attack characteristics and success or failure rates. Such analysis can help governmental organizations make informed decisions to enhance public safety and prepare for potential attacks.

Source: github | Updated May 26, 2020 | 191 views | Linked

Polyvore Dataset

Fashion Matching | Data Analysis

This dataset includes 21,889 outfits from polyvore.com for training, validation, and testing. Each outfit contains name, view count, clothing items, image URL, likes, upload date, and description. The dataset also provides question‑answer pairs for evaluating fill‑in‑the‑blank fashion recommendation tasks, as well as data for fashion compatibility prediction tasks.

Source: github | Updated May 8, 2024 | 850 views | Linked

HuggingFaceFW/fineweb-edu

Educational Content | Data Analysis

The FineWeb‑Edu dataset is a selection of educational web pages from the FineWeb dataset, containing 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3‑70B‑Instruct, was used to select high‑quality educational content. The dataset aims to provide high‑quality educational data for language‑model training.

Source: hugging_face | Updated Oct 11, 2024 | 1,717 views | Linked

LimYeri/LeetCode_with_Solutions

Programming Problem Repository | Data Analysis

The LeetCode with Solutions dataset contains solutions to LeetCode problems. Features include problem ID, problem content, title slug, tags, difficulty level, problem hints, and solution content. The dataset has a single training split of 34,903 examples (119,458,837 bytes). It is intended for text generation tasks, is primarily in English, and carries the tag 'code'.

Source: hugging_face | Updated Apr 12, 2024 | 119 views | Linked

subsplease_animes

Anime | Data Analysis

This is an integrated anime database combining data from subsplease, MyAnimeList, and Nyaa.si. Users can discover the most popular anime and those with reliable torrent magnet links. The database updates daily and includes 770 anime titles and a total of 11,137 episodes, each with detailed information such as ID, title, type, episode count, status, rating, Nyaa search link, magnet links, seed count, download count, and last update time.

Source: huggingface | Updated Jul 19, 2024 | 278 views | Linked

Ford GoBike Trip dataset

Bike Sharing | Data Analysis

The Ford GoBike Trip dataset contains information on individual rides from a bike‑sharing system, covering the San Francisco Bay Area and surrounding regions. Each trip is anonymized and includes trip duration (seconds), start time and date, end time and date, start station ID, start station name, start station latitude, start station longitude, end station ID, end station name, end station latitude, end station longitude, bike ID, user type (subscriber or customer), member birth year, and member gender.

Source: github | Updated Dec 16, 2020 | 129 views | Linked

EATD-Corpus

Mental Health | Data Analysis

EATD-Corpus is a dataset of audio and text files from 162 volunteers who received counseling. The training set contains data from 83 volunteers (19 depressed and 64 non‑depressed), and the validation set contains data from 79 volunteers (11 depressed and 68 non‑depressed). Each folder contains a volunteer’s depression data, including raw audio, preprocessed audio, audio transcripts, and depression scores.

Source: github | Updated Jul 10, 2023 | 1,042 views | Linked

CS:GO Pro Matches Comprehensive Dataset

Esports | Data Analysis

The dataset comprises all professional CS:GO matches from 2012 to 2023, totaling 126,872 matches, each with 155 distinct data points.

Source: github | Updated Feb 1, 2024 | 831 views | Linked

Call-Center-Dataset

Call Center | Data Analysis

This dataset contains call‑center performance data analyzed with Power BI. It provides key performance indicators (KPIs), call volume trends, and agent performance insights to help stakeholders understand operational efficiency, identify improvement areas, and make data‑driven decisions.

Source: github | Updated Aug 30, 2024 | 416 views | Linked

Indian Patent Dataset

Patent Data | Data Analysis

The Indian Patent Dataset provides detailed information on all patent applications submitted in India in 2010, 2011, and 2019, including application number, title, filing date, inventor and applicant information, patent status, etc. This dataset aims to offer valuable insights into the Indian patent landscape for researchers, policy makers, businesses, and academia, supporting research and analysis, policy decision‑making, business intelligence, legal compliance, and scholarly research.

Source: github | Updated Apr 28, 2024 | 245 views | Linked

Tableau Data Visualization Projects

Data Visualization | Data Analysis

The dataset contains various data files for Tableau data‑visualization projects, providing insights generated through interactive dashboards and analytical reports.

Source: github | Updated Nov 3, 2024 | 220 views | Linked

FiveThirtyEight Food Frequency Questionnaire

Food Intake | Data Analysis

This dataset contains FiveThirtyEight readers' responses to a Food Frequency Questionnaire and is used for research and analysis of food intake frequency.

Source: github | Updated Apr 12, 2024 | 318 views | Linked

oo00spy00oo/twitter_dataset_1717762741

Social Media | Data Analysis

This dataset includes multiple features such as tweet content, user name, user ID, etc., suitable for training models. The download size of the dataset is 2160 bytes, but the actual data size is 0 bytes.

Source: hugging_face | Updated Jun 7, 2024 | 95 views | Linked

video-ads-dataset

Video Advertising | Data Analysis

This dataset contains research data on YouTube video advertisement consumption, specifically including video‑ad impressions, ad‑API information, watch‑API information, and daily view‑count time‑series for ads.

Source: github | Updated May 15, 2023 | 203 views | Linked

CityBench-CityData

Urban Data | Data Analysis

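This is the CityData dataset covering 13 existing cities, used for CityBench research. To use it, download and unzip the zip file, then place the extracted `citydata` folder under the `CityBench` directory.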

Source: huggingface | Updated Dec 22, 2024 | 201 views | Linked

maomlab/AqSolDB

Chemistry | Data Analysis

AqSolDB, created by the Autonomous Materials Discovery (AMD) research group, contains aqueous solubility data for 9,982 unique compounds aggregated from nine publicly available solubility datasets. It is the largest publicly accessible dataset of its kind, serving both as a reference for measured solubility and as improved, generalizable training data for data‑driven models. The dataset provides 2D descriptors for compounds, with standardized and validated molecular representations and reliability labels.

Source: hugging_face | Updated Aug 1, 2025 | 489 views | Linked

Johnnyeee/Yelpdata_663

Restaurant Reviews | Data Analysis

The Yelp Restaurant dataset primarily comprises user reviews, business ratings, and operational details from the Yelp platform, with a focus on the restaurant category. After processing, it is split into training and test sets, suitable for sentiment analysis, rating prediction, business analytics, and recommendation system tasks. The dataset includes multiple JSON files recording business information, user check‑ins, user reviews, user tips, and user data. Processed data contain business location, rating, review count, operating hours, and the textual content and ratings of user reviews.

Source: hugging_face | Updated Mar 14, 2024 | 312 views | Linked

PIX Payment Transaction Dataset

Payment Systems | Data Analysis

The dataset consists of historical transaction records from a POS system after automation, focusing on PIX payment transactions. It is intended for analyzing transaction behavior, detecting anomalies and trends, and providing visual insights via heatmaps.

Source: github | Updated Sep 20, 2024 | 151 views | Linked

LAMBDA

Video Advertising | Data Analysis

This dataset is primarily used for analyzing and evaluating the effectiveness of video advertisements. It includes video identifiers (video_id), recall scores (recall_score), YouTube video IDs (youtube_id), and ad details (ad_details). The ad_details field is a structured feature containing sub‑features such as Audio, Brand, Duration, etc. The dataset is split into a training set (1,964 samples) and a test set (219 samples). Total size is 5,707,189 bytes, with a download size of 2,281,142 bytes.

Source: huggingface | Updated Jul 3, 2024 | 182 views | Linked

Adult Data Set

Revenue Forecasting | Data Analysis

This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.

Source: github | Updated Jan 31, 2019 | 169 views | Linked

Hiring Decision Analysis Dataset

Hiring Decision | Data Analysis

The dataset contains multiple variables related to recruitment decisions, such as age, gender, education level, work experience, number of previous employers, distance to the company, interview score, skill score, personality score, and recruitment strategy. The target variable is the recruitment decision, classified as hired or not hired.

Source: github | Updated Jul 24, 2024 | 264 views | Linked

fabul0us/football_odds_2023-24

Football Odds | Data Analysis

This dataset contains football betting odds for the 2023‑24 season, covering the Champions League, Europa League, and the five major national leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1). For each match, odds information includes 1X2, double chance, over/under, 2.5‑goal lines, etc., with the corresponding vigorish values computed. Odds were collected multiple times in the days preceding each match, but real‑time odds were not captured. Additionally, pre‑match odds for each competition were collected repeatedly from the start to the end of the competition.
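
The vigorish mentioned above can be illustrated with a minimal sketch. It assumes decimal odds and uses one common definition, the overround (sum of implied probabilities minus 1); the formula the dataset itself uses is not stated here, and the function name `vigorish` is hypothetical.

```python
def vigorish(decimal_odds):
    """Return the overround for one market, given its decimal odds.

    Assumption: vigorish is taken as sum(1 / odds) - 1. Other definitions
    exist, and the dataset's own computation may differ.
    """
    if not decimal_odds or any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must all be greater than 1.0")
    # Each decimal price implies a probability of 1 / price.
    implied = [1.0 / o for o in decimal_odds]
    # A fair book sums to exactly 1; anything above that is the margin.
    return sum(implied) - 1.0


# Illustrative 1X2 prices (home, draw, away); not taken from the dataset.
margin = vigorish([1.6, 4.0, 5.0])
print(f"bookmaker margin: {margin:.3f}")
```

A margin of zero corresponds to a book with no built-in edge; larger values mean a larger bookmaker edge on that market.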

Source: hugging_face | Updated Jun 2, 2024 | 1,004 views | Linked

ames_iowa_housing

Real Estate | Data Analysis

This dataset contains information on residential properties sold in Ames, Iowa, USA, from 2006 to 2010, compiled by Dean De Cock. It serves as an educational resource to replace the older Boston Housing dataset. Detailed documentation is available in `./originals/DataDocumentation.txt`; structured feature metadata are manually extracted into `./features.json`. The primary data file is `AmesHousing.csv`, a lightly pre‑processed version of the original data.

Source: huggingface | Updated Dec 19, 2024 | 239 views | Linked

NEMSIS Dataset

Emergency Medical Services | Data Analysis

This project aims to evaluate various data imputation methods using Emergency Medical Services (EMS) data from the National Emergency Medical Services Information System (NEMSIS), focusing on MICE and MissForest, to identify predictors of ICU cardiac arrest outcomes, particularly with respect to urban versus rural settings.

Source: github | Updated Apr 3, 2024 | 126 views | Linked