Explore high-quality datasets for your AI and machine learning projects.
This public dataset contains two BigQuery tables; the table used is `citybike_trips`, containing over 58 million records. The `tripduration` field indicates the duration of each bike rental (in seconds); other fields serve as potential features.
The dataset records body‑fat percentage, age, weight, height, and ten body‑circumference measurements (e.g., waist) for 252 male subjects. Body‑fat, a health indicator, is accurately estimated via underwater weighing. By applying multivariate regression, body‑fat can be predicted using only a scale and a measuring tape, providing a convenient method for estimating male body‑fat.
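As a rough illustration of the regression idea in the body‑fat entry, the sketch below fits an ordinary least‑squares model in pure Python. The rows are invented toy numbers, not values from the dataset, and the choice of two predictors (weight and waist circumference) is an assumption made for brevity; the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(features, target):
    """Return least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    x = [[1.0] + list(row) for row in features]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predicted target for one feature row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Toy data (invented for illustration): (weight_kg, waist_cm) -> body-fat %
toy_features = [(70, 80), (80, 90), (90, 100), (75, 95), (85, 85), (95, 110)]
# The target follows an exact linear rule so the fit can be checked by eye.
toy_target = [2.0 + 0.1 * w + 0.15 * c for w, c in toy_features]

coef = fit_ols(toy_features, toy_target)
```

With real data the fit would not be exact, and a numerical library would normally replace the hand‑written solver.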
The Brazilian E‑Commerce Public Dataset by Olist contains order information from 2016‑2018 across multiple marketplaces in Brazil, with 100,000 orders. Features allow multi‑dimensional analysis of orders, including status, price, payment, shipping performance, customer location, product attributes, and customer reviews. A geographic dataset with latitude‑longitude coordinates linked to Brazilian postal codes is also provided.
This dataset was collected for the TREC 2016 Contextual Suggestion Track and contains 228,778 points of interest such as parks, restaurants, and museums. Stored in JSON format, the data were gathered between January 30, 2017 and February 22, 2017. It is intended for research use only and must be cited according to the associated papers.
2017‑2019 sky images and photovoltaic power generation dataset for short‑term solar forecasting (Stanford benchmark).
This report provides a detailed analysis of US gun violence data collected from 2013 to 2018, aiming to better understand the hazards of US gun culture. The analysis integrates information on age groups, gender, states, locations, as well as socioeconomic data such as population, per‑capita income, and unemployment rates to predict the most dangerous and safest states. Additionally, the report attempts to forecast which months and weekdays are more dangerous or safer for citizens, generating a risk score to predict the safest month, day, and state.
The Global Terrorism Dataset contains data on over 181,000 terrorist attacks worldwide. In addition to city and country information, it provides latitude and longitude for each incident, which makes the data well suited to map‑based visualization. Its features can be used to assess attack intensity by year and region, detect temporal or geographic trends, and examine relationships between attack characteristics and success or failure rates. Such analysis can help governmental organizations make informed decisions to enhance public safety and prepare for potential attacks.
This dataset includes 21,889 outfits from polyvore.com for training, validation, and testing. Each outfit contains name, view count, clothing items, image URL, likes, upload date, and description. The dataset also provides question‑answer pairs for evaluating fill‑in‑the‑blank fashion recommendation tasks, as well as data for fashion compatibility prediction tasks.
The FineWeb‑Edu dataset is a selection of educational web pages from the FineWeb dataset, containing 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3‑70B‑Instruct, was used to select high‑quality educational content. The dataset aims to provide high‑quality educational data for language‑model training.
The LeetCode with Solutions dataset contains solutions to LeetCode problems. Features include problem ID, problem content, title slug, tags, difficulty level, problem hints, and content. The dataset has a single training split with 34,903 examples (119,458,837 bytes). It is intended for text‑generation tasks, is primarily in English, and carries the tag 'code'.
This is an integrated anime database combining data from subsplease, MyAnimeList, and Nyaa.si. Users can discover the most popular anime and those with reliable torrent magnet links. The database updates daily and includes 770 anime titles and a total of 11,137 episodes, each with detailed information such as ID, title, type, episode count, status, rating, Nyaa search link, magnet links, seed count, download count, and last update time.
The Ford GoBike Trip dataset contains information on individual rides from a bike‑sharing system, covering the San Francisco Bay Area and surrounding regions. Each trip is anonymized and includes trip duration (seconds), start time and date, end time and date, start station ID, start station name, start station latitude, start station longitude, end station ID, end station name, end station latitude, end station longitude, bike ID, user type (subscriber or customer), member birth year, and member gender.
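The station coordinates and trip duration listed in the Ford GoBike entry are enough to derive a straight‑line distance and an average speed per trip. The sketch below uses the haversine formula; the dictionary keys (`duration_sec`, `start_station_latitude`, and so on) are assumed column names and may differ from the actual file headers.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def average_speed_kmh(trip):
    """Straight-line speed for one trip record (a dict with assumed key names)."""
    distance = haversine_km(
        trip["start_station_latitude"],
        trip["start_station_longitude"],
        trip["end_station_latitude"],
        trip["end_station_longitude"],
    )
    hours = trip["duration_sec"] / 3600
    return distance / hours
```

Because riders rarely travel in a straight line, this gives a lower bound on distance and speed; round trips that start and end at the same station yield zero.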
EATD-Corpus is a dataset of audio and text files from 162 volunteers who received counseling. The training set contains data from 83 volunteers (19 depressed and 64 non‑depressed), and the validation set contains data from 79 volunteers (11 depressed and 68 non‑depressed). Each folder contains a volunteer’s depression data, including raw audio, preprocessed audio, audio transcripts, and depression scores.
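The split counts in the EATD-Corpus entry imply a noticeable class imbalance, which the short sketch below makes explicit. The counts are taken from the description above; the variable and function names are illustrative only.

```python
# Volunteer counts per split, as stated in the dataset description.
SPLITS = {
    "train": {"depressed": 19, "non_depressed": 64},
    "validation": {"depressed": 11, "non_depressed": 68},
}


def split_summary(splits):
    """Return the total size and depressed share of each split."""
    summary = {}
    for name, counts in splits.items():
        total = sum(counts.values())
        summary[name] = {
            "total": total,
            "depressed_share": counts["depressed"] / total,
        }
    return summary


summary = split_summary(SPLITS)
total_volunteers = sum(s["total"] for s in summary.values())
```

The depressed class is a minority in both splits, so metrics such as F1 or recall are likely to be more informative than plain accuracy.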
The dataset comprises all professional CS:GO matches from 2012 to 2023, totaling 126,872 matches, each with 155 distinct data points.
This dataset contains call‑center performance data analyzed with Power BI. It provides key performance indicators (KPIs), call volume trends, and agent performance insights to help stakeholders understand operational efficiency, identify improvement areas, and make data‑driven decisions.
The Indian Patent Dataset provides detailed information on all patent applications submitted in India in 2010, 2011, and 2019, including application number, title, filing date, inventor and applicant information, and patent status. It aims to give researchers, policy makers, and businesses insight into the Indian patent landscape, supporting policy decision‑making, business intelligence, legal compliance, and scholarly research.
The dataset contains various data files for Tableau data‑visualization projects, providing insights generated through interactive dashboards and analytical reports.
FiveThirtyEight readers' responses to the Food Frequency Questionnaire dataset, used for research and analysis of food intake frequency.
This dataset includes features such as tweet content, user name, and user ID, suitable for training models. The download size of the dataset is 2,160 bytes, but the actual data size is 0 bytes.
This dataset contains research data on YouTube video advertisement consumption, specifically including video‑ad impressions, ad‑API information, watch‑API information, and daily view‑count time‑series for ads.
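This is the CityData dataset, covering 13 existing cities and used for the CityBench research. To use it, download and unzip the zip file, then place the extracted `citydata` folder in the `CityBench` directory.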
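The setup step for CityData can be sketched with the Python standard library. This assumes the archive contains a top‑level `citydata/` folder; the function name and any archive file name passed to it are hypothetical, since the entry does not name the zip file.

```python
import zipfile
from pathlib import Path


def install_citydata(archive_path, citybench_dir):
    """Extract the archive so that `citydata` sits directly under the CityBench directory."""
    citybench_dir = Path(citybench_dir)
    citybench_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(citybench_dir)
    target = citybench_dir / "citydata"
    # Assumption: the zip holds a top-level `citydata/` folder; fail loudly otherwise.
    if not target.is_dir():
        raise FileNotFoundError(f"expected folder not found after extraction: {target}")
    return target
```

A call would look like `install_citydata("path/to/downloaded.zip", "CityBench")`, with the first argument replaced by the actual download location.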
AqSolDB, created by the Autonomous Materials Discovery (AMD) research group, contains aqueous solubility data for 9,982 unique compounds aggregated from nine publicly available solubility datasets. It is the largest publicly accessible dataset of its kind, serving both as a valuable reference for measured solubility and as improved, generalizable training data for data‑driven models. The dataset provides 2D descriptors for compounds, with standardized and validated molecular representations and reliability labels.
The Yelp Restaurant dataset primarily comprises user reviews, business ratings, and operational details from the Yelp platform, with a focus on the restaurant category. After processing, it is split into training and test sets, suitable for sentiment analysis, rating prediction, business analytics, and recommendation system tasks. The dataset includes multiple JSON files recording business information, user check‑ins, user reviews, user tips, and user data. Processed data contain business location, rating, review count, operating hours, and the textual content and ratings of user reviews.
The dataset consists of historical transaction records from a POS system after automation, focusing on PIX payment transactions. It is intended for analyzing transaction behavior, detecting anomalies and trends, and providing visual insights via heatmaps.
This dataset is primarily used for analyzing and evaluating the effectiveness of video advertisements. It includes video identifiers (video_id), recall scores (recall_score), YouTube video IDs (youtube_id), and ad details (ad_details). The ad_details field is a structured feature containing sub‑features such as Audio, Brand, Duration, etc. The dataset is split into a training set (1,964 samples) and a test set (219 samples). Total size is 5,707,189 bytes, with a download size of 2,281,142 bytes.
This is a widely used dataset for predicting whether income exceeds $50K per year based on provided census data.
The dataset contains multiple variables related to recruitment decisions, such as age, gender, education level, work experience, number of previous employers, distance to the company, interview score, skill score, personality score, and recruitment strategy. The target variable is the recruitment decision, classified as hired or not hired.
This dataset contains football betting odds for the 2023‑24 season, covering the Champions League, Europa League, and the five major national leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1). For each match, odds information includes 1X2, double chance, over/under, 2.5‑goal lines, etc., with the corresponding vigorish values computed. Odds were collected multiple times in the days preceding each match, but real‑time odds were not captured. Additionally, pre‑match odds for each competition were collected repeatedly from the start to the end of the competition.
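The entry does not say how the bookmaker margin was computed. One common definition, used in the sketch below, is the overround: the sum of the implied probabilities of all outcomes in a market, minus 1. The sketch assumes decimal odds, and the example odds are invented rather than taken from the dataset.

```python
def implied_probabilities(decimal_odds):
    """Implied probability of each outcome: the reciprocal of its decimal odds."""
    return [1.0 / odd for odd in decimal_odds]


def vigorish(decimal_odds):
    """Bookmaker margin as overround: sum of implied probabilities minus 1."""
    return sum(implied_probabilities(decimal_odds)) - 1.0


def fair_probabilities(decimal_odds):
    """Implied probabilities rescaled to sum to 1 (proportional margin removal)."""
    probs = implied_probabilities(decimal_odds)
    total = sum(probs)
    return [p / total for p in probs]


# Invented 1X2 odds (home win, draw, away win) for illustration.
example_1x2 = [2.0, 3.2, 4.0]
margin = vigorish(example_1x2)
fair = fair_probabilities(example_1x2)
```

Other conventions exist, for example expressing the margin as a share of the total book, so values computed this way may not match the dataset's own column exactly.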
This dataset contains information on residential properties sold in Ames, Iowa, USA, from 2006 to 2010, compiled by Dean De Cock. It serves as an educational resource to replace the older Boston Housing dataset. Detailed documentation is available in `./originals/DataDocumentation.txt`; structured feature metadata are manually extracted into `./features.json`. The primary data file is `AmesHousing.csv`, a lightly pre‑processed version of the original data.
This project aims to evaluate various data imputation methods using Emergency Medical Services (EMS) data from the National Emergency Medical Services Information System (NEMSIS), focusing on MICE and MissForest, to identify predictors of ICU cardiac arrest outcomes, particularly with respect to urban versus rural settings.