Explore high-quality datasets for your AI and machine learning projects.
The dataset contains over 65,000 photos of more than 5,000 individuals from 40 countries, providing a valuable resource for exploring and developing authentication solutions. It is especially suitable for biometric verification, notably facial recognition in financial services. Each individual is represented by 13 selfie images and 2 ID photos captured with various devices and resolutions. The dataset is intended to support the development of more robust re-identification algorithms and to enhance security measures across applications.
The dataset contains vibration signals from bearings in nine different health states under four different operating conditions. It is publicly available for validating rolling‑bearing diagnostic algorithms.
The dataset was crawled from https://m.bqgui.cc, taking up to 25 chapters per book and yielding 12,740 entries. After three rounds of cleaning, each entry contains the book title, summary, and novel text. Titles are of high quality, summaries are of low usability, and the novel texts have had some ads and symbols removed but still contain low-quality content.
The Chameleon dataset is designed for side‑channel analysis and contains real power‑trace recordings collected from a 32‑bit RISC‑V system‑on‑chip that implements four masking countermeasures (dynamic frequency scaling, random delay, morphing, and chaffing). The traces capture interleaved execution of AES encryption operations with general‑purpose applications. The dataset is divided into four sub‑datasets, each corresponding to one countermeasure, and each sub‑dataset is further split into 16 files based on the value of the first byte of the encryption key. It supports research on segmented methods and side‑channel analysis techniques, especially for devices employing masking countermeasures.
This dataset is for supervised fine‑tuning (SFT) and direct preference optimization (DPO), available in English and Chinese versions. It is based on the four MBTI dimensions, each with two opposing attributes: energy (Extraversion E – Introversion I), information (Sensing S – Intuition N), decision (Thinking T – Feeling F), and execution (Judging J – Perceiving P). The dataset follows the Alpaca format, containing instruction, input, and output. Users can select the appropriate file for SFT or DPO based on the MBTI type.
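For reference, a single Alpaca-format record looks like the sketch below; the field values are illustrative placeholders, not taken from the dataset.

```python
# Illustrative Alpaca-format record for MBTI-conditioned SFT.
# Field names follow the Alpaca convention named above; the values are made up.
mbti_sft_example = {
    "instruction": "Answer the user's question in the voice of an INTJ personality.",
    "input": "How would you plan a weekend trip?",
    "output": "I would start from the trip's objective, then derive a precise itinerary...",
}
```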
This dataset provides real noise images for denoising research, containing 40 different scenes captured by 5 leading‑brand cameras, totaling 100 regions of size 512 × 512, including noisy images and corresponding ground‑truth images.
smoltalk‑chinese is a Chinese fine‑tuning dataset referenced from the SmolTalk dataset, designed to provide high‑quality synthetic data for training large language models (LLMs). The dataset consists entirely of synthetic data, covering more than 700,000 entries, and is composed of multiple parts including tasks referenced from magpie‑ultra, other SmolTalk tasks, simulated daily‑life dialogues, and mathematics problems from the Chinese version of Math23K. The generation process follows strict standards to ensure data quality and diversity. Experiments show that models fine‑tuned on smoltalk‑chinese achieve significant advantages on multiple metrics.
The UW‑Bench dataset, created by the School of Microelectronics and Communication Engineering at Chongqing University, focuses on urban flood detection, containing 7,677 annotated images sourced from surveillance cameras and handheld devices. The dataset was collected under various adverse conditions such as low light, strong reflections, and clear water, aiming to improve model generalization in real‑world applications. Manual annotation ensures data quality, suitable for enhancing accuracy and efficiency of urban flood detection.
This public dataset contains two BigQuery tables; the table used is `citibike_trips`, which holds over 58 million records. The `tripduration` field gives the duration of each bike rental in seconds; the other fields serve as potential features.
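A minimal query sketch with the google-cloud-bigquery client is shown below; the fully qualified table path `bigquery-public-data.new_york_citibike.citibike_trips` is the table's standard public location and is assumed here.

```python
# Sketch: average rental duration over the public citibike_trips table.
# Assumes GCP credentials are configured and the standard public table path.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT AVG(tripduration) AS avg_seconds
    FROM `bigquery-public-data.new_york_citibike.citibike_trips`
    WHERE tripduration IS NOT NULL
"""
for row in client.query(sql).result():
    print(f"Average trip duration: {row.avg_seconds:.0f} s")
```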
PHDF-Dataset is a pedestrian head detection dataset based on a fisheye camera, containing 2,201 images annotated with 13,887 pedestrian heads. It was released by the Shien-Ming Wu School of Intelligent Engineering at South China University of Technology for non-commercial research purposes.
The WHU-Hi dataset (Wuhan UAV-borne Hyperspectral Images) was collected and shared by the RSIDEA research group at Wuhan University, serving as a benchmark for precise crop classification and hyperspectral image classification research. It comprises three independent UAV-borne hyperspectral datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu, all captured over agricultural areas in Hubei Province, China. The data were acquired using a Headwall Nano-Hyperspec sensor mounted on UAV platforms, providing high-spatial-resolution hyperspectral (H2) images. Pre-processing includes radiometric calibration and geometric correction performed with the HyperSpec software supplied by the instrument vendor. Each dataset includes detailed acquisition metadata (time, weather, sensor, flight altitude, image size, number of bands, spatial resolution) and sample counts for various crop classes.
The Real-Vul dataset was developed by the School of Computer Science at the University of Waterloo to provide a comprehensive dataset for evaluating deep-learning models in real-world software vulnerability detection. It contains 5,528 C/C++ function samples drawn from diverse software projects such as the Chromium browser and the Linux operating system. The dataset was created using a time-based split strategy so that training and testing data reflect a realistic chronological separation. Real-Vul is primarily intended for assessing and improving the practical performance of existing vulnerability detection models, especially in complex and varied real-world software environments.
The LEVIR‑CC dataset is a large dataset for remote‑sensing image change caption generation, specifically supporting research on change captioning using dual‑branch Transformers.
We provide a database of hyperspectral images (HSIs) supporting our research. The images cover a variety of real‑world materials and objects. The database is publicly released to the research community. Detailed information can be found in our manuscript.
MTA (Multi‑Camera Track Auto) is a large multi‑target multi‑camera tracking dataset, containing over 2,800 person identities captured by 6 cameras, each video exceeding 100 minutes. The dataset spans both daytime and nighttime periods.
The SECOM dataset presents a rare-event scenario with highly imbalanced output classes. It consists of 1,567 observations and 590 variables; each record represents a single production entity with associated measurement features. The `secom_labels.data` file provides pass/fail labels (-1 = pass, 1 = fail) and timestamps for each test point.
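A hedged pandas sketch for loading the two files is shown below; the whitespace-separated layout of the UCI distribution is assumed.

```python
# Sketch: load SECOM features and labels, assuming the UCI whitespace-separated layout.
import pandas as pd

X = pd.read_csv("secom.data", sep=r"\s+", header=None)              # expect 1567 x 590
labels = pd.read_csv("secom_labels.data", sep=r"\s+", header=None)  # column 0: -1 = pass, 1 = fail
print(X.shape)
print(labels[0].value_counts())  # highly imbalanced: failures are rare
```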
This dataset unifies public datasets for insulator detection and fault classification in power lines, providing the merged data and the code used for merging.
CMM‑Math is a Chinese multimodal mathematics dataset containing over 28,000 high‑quality samples covering 12 grades from primary school to high school. It includes diverse question types such as multiple‑choice and fill‑in‑the‑blank, with detailed solutions. Some questions involve visual context, making the dataset more challenging. The dataset is split into a training set (22,000+ samples) and an evaluation set (5,000+ samples).
The dataset contains user action types, timestamps, and final goals, split into training and testing sets. Each set includes three files: action type, action time, and goal.
An emoji database containing each emoji's group, subgroup, code points, hash, status, rendered emoji, short name, description, and aliases.
LLaVA‑Interleave Bench is a comprehensive multi‑image dataset collected from public datasets or generated via the GPT‑4V API. The dataset aims to evaluate the interleaved multi‑image reasoning capability of large multimodal models. It was collected in April 2024 and released in June 2024. Its primary use is for research on large multimodal models and chatbots, targeting researchers and enthusiasts in computer vision, natural language processing, machine learning, and AI.
The CBSD68 dataset is a color version of the BSD68 benchmark for image denoising, containing original .jpg files converted to lossless .png format and augmented with different levels of Gaussian white noise.
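Since the corruption model is additive Gaussian white noise, noisy copies like those shipped with the dataset can be reproduced along the lines of the sketch below; the sigma value is a common benchmark level, not confirmed by the source.

```python
# Sketch: corrupt a clean CBSD68-style image with additive Gaussian white noise.
# sigma is on the 0-255 intensity scale; 25 is a common benchmark choice (assumed).
import numpy as np
from PIL import Image

def add_gaussian_noise(path: str, sigma: float = 25.0) -> Image.Image:
    clean = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    noisy = clean + np.random.normal(0.0, sigma, size=clean.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

add_gaussian_noise("0001.png", sigma=25.0).save("0001_sigma25.png")  # placeholder filename
```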
The dataset includes multiple features such as question ID, dialogue content from two models, winner label, judge, turn count, anonymity flag, language, timestamp, and OpenAI moderation results. Each dialogue records content and role, turn number, and anonymity status. Toxicity detection results from two large models are also captured, including binary flags and probabilities. The dataset is provided as a training split with 33,000 samples.
The dataset comprises four sub-datasets:
- **songs**: information about Springsteen songs or related tracks, including albums and lyrics;
- **concerts**: records of all Springsteen concerts since 1973, detailing venues, locations, and dates;
- **setlists**: song lists performed at each concert;
- **tours**: information about the tours associated with each concert.
The dataset, named geoQuery‑tableQA, is intended for table‑question answering tasks. It includes fields such as `query`, `question`, `answer`, `table_names`, `tables`, `source`, and `target`. The dataset is split into training (530 samples), validation (49 samples), and test (253 samples) sets, with a total size of 14,896,902 bytes and a download size of 1,988,975 bytes.
The dataset includes images and wind-speed data. The image data type is `image`; the wind-speed data type is `float32`. It also provides a classification label with categories TD, TS, STS, TY, STY, and SuperTY, corresponding to class IDs 0–5. The dataset consists of a single training split with 472 samples, a total size of 109,541,339 bytes, and a download size of 109,547,565 bytes.
We scraped posts that appeared on the Weibo hot list from 2022‑11‑25 to 2023‑03‑08 (only posts from the day they trended) together with their associated comments.
This dataset is used to analyze stroke, employing machine learning models and resampling techniques such as SMOTEENN to improve prediction accuracy and address dataset imbalance.
The dataset contains 100 images of various fruits and vegetables captured under controlled lighting conditions using a Living Optics camera. Data types include RGB images, sparse spectral samples, and instance segmentation masks. The dataset includes over 430,000 spectral samples, of which more than 85,000 belong to one of 19 categories. Additionally, 13 labeled images are provided as a validation set along with some unlabeled demonstration videos. The dataset is primarily used for image segmentation and classification tasks.
This dataset is used for intent recognition tasks, containing user queries and their corresponding correct intents. The dataset is divided into training and validation sets, used respectively for model training and performance evaluation.
M4‑Instruct is a multi‑image dataset collected in April 2024 from public datasets and the GPT‑4V API, intended for training large multimodal models. It is used for research on large multimodal models and chatbots, targeting audiences in computer vision, natural language processing, machine learning, and AI.
SHOT7M2 is a synthetic, hierarchical, compositional basketball behavior dataset with 7.2 million frames that demonstrate hierarchical organization of basketball actions. Based on the animation model of Starke et al., it uses neural state machines to predict future character poses. The dataset contains 4,000 clips, each with 1,800 frames, performed by a single agent executing various basketball motions. Each clip is labeled with one of four activity types: casual play, intense play, dribbling training, or no play. It provides 26‑keypoint skeletal poses and combinations of 4 activity types, 12 actions, and 14 basic motions.
FIT‑RS is a large‑scale fine‑grained instruction‑tuning dataset containing 1,800,851 high‑quality instruction samples, designed to enhance the fine‑grained understanding capabilities of Remote Sensing Large Multimodal Models (RSLMMs).
The VRAI dataset supports vehicle re‑identification research in aerial imagery. With the rapid growth of unmanned aerial vehicles (UAVs), UAV‑based visual applications have attracted increasing attention from both industry and academia. However, public UAV vehicle re‑ID datasets are scarce, limiting research despite potential applications such as long‑term tracking and visual object retrieval. VRAI addresses this gap by providing a large‑scale, publicly available collection of UAV‑captured vehicle images with comprehensive annotations.
The dataset was reprocessed by the authors based on portions of the Le2i and SisFall datasets. Video clips of activities of daily living (ADL) and fall actions were converted into images using the OpenCV library, after which a human segmentation neural network was applied to obtain images containing only the human foreground, followed by binarization and morphological erosion and dilation. The dataset is divided into two parts, each comprising five categories: Blank (no person), Fall, Like-fall (unstable balance), Lie, and Stand.
Lego part image dataset generated from 3‑D models, comprising 10 part types with 6,000 unique rendered images per part (total 60,000 images). Additionally, 60 photographic images per class are provided (total 600 photos).
This dataset supports the development and evaluation of Weapon Engagement Zone (WEZ) prediction models, generated through high‑fidelity air‑combat simulations. The dataset captures various scenarios and conditions representing engagements between shooter aircraft and targets.
DocBank is a new large‑scale dataset constructed using weak supervision, designed to provide integrated text and layout information for downstream tasks. Currently, the DocBank dataset contains 500,000 pages of documents, with 400,000 for training, 50,000 for validation, and 50,000 for testing. Annotations are machine‑generated, the language is English, and the dataset is monolingual. Fields include image, token, bounding box, color, font, and label.
The dataset contains photos of miners taken in various mines. Each image is annotated with bounding‑box detection of miners and an attribute indicating whether the miner is sitting or standing. It is suitable for computer‑vision and safety‑assessment applications and is valuable for researchers, employers, and policymakers in the mining sector. The dataset structure includes raw images, bounding‑box annotations, and XML files containing box coordinates and labels.
The FungiTastic dataset was created jointly by the University of West Bohemia, INRIA and the Czech Technical University in Prague. It comprises approximately 350,000 records and over 650,000 fungal photographs with detailed metadata. The dataset stems from twenty years of continuous data collection and supports various machine learning tasks such as closed‑set and open‑set classification. Rich metadata includes timestamps, camera settings, geographic coordinates, satellite imagery, and biological taxonomy. It is widely used for image classification problems in biology, particularly for fungal identification.
The dataset contains small‑size pathology images with corresponding labels indicating the presence of tumor tissue. Images are 96 × 96 pixels and were provided as part of a Kaggle competition.
This dataset contains speech contributions from the Common Voice community on the web platform; all contributions are included regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, accent, etc.
---
license: mit
---

<p align="center">
  <h1>G-buffer Objaverse</h1>
</p>

G-buffer Objaverse: High-Quality Rendering Dataset of Objaverse.

[Chao Xu](mailto:eric.xc@alibaba-inc.com), [Yuan Dong](mailto:yuandong15@fudan.edu.cn), [Qi Zuo](mailto:muyuan.zq@alibaba-inc.com), [Junfei Zhang](mailto:miracle.zjf@alibaba-inc.com), [Xiaodan Ye](mailto:doris.yxd@alibaba-inc.com), [Wenbo Geng](mailto:rengui.gwb@alibaba-inc.com), [Yuxiang Zhang](mailto:yuxiangzhang.zyx@alibaba-inc.com), [Xiaodong Gu](https://scholar.google.com.hk/citations?user=aJPO514AAAAJ&hl=zh-CN&oi=ao), [Lingteng Qiu](https://lingtengqiu.github.io/), [Zhengyi Zhao](mailto:bushe.zzy@alibaba-inc.com), [Qing Ran](mailto:ranqing.rq@alibaba-inc.com), [Jiayi Jiang](mailto:jiayi.jjy@alibaba-inc.com), [Zilong Dong](https://scholar.google.com/citations?user=GHOQKCwAAAAJ&hl=zh-CN&oi=ao), [Liefeng Bo](https://scholar.google.com/citations?user=FJwtMf0AAAAJ&hl=zh-CN)

## [Project page](https://aigc3d.github.io/gobjaverse/)
## [Github](https://github.com/modelscope/richdreamer/tree/main/dataset/gobjaverse)
## [YouTube](https://www.youtube.com/watch?v=PWweS-EPbJo)
## [RichDreamer](https://aigc3d.github.io/richdreamer/)
## [ND-Diffusion Model](https://github.com/modelscope/normal-depth-diffusion)

## TODO

- [ ] Release objaverse-xl alignment rendering data

## News

- We have released a compressed version of the datasets; check the downloading tips! (01.14, 2024 UTC)
- Thanks to [JunzheJosephZhu](https://github.com/JunzheJosephZhu) for improving the robustness of the downloading scripts. You can now restart the download script from the break point. (01.12, 2024 UTC)
- Release 10-category annotation of the Objaverse subset (01.06, 2024 UTC)
- Release G-buffer Objaverse rendering dataset (01.06, 2024 UTC)

## Download

- Download the gobjaverse ***(6.5T)*** rendering dataset using the following scripts.

```bash
# download the gobjaverse_280k index file
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k.json

# Example: python ./scripts/data/download_gobjaverse_280k.py ./gobjaverse_280k ./gobjaverse_280k.json 10
python ./download_gobjaverse_280k.py /path/to/savedata /path/to/gobjaverse_280k.json nthreads(eg. 10)

# Or, if the network is not so good, we have provided a compressed version with each object as a tar file.
# To download the compressed version (only 260k tar files):
python ./download_objaverse_280k_tar.py /path/to/savedata /path/to/gobjaverse_280k.json nthreads(eg. 10)

# download the gobjaverse_280k/gobjaverse index to objaverse
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k_index_to_objaverse.json
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_index_to_objaverse.json

# download the Cap3D text-caption file
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/text_captions_cap3d.json
```

- The 10 general categories are Human-Shape (41,557), Animals (28,882), Daily-Used (220,222), Furnitures (19,284), Buildings&&Outdoor (116,545), Transportations (20,075), Plants (7,195), Food (5,314), Electronics (13,252) and Poor-quality (107,001).
- Download the category annotation using the following scripts.

```bash
# download the category annotation
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/category_annotation.json

# If you want to download a specific category in gobjaverse280k:
# Step 1: download the index file of the specified category.
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/gobjaverse_280k_split/gobjaverse_280k_{category_name}.json # category_name: Human-Shape, ...
# Step 2: download using the script.
# Example: python ./scripts/data/download_gobjaverse_280k.py ./gobjaverse_280k_Human-Shape ./gobjaverse_280k_Human-Shape.json 10
python ./download_gobjaverse_280k.py /path/to/savedata /path/to/gobjaverse_280k_{category_name}.json nthreads(eg. 10)
```

## Folder Structure

- The structure of the gobjaverse rendering dataset:

```
|-- ROOT
    |-- dictionary_id
        |-- instance_id
            |-- campos_512_v4
                |-- 00000
                    |-- 00000.json        # Camera Information
                    |-- 00000.png         # RGB
                    |-- 00000_albedo.png  # Albedo
                    |-- 00000_hdr.exr     # HDR
                    |-- 00000_mr.png      # Metalness and Roughness
                    |-- 00000_nd.exr      # Normal and Depth
                |-- ...
```

### Coordinate System

#### Normal Coordinate System

The 3D coordinate system definition is very complex; it is difficult for us to say which camera system is used. Fortunately, what we want is to map the world normals of the rendering system to the Normal-Bae system, as the following figure illustrates:

*(figure: mapping from world-space normals to the Normal-Bae system)*

where the U-axis and V-axis denote the width-axis and height-axis in image space, respectively, and xyz is the Normal-Bae camera-view coordinate system. Note that the public rendering system for Objaverse is a Blender-based system:

*(figure: Blender-based coordinate system)*

However, our rendering system is defined in a **Unity-based system**:

*(figure: Unity-based coordinate system)*

*A question is: how do we plug in Blender's coordinate system directly without introducing a new coordinate system?* A possible solution is to keep the world-to-camera transfer matrix as in the Blender setting, *transferring the Unity-based system to the Blender-based system*.

We provide example code to visualize the coordinate mapping.

```bash
# example of coordinate experiments
## download datasets
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/Lingtengqiu/render_data_examples.zip
unzip render_data_examples.zip

## visualize the Blender-based system, and warp world-space normals to the Normal-Bae system.
python ./process_blender_dataset.py

## visualize our system, and warp world-space normals to the Normal-Bae system.
python ./process_unity_dataset.py
```

#### Depth Warping

We provide an example demonstrating how to obtain the intrinsic matrix K and warp a reference image to the target image based on the reference depth map.

```bash
# build quick-zbuff code
mkdir -p ./lib/build
g++ -shared -fpic -o ./lib/build/zbuff.so ./lib/zbuff.cpp

# a demo for depth-based warping
# python ./depth_warp_example.py $REFVIEW $TARGETVIEW
python3 ./depth_warp_example.py 0 3
```

## Citation

```
@article{qiu2023richdreamer,
    title={RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D},
    author={Lingteng Qiu and Guanying Chen and Xiaodong Gu and Qi Zuo and Mutian Xu and Yushuang Wu and Weihao Yuan and Zilong Dong and Liefeng Bo and Xiaoguang Han},
    year={2023},
    journal = {arXiv preprint arXiv:2311.16918}
}
```

```
@article{objaverse,
    title={Objaverse: A Universe of Annotated 3D Objects},
    author={Matt Deitke and Dustin Schwenk and Jordi Salvador and Luca Weihs and Oscar Michel and Eli VanderBilt and Ludwig Schmidt and Kiana Ehsani and Aniruddha Kembhavi and Ali Farhadi},
    journal={arXiv preprint arXiv:2212.08051},
    year={2022}
}
```
The BETA dataset from Tsinghua University is a large‑scale benchmark database designed to support research on SSVEP‑BCI applications.
The RAiD dataset is an indoor–outdoor cross-scenario re-identification dataset collected at Winston Chung Hall, UC Riverside. It contains images from four cameras (two indoor and two outdoor), comprising 6,920 images of 43 individuals. Detailed annotations include images, foreground masks, camera IDs, and person IDs.
The dataset contains a feature named `signal` of type float32 and a feature named `label_id` of type int32. It is split into training, validation, and test sets with 9,900, 539, and 1,100 samples respectively. Total download size is 35,524,532 bytes; dataset size is 12,000,560 bytes.
This dataset contains 5,000 randomly selected movie records from IMDB, with 28 attributes for each record.
---
license: other
license_name: conceptual-12m
license_link: LICENSE
task_categories:
- image-to-text
size_categories:
- 10M<n<100M
---

# Dataset Card for Conceptual Captions 12M (CC12M)

## Dataset Description

- **Repository:** [Conceptual 12M repository](https://github.com/google-research-datasets/conceptual-12m)
- **Paper:** [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981)
- **Point of Contact:** [Conceptual Captions e-mail](mailto:conceptual-captions@google.com)

### Dataset Summary

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).

### Usage

This instance of Conceptual Captions is in [webdataset](https://github.com/webdataset/webdataset/commits/main) .tar format. It can be used with the webdataset library or upcoming releases of Hugging Face `datasets`.

...More detail TBD

### Data Splits

This dataset was downloaded using img2dataset. Images were resized on download if the shortest edge was > 512, to shortest edge = 512.

#### Train

- `cc12m-train-*.tar`
- Downloaded on 2021/18/22
- 2176 shards, 10968539 samples

## Additional Information

### Dataset Curators

Soravit Changpinyo, Piyush Sharma, Nan Ding and Radu Soricut.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```bibtex
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}
```
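Given the webdataset format noted in the Usage section, a minimal streaming sketch follows; the shard naming matches the `cc12m-train-*.tar` pattern above, and the `.jpg`/`.txt` keys inside each sample are assumptions based on typical img2dataset output.

```python
# Sketch: stream CC12M image-text pairs with the webdataset library.
# Shard pattern and per-sample keys (.jpg, .txt) are assumptions.
import webdataset as wds

dataset = (
    wds.WebDataset("cc12m-train-{0000..2175}.tar")  # local or remote shard URLs
    .decode("pil")               # decode images to PIL.Image
    .to_tuple("jpg", "txt")      # (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break
```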
iHarmony4 is the first large‑scale image harmonization public benchmark dataset, containing four sub‑datasets: HCOCO, HAdobe5k, HFlickr, and Hday2night. Each sub‑dataset includes synthetic composite images, foreground masks of the composite images, and the corresponding real images.
This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.
The dataset comprises five features: title, author, text, identifier, and category. It provides a single training split containing 47,829 samples, with a total size of 12,701,752 bytes and a download size of 9,836,711 bytes.
WebVi3D is a multi‑view image dataset containing 320 M frames extracted from 16 M video clips, used for training See3D models. The dataset expands training data by automatically filtering video clips with inconsistent multi‑view information or insufficient observations, yielding a high‑quality, diverse multi‑view image collection.
This is a multimodal remote-sensing dataset covering Hunan Province in 2017, comprising Sentinel-2, Sentinel-1, and SRTM data. It contains 400 training images (256 × 256) and 50 validation and test images. The terrain ruggedness index (TRI) in the training set is computed from SRTM via GDAL and can be used for knowledge reconstruction.
The NYC Yellow Taxi trip records capture pickup and drop‑off dates and times, pickup and drop‑off locations, trip distance, itemized fare details, rate type, payment type, and the number of passengers reported by the driver.
---
annotations_creators:
- expert-generated
language_creators:
- other
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|natural_questions
task_categories:
- question-answering
task_ids:
- open-domain-qa
pretty_name: NQ-Open
dataset_info:
  config_name: nq_open
  features:
  - name: question
    dtype: string
  - name: answer
    sequence: string
  splits:
  - name: train
    num_bytes: 6651236
    num_examples: 87925
  - name: validation
    num_bytes: 313829
    num_examples: 3610
  download_size: 4678245
  dataset_size: 6965065
configs:
- config_name: nq_open
  data_files:
  - split: train
    path: nq_open/train-*
  - split: validation
    path: nq_open/validation-*
  default: true
---

# Dataset Card for nq_open

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://efficientqa.github.io/
- **Repository:** https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
- **Paper:** https://www.aclweb.org/anthology/P19-1612.pdf
- **Leaderboard:** https://ai.google.com/research/NaturalQuestions/efficientqa
- **Point of Contact:** [Mailing List](efficientqa@googlegroups.com)

### Dataset Summary

The NQ-Open task, introduced by Lee et al. 2019, is an open-domain question answering benchmark that is derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.

### Supported Tasks and Leaderboards

Open-Domain Question Answering. EfficientQA Leaderboard: https://ai.google.com/research/NaturalQuestions/efficientqa

### Languages

English (`en`)

## Dataset Structure

### Data Instances

```
{
    "question": "names of the metropolitan municipalities in south africa",
    "answer": [
        "Mangaung Metropolitan Municipality",
        "Nelson Mandela Bay Metropolitan Municipality",
        "eThekwini Metropolitan Municipality",
        "City of Tshwane Metropolitan Municipality",
        "City of Johannesburg Metropolitan Municipality",
        "Buffalo City Metropolitan Municipality",
        "City of Ekurhuleni Metropolitan Municipality"
    ]
}
```

### Data Fields

- `question` - Input open-domain question.
- `answer` - List of possible answers to the question.

### Data Splits

- Train: 87925
- Validation: 3610

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

Natural Questions contains questions from aggregated queries to Google Search (Kwiatkowski et al., 2019). To gather an open version of this dataset, we only keep questions with short answers and discard the given evidence document. Answers with many tokens often resemble extractive snippets rather than canonical answers, so we discard answers with more than 5 tokens.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

Evaluating on this diverse set of question-answer pairs is crucial, because all existing datasets have inherent biases that are problematic for open-domain QA systems with learned retrieval. In the Natural Questions dataset the question askers do not already know the answer. This accurately reflects a distribution of genuine information-seeking questions. However, annotators must separately find correct answers, which requires assistance from automatic tools and can introduce a moderate bias towards results from the tool.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

All of the Natural Questions data is released under the [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

### Citation Information

```
@article{doi:10.1162/tacl\_a\_00276,
    author = {Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
    title = {Natural Questions: A Benchmark for Question Answering Research},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {7},
    pages = {453-466},
    year = {2019},
    doi = {10.1162/tacl\_a\_00276},
    URL = {https://doi.org/10.1162/tacl_a_00276},
    eprint = {https://doi.org/10.1162/tacl_a_00276},
    abstract = {We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.}
}

@inproceedings{lee-etal-2019-latent,
    title = "Latent Retrieval for Weakly Supervised Open Domain Question Answering",
    author = "Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1612",
    doi = "10.18653/v1/P19-1612",
    pages = "6086--6096",
    abstract = "Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.",
}
```

### Contributions

Thanks to [@Nilanshrajput](https://github.com/Nilanshrajput) for adding this dataset.
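As a usage note, the card's config and splits load directly with Hugging Face `datasets`; a minimal sketch:

```python
# Sketch: load NQ-Open and inspect one validation example.
from datasets import load_dataset

nq = load_dataset("nq_open")   # splits: train (87,925) and validation (3,610)
sample = nq["validation"][0]
print(sample["question"])      # input question string
print(sample["answer"])        # list of acceptable answer strings
```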
ASAG2024 is a comprehensive short‑answer grading benchmark dataset created by the Zurich University of Applied Sciences. It comprises seven commonly used short‑answer grading datasets, totaling 19,000 question‑answer‑score triples across multiple subjects and education levels. Scores are normalized between 0 and 1 to facilitate comparison across datasets. The creation involved integrating data from multiple sources and standardizing them. The dataset is primarily used to evaluate and compare the performance of automated grading systems, aiming to address automation and generalizability challenges in short‑answer grading.
The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. The dataset contains various email samples divided into spam and ham categories, with further sub‑groups such as hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Structure includes fields like label, group, text, and raw; only a training split is provided.
The dataset used in this paper is a curated subset of GRAZPEDWRI‑DX, intended for fine‑grained recognition of wrist pathologies. It includes image data for training, validation, and testing sets.
UNSW-NB15 is a comprehensive network intrusion detection dataset for academic research. It was compared statistically with the KDD99 dataset and was released by Nour Moustafa and Jill Slay; the associated publications must be cited when the dataset is used.
ActPlan‑1K is a multimodal planning benchmark jointly created by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates vision‑language models' program planning abilities in domestic activities. The dataset includes 153 activities and 1,187 instances, each comprising a natural‑language task description and multiple environment images captured from the iGibson2 simulator. The creation process combined ChatGPT and iGibson2, converting BDDL activity definitions into natural‑language descriptions and collecting environment images. ActPlan‑1K is primarily used to assess the program planning capabilities of vision‑language models in multimodal tasks, especially for home activities and counterfactual scenarios.
The HaGRID sample dataset contains 127,331 images at 384p resolution, sampled from the original HaGRID dataset (552,992 images at 1080p, total size 716 GB). It was created to enable usage on free tiers of Google Colab and Kaggle Notebooks. The dataset is primarily intended for object‑detection tasks and includes annotations for multiple hand‑gesture categories such as call, no_gesture, like, etc. Annotation data comprise bounding‑box coordinates and class labels.
The MV-Video dataset is a large‑scale multi‑view video dataset consisting of 53 K rendered animated 3D objects. It is used to train the Animate3D model ([Animate3D: Animating Any 3D Model with Multi‑view Video Diffusion](https://animate3d.github.io/)).
The BIOSCAN_1M insect dataset provides information about insects. Each record includes four primary attributes: a DNA barcode sequence, a barcode index number (BIN), a taxonomic rank annotation, and an RGB image. The DNA barcode sequence gives the nucleotide arrangement; the BIN serves as an alternative to Linnaean names, providing a gene-centered taxonomy; the taxonomic rank annotation classifies organisms hierarchically based on evolutionary relationships; and the RGB images are raw images drawn from the 16 most densely sampled insect orders. The dataset also illustrates class distribution and class imbalance, which are inherent characteristics of insect communities.
This dataset records microplastic presence in drinking water. Each row represents a water‑sample record, containing microplastic material and type, color, water source type (tap or bottled), and the sampling location's latitude and longitude. The dataset focuses on polyethylene (PE) material for predicting PE levels across different geographic locations.
The multimodal distracted driving detection dataset (2016‑2018) and the earlier dataset (2014‑2016) include detailed information such as the number of drivers and average driving time.
The AdvancedEdit dataset was created via a novel data construction pipeline, featuring high visual quality, complex instructions, and good background consistency at a large scale.
HOI-dataset is a depth‑map hand‑part segmentation dataset for hand–object interaction, providing download links for training and validation sets.
The dataset records body‑fat percentage, age, weight, height, and ten body‑circumference measurements (e.g., waist) for 252 male subjects. Body‑fat, a health indicator, is accurately estimated via underwater weighing. By applying multivariate regression, body‑fat can be predicted using only a scale and a measuring tape, providing a convenient method for estimating male body‑fat.
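A short sketch of that regression idea follows; the file name and column names are placeholders (assumed), since the distribution's exact headers are not given here.

```python
# Sketch: predict body-fat percentage from scale-and-tape measurements
# with ordinary least squares. File and column names are assumed placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("bodyfat.csv")            # placeholder filename
X = df[["abdomen", "weight", "height"]]    # assumed column names
y = df["bodyfat"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```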
The Observed Antibody Space dataset is a dataset in the biology and protein domain, offered in paired and unpaired configurations. The paired configuration contains sequence information for both heavy and light chains, while the unpaired configuration contains sequences for a single chain only. The dataset size is between 10M and 100M, making it suitable for large-scale data processing.
GLENDA (Gynecologic Laparoscopy Endometriosis Dataset) comprises over 350 annotated images of endometriosis lesions from more than 100 gynecologic laparoscopic surgeries, plus over 13,000 unannotated non‑pathological images from more than 20 surgeries. It is designed for automated content analysis, particularly for endometriosis detection. The dataset offers binary and multi‑class classification configurations, each with detailed feature descriptions and split information. Annotations were produced by medical experts specializing in endometriosis treatment. Use is restricted to scientific research and prohibited for commercial purposes.
The dataset comprises mental‑health counseling dialogues. Primary features include `Context`, `Response`, and `conversations`. Each conversation entry contains a `from` field (sender) and a `value` field (content). The training split contains 3,512 samples with a total size of 9,356,552 bytes.
SupplyGraph is a benchmark dataset for supply chain planning, especially suited for applications of Graph Neural Networks (GNNs). The dataset originates from a leading Fast-Moving Consumer Goods (FMCG) company in Bangladesh and includes time‑series data as node features for sales forecasting, production planning, and factory problem identification. Researchers can apply GNNs to a variety of supply‑chain problems, advancing analysis and planning in the field.
This dataset was automatically generated during the evaluation of model CultriX/MonaTrix‑v4‑7B‑DPO. It comprises 63 configurations, each mapping to a specific evaluation task. Each run creates a split named after its timestamp; the `train` split always points to the latest results. An additional `results` configuration aggregates outcomes from all runs for metric computation on the Open LLM Leaderboard.
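Such auto-generated detail datasets typically load as sketched below; the repository ID and config name follow the usual Open LLM Leaderboard convention and are assumptions, not confirmed by this description.

```python
# Sketch: load the latest results of one evaluation task.
# Repo ID and config name are assumed from the leaderboard's naming convention.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_CultriX__MonaTrix-v4-7B-DPO",  # assumed repo ID
    "harness_gsm8k_5",                                           # assumed config name
    split="train",  # per the description, 'train' always points to the latest run
)
print(details[0])
```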
This dataset is primarily designed for code‑related question‑answer tasks and includes features such as labNo, taskNo, questioner, question, code, startLine, endLine, questionType, answer, src, code_processed, id, raw_code, raw_comment, comment, and q_code. The dataset is split into a training set containing 35,360 samples, with a total size of 46,842,820 bytes and a download size of 17,749,500 bytes.
The Ego-QA-19k dataset contains 19k video-question-answer pairs in first-person-view scenarios. Dataset creation involved two stages: first, video subtitles were concatenated chronologically to generate video descriptions, and GPT-4 then generated 20 questions per video; second, questions containing specific cue words were filtered out, and graduate-level native English speakers verified the authenticity of each question and the video length required to answer it.
The dataset contains play‑by‑play NBA game data and shot‑detail information from the 1996/97 season through the 2023/24 season, sourced from stats.nba.com, data.nba.com, and pbpstats.com.
FRACTAL is a benchmark dataset for 3D point‑cloud semantic segmentation, comprising 100,000 point clouds covering a 250 km² area across five regions in France. Originating from the Lidar HD project, it balances rare classes through an efficient sampling strategy and includes challenging landscapes. Each point cloud spans 50 × 50 m with high point density (average 37 points/m²). The dataset provides seven semantic classes and is colorized using high‑resolution aerial imagery. It is split into training (80,000), validation (10,000), and test (10,000) subsets, with metadata documenting acquisition years and distribution details.
The dataset consists of six interrelated tables containing key information about customers, transactions, branches, and merchants: the customers, genders, cities, transactions, branches, and merchants tables.
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: conversation_id
    dtype: int64
  - name: embedding
    sequence: float64
  - name: cluster
    dtype: int64
  splits:
  - name: train
    num_bytes: 145562012
    num_examples: 14666
  download_size: 42803368
  dataset_size: 145562012
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "med_alpaca_standardized_cluster_8"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The dataset contains 25,000 color images divided into five classes, each class comprising 5,000 images. All images are 768 × 768 pixels in JPEG format.
MetaMathQA is a dataset enhanced from the training sets of GSM8K and MATH, without using any test‑set data. Each original question can be found in `meta-math/MetaMathQA`; these items originate from GSM8K or MATH training sets.
SnapGarden v0.6 is a dataset containing 1,000 images of 25 plant species; each image has been augmented five times and is accompanied by a question‑answer style description, intended to help AI learn how to describe plants. The dataset is split into training, validation, and test sets, suitable for image captioning, plant recognition, and educational content development. It is released under the MIT license, emphasizing respect for original image owners and copyright.
The P2ANET dataset is a large‑scale benchmark for dense action detection from table‑tennis broadcast videos. It consists of two parts: a raw dataset and a processed dataset, collected in two batches (v1 and v2). Video data were captured with an RGB monocular camera, and labels were obtained via manual annotation.
The dataset contains four features: image, original_index, landmark, and mask. The image feature is stored as an image format, original_index is an integer, landmark is a sequence of integers, and mask is null. The dataset is divided into two parts: base_transforms (69,426 samples) and random_aug_transforms (26,435 samples). Total download size is 8,177,644,392 bytes and total dataset size is 8,315,251,492.07 bytes.
The Brazilian E‑Commerce Public Dataset by Olist contains order information from 2016‑2018 across multiple marketplaces in Brazil, with 100,000 orders. Features allow multi‑dimensional analysis of orders, including status, price, payment, shipping performance, customer location, product attributes, and customer reviews. A geographic dataset with latitude‑longitude coordinates linked to Brazilian postal codes is also provided.
The `raw` folder contains the original AMASS dataset files. The `amass_joints_h36m_60.pkl` file is a preprocessed version that can be generated by running the `convert_amass.py` script. The `train.tar.bz2` file is the converted dataset; simply extract it to the appropriate location for use.
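A small sketch of that extract-and-load step, with placeholder paths; the pickle's internal structure is whatever `convert_amass.py` produced and is not documented here.

```python
# Sketch: extract the converted archive and load the preprocessed joints file.
import pickle
import tarfile

with tarfile.open("train.tar.bz2", "r:bz2") as tar:
    tar.extractall("data/amass")  # placeholder destination

with open("amass_joints_h36m_60.pkl", "rb") as f:
    joints = pickle.load(f)
print(type(joints))               # structure depends on convert_amass.py
```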
This dataset is based on the Danbooru2018 dataset for anime character recognition, containing 1 million images and 70,000 characters. The dataset has been processed to generate 1 million head images and their corresponding character labels. The character label distribution is long-tailed, with an average of 13.85 images per label.
The GLH-Bridge dataset is a large‑scale dataset for bridge detection in large‑size very high resolution (VHR) remote sensing images. More details can be found in the paper "Learning to Holistically Detect Bridges from Large‑Size VHR Remote Sensing Imagery" (TPAMI 2024).
This dataset contains programming contest‑related problems and test case information. Each sample includes problem title, problem content, platform, problem ID, contest ID, contest date, starter code, difficulty, public test cases, private test cases, and metadata. The dataset is provided as a single test set with 713 samples, total size 3.79 GB. The configuration name is `default`, and data files are located at `data/test-*`.
The IST-3 CT head scan dataset was created by the Clinical Brain Sciences Centre at the University of Edinburgh, containing 10,659 CT series for research on intracranial arterial calcification segmentation. The dataset originates from the third International Stroke Trial (IST-3), involving non‑contrast CT scans of 3,035 acute ischemic stroke patients. During creation, registration to a template and quality control ensured validity and accuracy. The dataset primarily supports deep‑learning methods for stroke risk assessment, especially automatic quantification of intracranial arterial calcifications.