JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.


Browse by Category

PHDF-Dataset

Pedestrian Detection
Computer Vision

PHDF‑Dataset is a pedestrian head detection dataset captured with a fisheye camera, containing 2,201 images annotated with 13,887 pedestrian heads. It was released by the Shien‑Ming Wu Lab of the School of Intelligent Engineering, South China University of Technology, for non‑commercial research use.

github
View Details

lmms-lab/llava-interleave-bench

Multimodal Models
Computer Vision

LLaVA‑Interleave Bench is a comprehensive multi‑image dataset collected from public datasets or generated via the GPT‑4V API. The dataset aims to evaluate the interleaved multi‑image reasoning capability of large multimodal models. It was collected in April 2024 and released in June 2024. Its primary use is for research on large multimodal models and chatbots, targeting researchers and enthusiasts in computer vision, natural language processing, machine learning, and AI.

hugging_face
View Details

lego-image-dataset

Computer Vision
3D Models

A Lego part image dataset generated from 3D models, comprising 10 part types with 6,000 unique rendered images per part (60,000 images in total). An additional 60 photographs per class are also provided (600 photos in total).

github
View Details

HOI-dataset

Human‑Computer Interaction
Computer Vision

HOI-dataset is a depth‑map hand‑part segmentation dataset for hand–object interaction, providing download links for training and validation sets.

github
View Details

ffhq-256_training_faces

Face Recognition
Computer Vision

The dataset contains four features: image, original_index, landmark, and mask. The image feature is stored in image format; original_index is an integer; landmark is a sequence of integers; and mask is null. The dataset is divided into two parts: base_transforms (69,426 samples) and random_aug_transforms (26,435 samples). The total download size is 8,177,644,392 bytes and the total dataset size is 8,315,251,492.07 bytes.

hugging_face
View Details
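The split sizes and byte counts above can be sanity-checked in a few lines of Python; the figures below are taken directly from the card:

```python
# Published figures from the ffhq-256_training_faces dataset card.
SPLITS = {"base_transforms": 69_426, "random_aug_transforms": 26_435}
DATASET_BYTES = 8_315_251_492.07

def to_gib(n_bytes: float) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30

total_samples = sum(SPLITS.values())   # 95,861 samples across both parts
size_gib = to_gib(DATASET_BYTES)       # roughly 7.7 GiB on disk
```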

Multi-Mask Inpainting Dataset

Image Restoration
Computer Vision

The dataset is intended for multi‑mask image inpainting tasks. It contains images downloaded from the WikiArt API together with globally and object‑level annotations generated by the Kosmos‑2 and LLaVA models. Creation involved image download, mask generation, and construction of an entity dataset.

github
View Details

CampusGuard

Campus Behavior Monitoring
Computer Vision

The CampusGuard dataset is specifically annotated and categorized for student behaviors in campus environments, aiming to improve the YOLOv8 model with extensive training samples. It includes five main categories: “Using Mobile Phone”, “No Helmet”, “Sleeping”, “Triples”, and “Violence”. These categories cover common behaviors both inside and outside classrooms and reflect the diversity of campus safety and student behavior management.

github
View Details

Doraemon-AI/pdf-layout-chinese

Document Layout Analysis
Computer Vision

pdf-layout-chinese is a Chinese document layout analysis dataset focusing on Chinese scholarly documents (e.g., papers). The dataset provides 10 layout classes: text, title, image, image title, table, table title, header, footer, caption, and formula. It contains 5,000 training images and 1,000 validation images; each image has a correspondingly named JSON annotation file. Annotations were created with labelme and support polygon shapes.

hugging_face
View Details

UFPR-VCR Dataset

Vehicle Color Recognition
Computer Vision

The UFPR Vehicle Color Recognition (UFPR‑VCR) dataset aims to address more complex vehicle color recognition scenarios than previous studies. The dataset contains 10,039 images covering 9,502 vehicles of various categories such as cars, trucks, buses, and vans, and the images exhibit a range of real‑world conditions including front and rear views, partial occlusions, diverse lighting, and nighttime scenes.

github
View Details

geometric-shapes

Geometric Shapes
Computer Vision

The Geometric Shapes dataset is a synthetic collection containing images of various geometric shapes overlaid with random text. Each image has a random‑colored background, a shape (or just text), and a short random string partially occluding the shape. It is designed for shape classification, image recognition, and robustness testing of computer‑vision models.

hugging_face
View Details

imagenet-1k-32x32

Image Classification
Computer Vision

ImageNet is a large‑scale image classification dataset created via crowdsourcing. It contains between 1 million and 10 million images, each labeled with a specific category; all labels are in English. The source data is original, and the task is multi‑class image classification. The dataset card details the features (images and labels) and provides the full list of class names.

hugging_face
View Details

Intentonomy

Social Media Analysis
Computer Vision

Intentonomy is a dataset of 14,455 images created jointly by Cornell University and Facebook AI to understand and analyze human intent behind social‑media images. The images span everyday scenarios and are manually annotated with 28 intent categories using a psychology‑based taxonomy. Labels were collected via a novel “purpose game” on Amazon Mechanical Turk. The dataset supports tasks such as fake‑news detection and improving vision systems’ understanding of human intent.

arXiv
View Details

RGB-D Saliency Datasets

Computer Vision
Salient Object Detection

We have collected and shared multiple ready‑to‑use RGB‑D saliency datasets, including a test set and two popular training sets. The datasets cover various scenes and scales, suitable for RGB‑D saliency detection research.

github
View Details

whyen-wang/coco_captions

Image Captioning
Computer Vision

COCO is a large-scale dataset for object detection, segmentation, and captioning, primarily used for image-to-text tasks. The dataset provides English captions, each image being associated with multiple textual descriptions. Detailed information about dataset creation, annotation processes, or social impact is not supplied.

hugging_face
View Details

PlantDoc

Plant Disease Detection
Computer Vision

PlantDoc is a dataset for visual plant disease detection, containing 2,598 data points covering 13 plant species and up to 17 disease classes, annotated manually over approximately 300 hours from images scraped from the Internet. The dataset aims to achieve early detection of plant diseases via computer‑vision methods, improving classification accuracy by up to 31%.

github
View Details

stockeh/dog-pose-cv

Computer Vision
Animal Behavior Recognition

The dataset contains 20,578 images of dogs in various poses, labeled as ‘standing’, ‘sitting’, ‘lying down’, or ‘undefined’. It is intended for computer‑vision tasks that identify dog behavior from images. The images span 120 dog breeds at varying resolutions; 50% of the images fall between 361 × 333 and 500 × 453 pixels. The dataset is adapted from the Stanford Dogs Dataset with re‑labeled poses. The class distribution is imbalanced: ‘lying down’ has nearly double the images of ‘sitting’, and ‘undefined’ consists mainly of close‑up portraits, which may limit performance on such images. Users should consider class‑balancing techniques such as oversampling or data augmentation.

hugging_face
View Details
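One common mitigation for the imbalance noted above is inverse‑frequency sample weighting (usable, for example, with PyTorch's WeightedRandomSampler). A minimal sketch — the label names come from the card, but the counts here are illustrative only:

```python
from collections import Counter

def sampling_weights(labels):
    """Inverse-frequency weight per sample: rare classes get drawn more often."""
    counts = Counter(labels)
    return [1.0 / counts[lab] for lab in labels]

# Toy label list echoing the card's imbalance ('lying down' ~2x 'sitting').
labels = ["lying down"] * 4 + ["sitting"] * 2 + ["standing"] * 3 + ["undefined"]
weights = sampling_weights(labels)  # 'sitting' samples weigh twice 'lying down' ones
```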

BEHAVE

Human‑Computer Interaction
Computer Vision

BEHAVE is a dataset that captures full‑body human‑object interactions in natural environments. It provides multi‑view RGB‑D frames together with corresponding 3D SMPL and object fittings, as well as annotated contacts between them.

github
View Details

Meehai/dronescapes

Computer Vision
Image Processing

The Dronescapes dataset comprises various representations extracted from drone‑captured videos, including RGB, optical flow, depth, edges, and semantic segmentation. It can be downloaded directly from HuggingFace or generated from raw videos and labels. The dataset is roughly 500 GB, contains video data from multiple scenes, and provides detailed generation and processing steps. It also offers training, validation, semi‑supervised, and test splits, along with tools for data inspection.

hugging_face
View Details

hssd/hssd-hab

3D Scene Understanding
Computer Vision

Habitat Synthetic Scenes Dataset (HSSD) is an artificially created 3D scene dataset designed to more realistically simulate real‑world environments. It contains 211 scenes and over 18,000 models of real‑world objects, covering a variety of indoor settings. The dataset structure includes folders for objects, stages, and scenes, each containing the corresponding 3D models and configuration files. It also supports Habitat 3.0 rearrangement tasks, providing updated colliders, adjusted and de‑cluttered scene contents, receiver meshes, and receiver filter files.

hugging_face
View Details

SynCamVideo Dataset

Computer Vision
Multi‑Camera Synchronization

SynCamVideo Dataset is a multi‑camera synchronized video dataset rendered with Unreal Engine 5. It comprises 1,000 distinct scenes, each captured by 36 cameras, resulting in a total of 36,000 videos. The dataset features 50 different animal species as primary objects and uses 20 locations from Poly Haven as backgrounds. In each scene, 1–2 animals are selected from the 50 species and moved along predefined trajectories while the background is randomly chosen from the 20 locations, with all 36 cameras recording the motion simultaneously.

github
View Details

RichHF-18K

Computer Vision
Deep Learning

The RichHF‑18K dataset contains the extensive human‑feedback labels we collected for our CVPR 24 paper, along with the original filenames of the labeled images. It includes subjective scores (e.g., aesthetic ratings), human‑annotated heatmaps (e.g., regions of pixel‑level distortion), and misalignment marks in textual prompts. The dataset consists of 17,760 examples in TensorFlow Example format, comprising 15,810 training examples, 995 development examples, and 955 test examples.

github
View Details

Awesome Satellite Imagery Datasets

Satellite Imagery
Computer Vision

A list of satellite‑image datasets for computer‑vision and deep‑learning applications. Each dataset entry includes a detailed description covering source, size, resolution, and other attributes.

github
View Details

Open Images dataset

Image Recognition
Computer Vision

Open Images is a dataset containing approximately 9 million images annotated with over 6,000 category labels. The dataset is provided by Google under a CC BY 4.0 license and is split into training and validation sets, each image having a unique 64‑bit ID and possibly multiple labels.

github
View Details

NTU Dataset

Action Recognition
Computer Vision

The NTU dataset is a multi‑view video collection recording 60 different human actions, each captured by three cameras from distinct viewpoints. The data files contain per‑frame skeletal coordinates.

github
View Details

Flickr30K, COCO

Multimodal Learning
Computer Vision

Flickr30K is a multimodal dataset containing images and text, used for training and validating algorithms that mine similarity between images and text. COCO is a large, rich image dataset primarily used for object detection, segmentation, and image captioning tasks.

github
View Details

Mahjong Dataset

Computer Vision
Machine Learning

A computer‑vision dataset for Chinese Mahjong tiles, containing various tile images and their labels for training and testing machine‑learning models.

github
View Details

Phando/vqa_v2

Visual Question Answering
Computer Vision

This dataset, vqa_v2, contains features such as question type, multiple‑choice answer, an answer list (each entry with an answer, answer confidence, and answer ID), image ID, answer type, question ID, question, and image. It is split into training, validation, and test sets of 443,757, 214,354, and 447,793 samples respectively. The download size is 34,818,002,031 bytes, and the total size is 171,555,262,245.114 bytes.

hugging_face
View Details
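Each entry in the answer list above bundles an answer string with a confidence and an ID. A hypothetical record mirroring that nested layout (the field names are assumptions based on the description, not verified against the repository):

```python
# Hypothetical sample shaped like the nested schema described in the card.
sample = {
    "question": "What color is the bus?",
    "question_type": "what color",
    "multiple_choice_answer": "red",
    "answers": [
        {"answer": "red", "answer_confidence": "yes", "answer_id": 1},
        {"answer": "maroon", "answer_confidence": "maybe", "answer_id": 2},
    ],
}

def most_confident(answers):
    """Pick the answer whose annotator confidence ranks highest."""
    rank = {"yes": 2, "maybe": 1, "no": 0}
    return max(answers, key=lambda a: rank[a["answer_confidence"]])["answer"]
```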

DAiSEE

Facial Expression Recognition
Computer Vision

The DAiSEE dataset is used for facial expression recognition and contains 300 frames extracted from each video. Each frame includes 68 facial landmark points and 52 action units.

github
View Details

Door Detection & Classification Dataset

Computer Vision
Image Classification

This dataset is designed for door object detection, classification, and semantic segmentation. Images were captured with an Intel RealSense D435 depth camera, with the camera angle adjusted so the full door region is visible. The dataset is divided into two subsets: one containing 240 images for door detection/semantic segmentation, and another containing 1,206 images for door classification. The classification categories are closed doors, partially open doors, and fully open doors, and the dataset provides detailed training, validation, and test splits.

github
View Details

MD-syn

Multimodal Image Matching
Computer Vision

MD‑syn is a new comprehensive dataset for general multimodal image matching. It is generated from the MegaDepth dataset using the MINIMA data engine and adds six additional modalities: infrared, depth, event, normal, sketch, and painting.

github
View Details

OCID-Grasp

Robotic Grasping
Computer Vision

The OCID Grasp dataset was created by the Institute of Computer Graphics and Vision at Graz University of Technology, Austria. It extends the original OCID dataset with 1,763 RGB‑D images, over 11.4 k object segmentation masks, and more than 75 k manually annotated grasp candidates. Each object is assigned to one of 31 categories. The dataset supports research on robot grasp detection in complex scenes by combining semantic segmentation with grasp detection.

github
View Details

Multi-view Flash/no-flash Dataset

Computer Vision
3D Reconstruction

The dataset contains multi‑view flash/no‑flash images together with corresponding camera poses and point‑cloud initializations generated by COLMAP.

github
View Details

UDIS-D

Image Stitching
Computer Vision

The UDIS‑D dataset provides images and masks for unsupervised deep image stitching. It is split into training and testing sets, each containing multiple images and their corresponding masks.

github
View Details

Pedestrian-Traffic-Lights (PTL)

Computer Vision
Traffic Signal Detection

Pedestrian‑Traffic‑Lights (PTL) is a high‑quality street‑intersection image dataset for detecting pedestrian traffic lights and crosswalks. The images vary in weather, location, orientation, and the size and type of intersections.

github
View Details

HRSC2016

Object Detection
Computer Vision

The HRSC2016 dataset is used for object detection tasks and can be downloaded from the link provided in the repository.

github
View Details

NT-VOT211

Nighttime Visual Tracking
Computer Vision

NT‑VOT211 is a large‑scale night‑time visual object tracking benchmark created by Xinjiang University and other institutions. The dataset contains 211 diverse videos, totaling 211,000 meticulously annotated frames, covering eight attributes, and is designed to evaluate tracking algorithms under night‑time conditions. It was built to address the lack of suitable night‑time data in existing benchmarks. NT‑VOT211 finds applications in security surveillance, autonomous driving, and wildlife protection, especially for object tracking in low‑light environments.

arXiv
View Details

H3WB: Human3.6M 3D WholeBody Dataset

3D Pose Estimation
Computer Vision

H3WB is a large‑scale 3D full‑body pose estimation dataset, an extension of the Human3.6M dataset, containing 133 full‑body keypoints (17 body, 6 foot, 68 face, 42 hand) annotated on 100,000 images. The skeleton layout matches that of the COCO‑WholeBody dataset.

github
View Details

HuggingFaceM4/A-OKVQA

Visual Question Answering
Computer Vision

A‑OKVQA is a visual question answering dataset. Each example provides an image, a question ID, a question, a list of answer choices with the index of the correct choice, free‑form direct answers with a difficulty flag, and rationales justifying the answer. The default configuration contains 17,056 training, 1,145 validation, and 6,702 test examples; the download size is 1,323,807,326 bytes and the total dataset size is 1,328,628,838.125 bytes.

hugging_face
View Details
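Since the card gives the full repository ID, loading the dataset is a one‑liner with the `datasets` library, and the published split sizes can be checked offline. Note the first call triggers the roughly 1.3 GB download:

```python
# Split sizes published in the A-OKVQA dataset card.
SPLITS = {"train": 17_056, "validation": 1_145, "test": 6_702}
total = sum(SPLITS.values())  # 24,903 examples overall

def load_validation_split():
    """Download and return the validation split (requires `pip install datasets`)."""
    from datasets import load_dataset
    return load_dataset("HuggingFaceM4/A-OKVQA", split="validation")
```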

HPatches

Computer Vision
Image Processing

The HPatches dataset contains patches extracted from multiple image sequences, each sequence comprising images of the same scene. Sequences are organized by transformation type into illumination changes and viewpoint changes. Each image sequence provides reference patches and corresponding patches from other images, with patch size 65 × 65 pixels. The dataset is used to evaluate the performance of local descriptors.

github
View Details

2D Geometric Shapes Dataset

Computer Vision
Image Recognition

This repository contains a Python script for generating a 2D geometric shapes dataset, along with the dataset itself. The dataset includes 16 different geometric shapes, each randomly oriented and positioned within 224 × 224 pixel images.

github
View Details
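The generator's core job — placing a randomly oriented and positioned shape inside a 224 × 224 canvas — can be sketched in a few lines for the regular‑polygon case (function names here are illustrative, not taken from the repository):

```python
import math
import random

def regular_polygon(n, cx, cy, r, theta0=0.0):
    """Vertices of a regular n-gon of circumradius r centered at (cx, cy)."""
    step = 2 * math.pi / n
    return [(cx + r * math.cos(theta0 + k * step),
             cy + r * math.sin(theta0 + k * step)) for k in range(n)]

def random_shape(img_size=224, n_sides=5, rng=random):
    """Random radius, position, and orientation, kept fully inside the canvas."""
    r = rng.uniform(20, img_size / 2 - 10)
    cx = rng.uniform(r, img_size - r)
    cy = rng.uniform(r, img_size - r)
    return regular_polygon(n_sides, cx, cy, r, rng.uniform(0, 2 * math.pi))
```

Rasterizing the vertex list (e.g. with Pillow's `ImageDraw.polygon`) and labeling each image with its side count would yield a classification dataset in the same spirit.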

BSDS500/300, BSD68, Set12

Image Processing
Computer Vision

BSDS500/300 is a dataset provided by the Berkeley Vision Lab for image segmentation and contour detection, and is also used for super‑resolution reconstruction. The database contains 200 training images, 100 validation images, and 200 test images, with ground‑truth annotations stored in MAT files. BSD68 is a dataset for image denoising benchmarks and is part of the Berkeley Segmentation Dataset and Benchmark. Set12 contains 12 images for evaluating image denoising algorithms.

github
View Details

minhanhto09/NuCLS_dataset

Breast Cancer
Computer Vision

The NuCLS dataset comprises over 220,000 annotated nuclei from breast cancer images, primarily for developing and validating nucleus detection, classification, and segmentation algorithms. Annotations were performed by pathologists, pathology residents, and medical students, covering both single‑observer and multi‑observer evaluations. The dataset consists of 1,744 entries, each containing high‑resolution RGB images, mask images, visualization images, and nucleus annotation coordinates, split into six folds with separate training and test subsets to assess cross‑institution generalization. It is suitable for image classification, detection, and segmentation tasks.

hugging_face
View Details

SubPipe

Underwater Pipeline Detection
Computer Vision

SubPipe is a dataset created by Ocean Scan Marine Systems & Technology Co., Ltd. for underwater pipeline inspection, supporting SLAM, object detection, and image segmentation tasks. The dataset was collected by a Light‑weight Autonomous Underwater Vehicle (LAUV) in real pipeline inspection environments and includes RGB images, side‑scan sonar images, and inertial navigation system data. Creation involved manual annotation and precise synchronization of sensor data to ensure accuracy and usability. SubPipe is primarily applied to the development and testing of underwater computer‑vision algorithms, especially for pipeline detection and AUV navigation.

arXiv
View Details

facescrub-dataset

Face Recognition
Computer Vision

The dataset contains 47,500 face images, each a 50 × 50 pixel colour image sourced from FaceScrub. It is intended for training and validation; faces were extracted using OpenCV HOG face detection and were not manually cleaned.

github
View Details