MTA (Multi‑Camera Track Auto) is a large multi‑target multi‑camera tracking dataset containing over 2,800 person identities recorded by 6 cameras, with each video exceeding 100 minutes. The dataset covers both daytime and nighttime periods.
The P2ANET dataset is a large‑scale benchmark for dense action detection from table‑tennis broadcast videos. It consists of two parts: a raw dataset and a processed dataset, collected in two batches (v1 and v2). Video data were captured with an RGB monocular camera, and labels were obtained via manual annotation.
COIN is currently the largest dataset for comprehensive instructional video analysis, containing 11,827 videos that cover 180 distinct tasks across 12 domains. All videos are collected from YouTube and annotated with an efficient annotation toolbox.
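The step annotations are distributed as a single JSON file. Below is a minimal reading sketch; the top-level `database` key and the per-entry field names reflect the layout published in the COIN repository as best understood here, so treat them as assumptions to verify against the release:

```python
import json

# Minimal sketch for reading COIN step annotations.
# Assumes a COIN.json layout of {"database": {video_id: {...}}};
# the field names below are best-effort assumptions.
with open("COIN.json") as f:
    database = json.load(f)["database"]

for video_id, entry in database.items():
    task = entry.get("class")            # task name for this video
    for step in entry.get("annotation", []):
        start, end = step["segment"]     # step boundaries in seconds
        print(video_id, task, step["label"], start, end)
```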
UCF‑101 and HMDB‑51 are two widely used video benchmarks for training and evaluating action‑recognition models. UCF‑101 contains 101 action categories with over 100 videos per category; HMDB‑51 covers 51 action categories with at least 101 videos per category.
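Both benchmarks have ready-made loaders in torchvision (UCF101 and HMDB51 dataset classes). A minimal sketch using UCF101 follows; the `videos/` and `splits/` paths are placeholders for the extracted videos and the official train/test split files, which must be downloaded separately:

```python
from torchvision.datasets import UCF101

# Minimal loading sketch using torchvision's built-in UCF101 class.
# "videos/" and "splits/" are placeholder paths you must supply.
dataset = UCF101(
    root="videos/",
    annotation_path="splits/",
    frames_per_clip=16,   # clip length in frames
    train=True,
)
video, audio, label = dataset[0]   # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```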
OccludeNet is a large‑scale occlusion video dataset that includes real‑world and synthetic occluded‑scene videos covering a variety of natural environments. The dataset comprises dynamic tracking occlusions, static scene occlusions, and multi‑view interactive occlusions, addressing the lack of systematic occlusion coverage in existing video datasets.
This dataset contains the YouTube video IDs used in the Sound of Pixels project, including solo video IDs for the 11‑ and 21‑instrument sets as well as duet performance video IDs. After the paper was published, some noisy videos were removed, so the video counts differ slightly from those reported in the paper.
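The ID lists can be turned into raw videos with a downloader such as yt-dlp. A minimal sketch, where solo_ids.txt is a placeholder name for one of the provided ID files (ignoreerrors skips IDs that have since been taken down):

```python
from yt_dlp import YoutubeDL

# Minimal download sketch: read one YouTube ID per line and fetch the videos.
# "solo_ids.txt" is a placeholder for one of the released ID lists.
with open("solo_ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

urls = [f"https://www.youtube.com/watch?v={vid}" for vid in ids]
opts = {
    "format": "mp4",
    "outtmpl": "%(id)s.%(ext)s",  # save each video as <id>.mp4
    "ignoreerrors": True,         # some IDs may no longer be available
}
with YoutubeDL(opts) as ydl:
    ydl.download(urls)
```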
The AVA dataset densely annotates 80 atomic visual actions in 57.6k movie clips, providing spatio‑temporal localization of actions and yielding 210k action labels, with multiple person labels frequently appearing in each clip. Key features: (1) atomic visual actions are defined so that data need not be collected separately for every complex action; (2) annotations are precise in space and time, with potentially multiple annotations per person; (3) the source material is diverse, realistic video (movies).
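AVA distributes its labels as plain CSV rows, one row per person box and action. A minimal parsing sketch follows, assuming the published AVA column order (video_id, timestamp, normalized box corners, action_id, person_id) and a placeholder file name:

```python
import csv

# Sketch for reading AVA-style action annotations.
# Assumed column order (per the published AVA CSV convention):
#   video_id, timestamp, x1, y1, x2, y2, action_id, person_id
# Box coordinates are normalized to [0, 1].
with open("ava_train.csv") as f:  # placeholder file name
    for video_id, ts, x1, y1, x2, y2, action_id, person_id in csv.reader(f):
        box = tuple(map(float, (x1, y1, x2, y2)))
        print(video_id, float(ts), box, int(action_id), int(person_id))
```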
The VidSTG dataset is built on the video relation dataset VidOR for spatio‑temporal video grounding, with an emphasis on multi‑form sentences (both declarative and interrogative). It includes video partition files and sentence annotation files detailing video IDs, frame counts, frame rates, and dimensions, as well as object, relation, and temporal ground‑truth annotations.
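A hypothetical sketch of iterating over the sentence annotations is shown below; the JSON key names are illustrative assumptions mirroring the metadata listed above, not the confirmed schema, so check them against the released files:

```python
import json

# Hypothetical sketch of reading VidSTG sentence annotations.
# Key names below are illustrative assumptions, not the released schema;
# they mirror the metadata described above (video ID, frame count, fps,
# dimensions, and temporal ground truth).
with open("vidstg_annotations.json") as f:
    annotations = json.load(f)

for ann in annotations:
    vid = ann["vid"]
    frames, fps = ann["frame_count"], ann["fps"]
    width, height = ann["width"], ann["height"]
    begin, end = ann["temporal_gt"]["begin_fid"], ann["temporal_gt"]["end_fid"]
    print(vid, frames, fps, (width, height), (begin, end))
```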
The VT‑MOT dataset was created by the Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, at Anhui University. It is a large‑scale visible‑light and thermal‑infrared video benchmark for multi‑object tracking, containing 582 video pairs (401k frame pairs) captured from UAVs, surveillance cameras, and handheld devices, with precise spatio‑temporal alignment and 3.99 million high‑quality bounding boxes. The dataset was produced through meticulous frame‑by‑frame alignment and double‑checked annotation, ensuring high quality and density. VT‑MOT is intended for multi‑object tracking in challenging environments, leveraging the complementary strengths of the visible and thermal modalities.
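Ground truth of this kind is commonly stored one box per line in MOT Challenge style. A hedged parsing sketch follows, assuming (without confirmation) that VT-MOT uses the frame,id,x,y,w,h convention; the file name is a placeholder:

```python
from collections import defaultdict

# Hedged sketch: parse MOT Challenge-style ground-truth lines of the form
#   frame, track_id, x, y, w, h, ...
# Whether VT-MOT uses exactly this layout is an assumption; adjust to the
# released annotation format.
tracks = defaultdict(list)  # track_id -> list of (frame, box)
with open("gt.txt") as f:   # placeholder file name
    for line in f:
        fields = line.strip().split(",")
        frame, track_id = int(fields[0]), int(fields[1])
        x, y, w, h = map(float, fields[2:6])
        tracks[track_id].append((frame, (x, y, w, h)))

print(f"{len(tracks)} tracks parsed")
```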
TSEC‑Dataset was developed for training and testing video‑captioning methods in driving scenarios, aiming to describe key events involving the ego vehicle, the road environment, and other traffic participants. The dataset aggregates videos from various sources, including on‑board cameras, public datasets, and traffic‑accident videos downloaded from Bilibili and YouTube, to capture diverse traffic scenes. Videos are segmented into independent clips containing 1‑3 key events, totaling 8,000 video clips with a cumulative duration of 11.5 hours.