MTA (Multi‑Camera Track Auto) is a large multi‑target multi‑camera tracking dataset containing over 2,800 person identities recorded by 6 cameras, with each video exceeding 100 minutes. The dataset covers both daytime and nighttime periods.
The P2ANET dataset is a large‑scale benchmark for dense action detection from table‑tennis broadcast videos. It consists of two parts: a raw dataset and a processed dataset, collected in two batches (v1 and v2). Video data were captured with an RGB monocular camera, and labels were obtained via manual annotation.
COIN is currently the largest dataset for comprehensive instructional video analysis, containing 11,827 videos that cover 180 distinct tasks across 12 domains. All videos are collected from YouTube and annotated with an efficient annotation toolbox.
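The step annotations are distributed as a single JSON file. Below is a minimal reading sketch; the top-level `database` key and the per-entry field names reflect the layout published in the COIN repository as best understood here, so treat them as assumptions to verify against the release:

```python
import json

# Minimal sketch for reading COIN step annotations.
# Assumes a COIN.json layout of {"database": {video_id: {...}}};
# the field names below are best-effort assumptions.
with open("COIN.json") as f:
    database = json.load(f)["database"]

for video_id, entry in database.items():
    task = entry.get("class")            # task name for this video
    for step in entry.get("annotation", []):
        start, end = step["segment"]     # step boundaries in seconds
        print(video_id, task, step["label"], start, end)
```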
UCF‑101 and HMDB‑51 are two widely used video benchmarks for training and evaluating action‑recognition models. UCF‑101 contains 101 action categories with over 100 videos per category; HMDB‑51 covers 51 action categories with at least 101 videos per category.
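Both benchmarks have ready-made loaders in torchvision (UCF101 and HMDB51 dataset classes). A minimal sketch using UCF101 follows; the `videos/` and `splits/` paths are placeholders for the extracted videos and the official train/test split files, which must be downloaded separately:

```python
from torchvision.datasets import UCF101

# Minimal loading sketch using torchvision's built-in UCF101 class.
# "videos/" and "splits/" are placeholder paths you must supply.
dataset = UCF101(
    root="videos/",
    annotation_path="splits/",
    frames_per_clip=16,   # clip length in frames
    train=True,
)
video, audio, label = dataset[0]   # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```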
OccludeNet is a large‑scale occlusion video dataset that includes real‑world and synthetic occluded‑scene videos covering a variety of natural environments. The dataset comprises dynamic tracking occlusions, static scene occlusions, and multi‑view interactive occlusions, addressing the lack of systematic occlusion coverage in existing video datasets.
This dataset contains the YouTube video IDs used in the Sound of Pixels project, including solo video IDs for the 11‑ and 21‑instrument sets as well as duet performance video IDs. After the paper was published, some noisy videos were removed, so the video counts differ slightly from those reported in the paper.
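The ID lists can be turned into raw videos with a downloader such as yt-dlp. A minimal sketch, where solo_ids.txt is a placeholder name for one of the provided ID files (ignoreerrors skips IDs that have since been taken down):

```python
from yt_dlp import YoutubeDL

# Minimal download sketch: read one YouTube ID per line and fetch the videos.
# "solo_ids.txt" is a placeholder for one of the released ID lists.
with open("solo_ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

urls = [f"https://www.youtube.com/watch?v={vid}" for vid in ids]
opts = {
    "format": "mp4",
    "outtmpl": "%(id)s.%(ext)s",  # save each video as <id>.mp4
    "ignoreerrors": True,         # some IDs may no longer be available
}
with YoutubeDL(opts) as ydl:
    ydl.download(urls)
```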
The AVA dataset densely annotates 80 atomic visual actions in 57.6k movie clips, providing spatio‑temporal localization of actions and yielding 210k action labels, with multiple person labels frequently appearing in each clip. Key features: (1) atomic visual actions are defined so that data need not be collected separately for every complex action; (2) annotations are precise in space and time, with potentially multiple annotations per person; (3) the source material is diverse, realistic video (movies).
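AVA distributes its labels as plain CSV rows, one row per person box and action. A minimal parsing sketch follows, assuming the published AVA column order (video_id, timestamp, normalized box corners, action_id, person_id) and a placeholder file name:

```python
import csv

# Sketch for reading AVA-style action annotations.
# Assumed column order (per the published AVA CSV convention):
#   video_id, timestamp, x1, y1, x2, y2, action_id, person_id
# Box coordinates are normalized to [0, 1].
with open("ava_train.csv") as f:  # placeholder file name
    for video_id, ts, x1, y1, x2, y2, action_id, person_id in csv.reader(f):
        box = tuple(map(float, (x1, y1, x2, y2)))
        print(video_id, float(ts), box, int(action_id), int(person_id))
```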
The VidSTG dataset is built on the video relation dataset VidOR for spatio‑temporal video grounding, with an emphasis on multi‑form sentences (both declarative and interrogative). It includes video partition files and sentence annotation files detailing video IDs, frame counts, frame rates, and dimensions, as well as object, relation, and temporal ground‑truth annotations.
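A hypothetical sketch of iterating over the sentence annotations is shown below; the JSON key names are illustrative assumptions mirroring the metadata listed above, not the confirmed schema, so check them against the released files:

```python
import json

# Hypothetical sketch of reading VidSTG sentence annotations.
# Key names below are illustrative assumptions, not the released schema;
# they mirror the metadata described above (video ID, frame count, fps,
# dimensions, and temporal ground truth).
with open("vidstg_annotations.json") as f:
    annotations = json.load(f)

for ann in annotations:
    vid = ann["vid"]
    frames, fps = ann["frame_count"], ann["fps"]
    width, height = ann["width"], ann["height"]
    begin, end = ann["temporal_gt"]["begin_fid"], ann["temporal_gt"]["end_fid"]
    print(vid, frames, fps, (width, height), (begin, end))
```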
The VT‑MOT dataset was created by the Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, at Anhui University. It is a large‑scale visible‑light and thermal‑infrared video benchmark for multi‑object tracking, containing 582 video pairs (401k frame pairs) captured from UAVs, surveillance cameras, and handheld devices, with precise spatio‑temporal alignment and 3.99 million high‑quality bounding boxes. The dataset was produced through meticulous frame‑by‑frame alignment and double‑checked annotation, ensuring high quality and density. VT‑MOT is intended for multi‑object tracking in challenging environments, leveraging the complementary strengths of the visible and thermal modalities.
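Ground truth of this kind is commonly stored one box per line in MOT Challenge style. A hedged parsing sketch follows, assuming (without confirmation) that VT-MOT uses the frame,id,x,y,w,h convention; the file name is a placeholder:

```python
from collections import defaultdict

# Hedged sketch: parse MOT Challenge-style ground-truth lines of the form
#   frame, track_id, x, y, w, h, ...
# Whether VT-MOT uses exactly this layout is an assumption; adjust to the
# released annotation format.
tracks = defaultdict(list)  # track_id -> list of (frame, box)
with open("gt.txt") as f:   # placeholder file name
    for line in f:
        fields = line.strip().split(",")
        frame, track_id = int(fields[0]), int(fields[1])
        x, y, w, h = map(float, fields[2:6])
        tracks[track_id].append((frame, (x, y, w, h)))

print(f"{len(tracks)} tracks parsed")
```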
TSEC‑Dataset was developed for training and testing video‑captioning methods in driving scenarios, aiming to describe key events involving the ego vehicle, the road environment, and other traffic participants. The dataset aggregates videos from various sources, including on‑board cameras, public datasets, and traffic‑accident videos downloaded from Bilibili and YouTube, to capture diverse traffic scenes. Videos are segmented into independent clips containing 1‑3 key events, totaling 8,000 video clips with a cumulative duration of 11.5 hours.