JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.


MTA

Multi‑Object Tracking
Video Analysis

MTA (Multi‑Camera Track Auto) is a large multi‑target, multi‑camera tracking dataset containing over 2,800 person identities recorded by six cameras, with each video exceeding 100 minutes. The dataset covers both daytime and nighttime scenes.

GitHub

P2ANET

Video Analysis
Sports Technology

The P2ANET dataset is a large‑scale benchmark for dense action detection from table‑tennis broadcast videos. It consists of two parts: a raw dataset and a processed dataset, collected in two batches (v1 and v2). Video data were captured with an RGB monocular camera, and labels were obtained via manual annotation.

GitHub

COIN Dataset

Video Analysis
Dataset

COIN is currently the largest comprehensive instructional video analysis dataset, containing 11,827 videos covering 180 different tasks across 12 domains. All videos are collected from YouTube and annotated using an efficient toolbox.

GitHub

UCF-101, HMDB-51

Video Analysis
Action Recognition

UCF‑101 and HMDB‑51 are two video datasets used for training and testing video‑processing models. UCF‑101 contains 101 action categories with over 100 videos per category. HMDB‑51 includes 51 action categories with at least 101 videos per category.

GitHub
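UCF‑101 is distributed with train/test split files in which each line pairs a relative clip path with a class index (e.g. `ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1`). As a minimal sketch of working with those lists, the snippet below tallies clips per class; the sample lines are illustrative stand‑ins, not real entries:

```python
from collections import Counter

def count_videos_per_class(split_lines):
    """Count clips per action class from UCF-101-style split lines.

    Each line looks like "ClassName/v_ClassName_g08_c01.avi 1";
    the trailing label index is optional in the test lists.
    """
    counts = Counter()
    for line in split_lines:
        line = line.strip()
        if not line:
            continue
        path = line.split()[0]           # drop the optional label column
        class_name = path.split("/")[0]  # class name is the directory name
        counts[class_name] += 1
    return counts

# Tiny synthetic example (the real lists have 13,320 entries across 101 classes)
lines = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi 1",
    "Archery/v_Archery_g01_c01.avi 2",
]
print(count_videos_per_class(lines))  # Counter({'ApplyEyeMakeup': 2, 'Archery': 1})
```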

OccludeNet

Video Analysis
Occlusion Handling

OccludeNet is a large-scale occlusion video dataset that includes real-world and synthetic occluded-scene videos covering a variety of natural environments. The dataset comprises dynamic tracking occlusions, static scene occlusions, and multi-view interactive occlusions, aiming to fill the gap in occlusion coverage left by existing datasets.

GitHub

MUSIC Dataset

Music Recognition
Video Analysis

This dataset contains the YouTube video IDs used in the Sound of Pixels project, including solo video IDs for the 11‑instrument and 21‑instrument sets as well as duet performance video IDs. Some noisy videos were removed after the paper was published, so the video counts differ slightly from those reported in the paper.

GitHub
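Because the MUSIC release consists of YouTube video IDs rather than raw video, fetching the data starts by turning each ID into a watch URL. A minimal sketch (the IDs shown are placeholders, not real entries from the dataset):

```python
def watch_urls(video_ids):
    """Turn YouTube video ids (as distributed by the dataset) into watch URLs."""
    return ["https://www.youtube.com/watch?v=" + vid for vid in video_ids]

# Placeholder ids; substitute the ids from the dataset's JSON lists
print(watch_urls(["abc123DEF45", "xyz987UVW65"]))
```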

AVA Dataset

Action Recognition
Video Analysis

The AVA dataset densely annotates 80 atomic visual actions across 57.6k movie clips, providing spatio‑temporal localization of actions and yielding 210k action labels, with multiple labels per person occurring frequently in each clip. Key features: (1) atomic visual actions are defined so that data need not be collected separately for every complex action; (2) spatio‑temporal annotations are precise, with potentially multiple annotations per person; (3) the source material is diverse real video (movies).

GitHub
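AVA's annotations are distributed as CSV rows of the form `video_id, timestamp, x1, y1, x2, y2, action_id, person_id`, with box coordinates normalized to [0, 1]. The sketch below groups action IDs by (video, timestamp, person) to surface the "multiple labels per person" property; the rows are synthetic stand‑ins, not real annotations:

```python
import csv
from collections import defaultdict
from io import StringIO

# Synthetic rows in the AVA CSV layout:
# video_id, timestamp, x1, y1, x2, y2 (normalized), action_id, person_id
RAW = """\
vidA,0902,0.077,0.151,0.283,0.811,80,1
vidA,0902,0.077,0.151,0.283,0.811,9,1
vidA,0903,0.226,0.032,0.366,0.497,12,0
"""

def labels_per_person(csv_text):
    """Group action ids by (video, timestamp, person_id): one person can
    carry several simultaneous atomic-action labels."""
    grouped = defaultdict(list)
    for row in csv.reader(StringIO(csv_text)):
        video, ts, _x1, _y1, _x2, _y2, action, person = row
        grouped[(video, ts, person)].append(int(action))
    return dict(grouped)

print(labels_per_person(RAW))
# {('vidA', '0902', '1'): [80, 9], ('vidA', '0903', '0'): [12]}
```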

VidSTG

Video Analysis
Spatio‑Temporal Localization

The VidSTG dataset is built on the video relation dataset VidOR for spatio‑temporal video grounding tasks, especially handling multi‑form sentences. It includes video partition files and sentence annotation files, detailing video IDs, frame counts, frame rates, dimensions, as well as object, relation and temporal ground‑truth annotations.

GitHub

VT-MOT

Multi‑Object Tracking
Video Analysis

The VT‑MOT dataset was created by the Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, at Anhui University. It is a large‑scale visible‑light and thermal‑infrared video benchmark for multi‑object tracking, containing 582 video pairs (401k frame pairs) captured from UAVs, surveillance cameras, and handheld devices, with precise spatio‑temporal alignment and 3.99 million high‑quality bounding boxes. The dataset was produced through meticulous frame‑by‑frame alignment and double‑checked annotation, ensuring high quality and density. VT‑MOT is intended for multi‑object tracking in challenging environments, leveraging the complementary strengths of the visible and thermal modalities.

arXiv
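With spatio‑temporally aligned visible and thermal frame pairs, a common sanity check on box alignment across modalities is intersection‑over‑union. VT‑MOT's exact annotation format isn't described here, so the function below is a generic sketch over (x1, y1, x2, y2) corner boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g. a visible-light box against the matching thermal box
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... (50 / 150)
```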

TSEC-Dataset

Autonomous Driving
Video Analysis

TSEC‑Dataset was developed for training and testing video‑captioning methods in driving scenarios, aiming to describe key events involving the ego vehicle, the road environment, and other traffic participants. The dataset aggregates videos from various sources, including on‑board cameras, public datasets, and traffic‑accident videos downloaded from Bilibili and YouTube, to capture diverse traffic scenes. Videos are segmented into independent clips containing 1–3 key events, totaling 8,000 clips with a cumulative duration of 11.5 hours.

GitHub