VidSTG

VidSTG is a dataset for spatio‑temporal video grounding, built on the video relation dataset VidOR and designed for grounding multi‑form sentences. It consists of video partition files and sentence annotation files. The annotations record each video's ID, frame count, frame rate, and dimensions, together with object, relation, and temporal ground‑truth annotations.

Updated 4/22/2024

Dataset Overview

Source

VidSTG: Constructed from the video relation dataset VidOR.

Composition

Original VidOR: 7,000 training videos, 835 validation videos, and 2,165 test videos (test annotations are unavailable and thus omitted).
VidSTG: 10% of the original VidOR training videos are held out as validation data; the original VidOR validation set serves as the test set.

Contents

Video Partition Files: train_files.json, val_files.json, test_files.json, each containing the video IDs for its split.
Sentence Annotation Files: train_annotations.json, val_annotations.json, test_annotations.json.
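
As a minimal sketch of how a partition file and its annotation file might be read together, the snippet below assumes that each partition file is a JSON list of video IDs and that each annotation file is a JSON list of records with a video‑ID field. The `vid` key and the helper name `load_split` are assumptions for illustration and should be checked against the actual files.

```python
import json
from pathlib import Path


def load_split(root, split, vid_key="vid"):
    """Load one split and keep only annotations whose video is in that split.

    Assumes <split>_files.json is a JSON list of video IDs and
    <split>_annotations.json is a JSON list of dicts. Both layouts,
    and the default key name "vid", are assumptions.
    """
    root = Path(root)
    with open(root / f"{split}_files.json", encoding="utf-8") as f:
        video_ids = set(json.load(f))
    with open(root / f"{split}_annotations.json", encoding="utf-8") as f:
        annotations = json.load(f)
    # Drop any record that points at a video outside the split.
    kept = [a for a in annotations if a.get(vid_key) in video_ids]
    return video_ids, kept
```

Calling `load_split("path/to/VidSTG", "train")` would return the set of training video IDs and the matching annotation records; the directory path here is a placeholder.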

Annotation Structure

Video ID: Unique identifier.
Frame Count: Number of frames.
Frame Rate: Frames per second.
Resolution: Width and height.
Subject/Object List: IDs and categories.
Temporal Segment: Frame range of the video segment used for the annotation.
Relations: Subject ID, object ID, predicate, and frame range.
Temporal Ground‑Truth: Time span of each relation.
Caption: Descriptive sentence.
Question: Query sentence about the video.
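
Because temporal segments and ground truth are expressed as frame ranges, converting them to seconds requires the video's frame rate. The sketch below shows that conversion. The record layout and the key names `fps`, `temporal_gt`, `begin_fid`, and `end_fid` are assumptions for illustration, and the values are made up.

```python
def temporal_span_seconds(begin_frame, end_frame, fps):
    """Convert a frame range to a (start, end) pair in seconds.

    Each frame index is simply divided by the frame rate; whether the
    end frame is inclusive depends on the annotation convention, which
    this sketch does not assume.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if end_frame < begin_frame:
        raise ValueError("end_frame must not precede begin_frame")
    return begin_frame / fps, end_frame / fps


# Hypothetical record: key names and values are illustrative only.
record = {
    "frame_count": 900,
    "fps": 30.0,
    "temporal_gt": {"begin_fid": 150, "end_fid": 450},
}

gt = record["temporal_gt"]
span = temporal_span_seconds(gt["begin_fid"], gt["end_fid"], record["fps"])
print(span)  # (5.0, 15.0)
```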

Citation

If you use this dataset, please cite:

VidSTG paper: Zhang, Zhu et al. "Where Does It Exist: Spatio‑Temporal Video Grounding for Multi‑Form Sentences". CVPR, 2020.
VidOR paper: Shang, Xindi et al. "Annotating Objects and Relations in User‑Generated Videos". International Conference on Multimedia Retrieval, 2019.
