JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.


ActPlan-1K

Vision-Language Models
Program Planning

ActPlan‑1K is a multimodal planning benchmark jointly created by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates vision‑language models' program planning abilities in domestic activities. The dataset includes 153 activities and 1,187 instances, each comprising a natural‑language task description and multiple environment images captured from the iGibson2 simulator. The creation process combined ChatGPT and iGibson2, converting BDDL activity definitions into natural‑language descriptions and collecting environment images. ActPlan‑1K is primarily used to assess the program planning capabilities of vision‑language models in multimodal tasks, especially for home activities and counterfactual scenarios.
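
For orientation, the sketch below shows one way an ActPlan-1K-style instance could be represented in Python; the field names and the prompt helper are illustrative assumptions, not the dataset's published schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ActPlanInstance:
        # Hypothetical fields; the real ActPlan-1K schema may differ.
        activity: str            # one of the 153 household activities
        task_description: str    # natural-language description derived from a BDDL definition
        scene_images: List[str]  # paths to environment images rendered in iGibson2
        gold_plan: List[str]     # reference step-by-step action plan

    def build_prompt(inst: ActPlanInstance) -> str:
        """Combine the task text with image placeholders for a vision-language model."""
        image_tags = " ".join("<image>" for _ in inst.scene_images)
        return f"{image_tags}\nTask: {inst.task_description}\nList the steps to complete this task."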

arXiv
View Details

Multi-P2A

Privacy Protection
Vision-Language Models

Multi-P2A is a comprehensive benchmark dataset created by the Institute of Computing Technology, Chinese Academy of Sciences, intended to evaluate the privacy protection capabilities of large vision‑language models (LVLMs). The dataset covers 26 categories of personal privacy, 15 categories of commercial secrets, and 18 categories of state secrets, totaling 31,962 samples. It is constructed from existing datasets and social media platforms, generating samples via visual question answering (VQA) tasks to ensure high quality and diversity. Multi-P2A is mainly applied in privacy risk assessment, helping developers and researchers identify and mitigate potential privacy leaks in LVLMs during training and inference, thereby advancing privacy protection technologies.
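
A minimal sketch of how a privacy-probing VQA sample of this kind might look in code follows; the field names and the toy leakage check are assumptions for illustration, not the released Multi-P2A format.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PrivacyProbe:
        # Hypothetical fields; the published Multi-P2A format may differ.
        category: str      # a personal-privacy, commercial-secret, or state-secret class
        image_path: str    # image that may expose sensitive content
        question: str      # VQA-style question probing whether the model reveals the information
        safe_answer: str   # reference privacy-preserving (refusal) answer

    def leaks(model_answer: str, sensitive_terms: List[str]) -> bool:
        """Toy leakage check: flag an answer that repeats any sensitive term verbatim."""
        answer = model_answer.lower()
        return any(term.lower() in answer for term in sensitive_terms)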

arXiv
View Details

MC-LLaVA Multi-Concept Personalization Dataset

Vision-Language Models
Multi-Concept Personalization

The MC-LLaVA Multi-Concept Personalization dataset is a high-quality collection designed to advance multi-concept personalization research. It gathers images featuring multiple characters from various movies and manually generates multi-concept question‑answer samples. With diverse movie genres and QA types, the dataset aims to enable vision‑language models to excel in multi-concept personalization tasks.
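
As a rough illustration, one multi-concept QA sample could be modeled as below; the fields and concept-token convention are hypothetical and may not match the released files.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MultiConceptQA:
        # Hypothetical fields; the released MC-LLaVA files may be organized differently.
        concepts: List[str]   # personalized characters present in the frame, e.g. ["<char_a>", "<char_b>"]
        image_path: str       # movie frame containing all listed concepts
        question: str         # question that requires telling the concepts apart
        answer: str           # ground-truth answer

    def personalize_question(sample: MultiConceptQA) -> str:
        """Prepend the concept tokens so the model is queried about specific learned characters."""
        return f"Concepts present: {', '.join(sample.concepts)}.\n{sample.question}"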

github
View Details

SPA-VL

Vision-Language Models
Model Safety

SPA-VL is a comprehensive safety preference alignment dataset for vision-language models. It contains 100,788 samples spanning multiple domains and uses diverse model responses and question types to strengthen model safety and effectiveness, ensuring balanced improvement in both harmlessness and helpfulness.
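
A minimal sketch of a safety preference pair and its conversion into the (prompt, chosen, rejected) form consumed by preference-optimization trainers follows; the keys shown are assumptions, not the actual SPA-VL release format.

    from dataclasses import dataclass

    @dataclass
    class SafetyPreferencePair:
        # Hypothetical keys; the actual SPA-VL release may use different ones.
        image_path: str   # image accompanying the question
        question: str     # potentially harmful or benign query about the image
        chosen: str       # response preferred for being both harmless and helpful
        rejected: str     # response dispreferred on safety or helpfulness grounds

    def to_dpo_example(pair: SafetyPreferencePair) -> dict:
        """Turn one pair into the (prompt, chosen, rejected) triple used by preference-optimization trainers."""
        prompt = f"<image>\n{pair.question}"
        return {"prompt": prompt, "chosen": pair.chosen, "rejected": pair.rejected}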

github
View Details

MM-CamObj

Vision-Language Models
Camouflaged Object Detection

The MM‑CamObj dataset, created by Shanghai Jiao Tong University, addresses the difficulties vision‑language models face in complex scenes, especially those containing camouflaged objects. It comprises two subsets: CamObj‑Align (11,363 high‑quality image‑text pairs) for vision‑language alignment, and CamObj‑Instruct (11,363 images with 68,849 diverse dialogues) for instruction fine‑tuning. Images were carefully selected from classic datasets, and detailed descriptions and dialogues were generated with GPT‑4o. MM‑CamObj is primarily used to evaluate and improve vision‑language models on camouflaged‑object detection, localization, and counting tasks.
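
To illustrate the two-subset layout, the sketch below models one CamObj-Align pair and one CamObj-Instruct sample; the field names are assumptions and may differ from the released annotations.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class CamObjAlignPair:
        # Hypothetical fields; the released CamObj-Align annotations may differ.
        image_path: str   # image containing a camouflaged object
        caption: str      # detailed GPT-4o-generated description used for vision-language alignment

    @dataclass
    class CamObjInstructSample:
        # Hypothetical fields; the released CamObj-Instruct annotations may differ.
        image_path: str
        dialogue: List[Dict[str, str]]  # multi-turn [{"role": ..., "content": ...}] instruction data
        task: str                       # e.g. "detection", "localization", or "counting"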

arXiv
View Details