Multi-P2A

Multi-P2A is a comprehensive benchmark dataset created by the Institute of Computing Technology, Chinese Academy of Sciences, to evaluate the privacy protection capabilities of large vision-language models (LVLMs). The dataset covers 26 categories of personal privacy, 15 categories of commercial secrets, and 18 categories of state secrets, totaling 31,962 samples. It is built from existing datasets and social media platforms, and its samples are posed as visual question answering (VQA) tasks, with the aim of high quality and diversity. Multi-P2A is mainly used for privacy risk assessment: it helps developers and researchers identify and mitigate potential privacy leaks in LVLMs during training and inference.

Source

arXiv

Created

Dec 27, 2024

Updated

Dec 27, 2024

Overview

Dataset description and usage context

Multi-P²A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Dataset Overview

Multi-P²A is a comprehensive benchmark for assessing privacy protection capabilities of large vision‑language models (LVLMs).

Dataset Content

Privacy Awareness: Evaluates the model's ability to recognize privacy‑sensitive input data, including images, queries, and privacy information flow risks across various scenarios.
Privacy Leakage: Assesses the risk of private information leaking through model outputs, in three categories: (1) extracting private information from images, (2) inferring private information from images, and (3) leaking sensitive information from training data.

Dataset Scale

Total Samples: 31,962
Privacy Categories:
- Personal privacy: 26 types
- Commercial secrets: 15 types
- State secrets: 18 types

Tasks and Distribution

Privacy Image Recognition: 3,202 samples
Privacy Question Detection: 14,184 samples
Privacy Information Flow Evaluation: 392 samples
Perceptual Leakage: 2,232 samples
Inferential Leakage: 2,682 samples
Memory Leakage: 3,798 samples
Non‑Sensitive Questions: 5,472 samples
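
The per-task counts above can be checked against the reported total with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the names `TASK_COUNTS`, `TASK_GROUPS`, `group_totals`, and `shares` are hypothetical, and the grouping of tasks under Privacy Awareness and Privacy Leakage is an assumption inferred from the Dataset Content descriptions rather than a mapping stated by the source.

```python
# Sample counts per task, copied from the "Tasks and Distribution" list.
TASK_COUNTS = {
    "Privacy Image Recognition": 3_202,
    "Privacy Question Detection": 14_184,
    "Privacy Information Flow Evaluation": 392,
    "Perceptual Leakage": 2_232,
    "Inferential Leakage": 2_682,
    "Memory Leakage": 3_798,
    "Non-Sensitive Questions": 5_472,
}

# Total reported under "Dataset Scale".
REPORTED_TOTAL = 31_962

# ASSUMPTION: this grouping is inferred from the task names and the
# Dataset Content section; the source does not state it explicitly.
TASK_GROUPS = {
    "Privacy Awareness": [
        "Privacy Image Recognition",
        "Privacy Question Detection",
        "Privacy Information Flow Evaluation",
    ],
    "Privacy Leakage": [
        "Perceptual Leakage",
        "Inferential Leakage",
        "Memory Leakage",
    ],
    "Non-Sensitive Questions": ["Non-Sensitive Questions"],
}


def group_totals(counts, groups):
    """Sum the per-task counts within each group."""
    return {
        group: sum(counts[task] for task in tasks)
        for group, tasks in groups.items()
    }


def shares(counts):
    """Return each entry's fraction of the overall total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    total = sum(TASK_COUNTS.values())
    print(f"Sum of task counts: {total:,} (reported: {REPORTED_TOTAL:,})")

    for name, fraction in shares(TASK_COUNTS).items():
        print(f"{name}: {TASK_COUNTS[name]:,} samples ({fraction:.1%})")

    for group, n in group_totals(TASK_COUNTS, TASK_GROUPS).items():
        print(f"[{group}] {n:,} samples")
```

The seven task counts add up to exactly 31,962, matching the reported total. Privacy Question Detection alone accounts for roughly 44% of the samples, so an unweighted average over samples would be dominated by that task; per-task reporting avoids this.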

Dataset Access

Google Drive: https://drive.google.com/file/d/1AF38j46PbDSIHSeruuxu4IwMswKH1wmX/view?usp=drive_link
Baidu Netdisk: https://pan.baidu.com/s/1UyvHVn6rasTO9dwK5-UGxQ?pwd=kuui
