Tags: Open Source Community, Privacy Protection, Vision-Language Models

Multi-P2A

Multi-P2A is a benchmark dataset created by the Institute of Computing Technology, Chinese Academy of Sciences, to evaluate the privacy protection capabilities of large vision-language models (LVLMs). It covers 26 categories of personal privacy, 15 categories of commercial secrets, and 18 categories of state secrets, for a total of 31,962 samples. Samples are built from existing datasets and social media platforms and are framed as visual question answering (VQA) tasks, with the aim of keeping them high in quality and diverse. The main application is privacy risk assessment: the benchmark helps developers and researchers identify and mitigate potential privacy leaks in LVLMs during training and inference.

Source: arXiv
Created: Dec 27, 2024
Updated: Dec 27, 2024
Multi-P2A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

Dataset Overview

Multi-P2A is a comprehensive benchmark for assessing privacy protection capabilities of large vision‑language models (LVLMs).

Dataset Content

  • Privacy Awareness: Evaluates the model's ability to recognize privacy-sensitive inputs, including images, queries, and risky flows of private information across various scenarios.
  • Privacy Leakage: Assesses the risk that private information leaks through model outputs, in three categories: (1) extracting private information from images, (2) inferring private information from images, (3) leaking sensitive information from training data.

Dataset Scale

  • Total Samples: 31,962
  • Privacy Categories:
    • Personal privacy: 26 types
    • Commercial secrets: 15 types
    • State secrets: 18 types

Tasks and Distribution

  • Privacy Image Recognition: 3,202 samples
  • Privacy Question Detection: 14,184 samples
  • Privacy Information Flow Evaluation: 392 samples
  • Perceptual Leakage: 2,232 samples
  • Inferential Leakage: 2,682 samples
  • Memory Leakage: 3,798 samples
  • Non‑Sensitive Questions: 5,472 samples
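
The counts listed above can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: the names `TASK_COUNTS`, `CATEGORY_COUNTS`, `REPORTED_TOTAL`, and `task_shares` are hypothetical and introduced only for illustration, while the numbers are copied from this page.

```python
# Sanity check of the figures quoted on this page.
# All identifiers here are illustrative; they do not come from the dataset itself.

# Per-task sample counts, as listed under "Tasks and Distribution".
TASK_COUNTS = {
    "Privacy Image Recognition": 3202,
    "Privacy Question Detection": 14184,
    "Privacy Information Flow Evaluation": 392,
    "Perceptual Leakage": 2232,
    "Inferential Leakage": 2682,
    "Memory Leakage": 3798,
    "Non-Sensitive Questions": 5472,
}

# Number of privacy category types, as listed under "Dataset Scale".
CATEGORY_COUNTS = {
    "Personal privacy": 26,
    "Commercial secrets": 15,
    "State secrets": 18,
}

# Total sample count reported for the dataset.
REPORTED_TOTAL = 31962


def task_shares(counts):
    """Return each task's fraction of the summed sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def print_report():
    """Print the tasks from largest to smallest with their share of the total."""
    shares = task_shares(TASK_COUNTS)
    for name in sorted(TASK_COUNTS, key=TASK_COUNTS.get, reverse=True):
        print(f"{name:<38}{TASK_COUNTS[name]:>8,}{shares[name]:>8.1%}")
    print(f"{'Sum of tasks':<38}{sum(TASK_COUNTS.values()):>8,}")
    print(f"{'Category types':<38}{sum(CATEGORY_COUNTS.values()):>8}")


if __name__ == "__main__":
    # The per-task counts should add up to the reported total.
    assert sum(TASK_COUNTS.values()) == REPORTED_TOTAL
    print_report()
```

The seven task counts sum exactly to the reported 31,962 samples, and Privacy Question Detection alone accounts for a little over 44% of them, so aggregate scores that pool all samples will be weighted heavily toward that task unless results are reported per task.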
