
Sensitive data detection dataset

This dataset supports sensitive data detection. It was generated by the OpenLLaMa model and covers several categories of sensitive data: health, politics, sexual orientation, judicial matters, religion, and ethnicity. Each entry contains a text and its sensitivity labels; each label specifies the category and location of a sensitive segment.

Updated 12/2/2023
github

Description

Dataset Overview

Dataset Name

Sensitive data detection

Dataset Source

The dataset supports the paper "Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data", which has been submitted to the IRCDL 2024 conference for review.

Generation Model

The dataset was generated by the OpenLLaMa model and is freely distributed under the Apache 2.0 license.

Format

The dataset consists of a series of entries, each containing a (text, label) pair, where:

  • text: the generated document content.
  • label: a list of labels, each specifying a sensitive data category and the start‑ and end‑character indices of the sensitive segment.
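As an illustration of the (text, label) structure described above, the following is a minimal sketch of recovering the labeled substrings from one entry. It assumes each label is a (category, start, end) tuple with an exclusive end index; the page does not specify the on-disk format or whether the end index is inclusive, and the helper name `extract_spans` and the sample entry are hypothetical.

```python
from typing import List, Tuple

# Assumed in-memory shape of one label: (category, start, end), end exclusive.
Label = Tuple[str, int, int]


def extract_spans(text: str, labels: List[Label]) -> List[Tuple[str, str]]:
    """Return (category, substring) pairs for every labeled segment."""
    spans = []
    for category, start, end in labels:
        # Reject offsets that fall outside the text or are inverted.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"invalid span ({start}, {end}) for {category}")
        spans.append((category, text[start:end]))
    return spans


# Hypothetical entry written for illustration only; not taken from the dataset.
sample_text = "The patient was diagnosed with asthma last year."
_start = sample_text.index("asthma")
sample_labels = [("DATI_SALUTE", _start, _start + len("asthma"))]

spans = extract_spans(sample_text, sample_labels)
```

If the real files turn out to use inclusive end indices, the slice would become `text[start:end + 1]`; checking a handful of entries by eye should settle which convention applies.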

Label Categories

  • [DATI_SALUTE]: Health‑related information.
  • [DATI_POLITICA]: Political information.
  • [DATI_SESSUALITA]: Sexuality‑related information, such as sexual orientation.
  • [DATI_GIUDIZIARI]: Judicial information.
  • [DATI_RELIGIONE]: Religious information.
  • [DATI_ETNIA]: Ethnicity‑related information.
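The six tags above can be kept in a small lookup table so that unknown tags are caught early. The sketch below counts labeled segments per category for one entry. It assumes labels arrive as (tag, start, end) tuples and that tags may or may not include the surrounding brackets; both points are assumptions, and `normalize_tag` and `count_categories` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, Tuple

# English glosses for the six tags listed above (keys stored without brackets).
CATEGORY_DESCRIPTIONS = {
    "DATI_SALUTE": "health",
    "DATI_POLITICA": "politics",
    "DATI_SESSUALITA": "sexuality",
    "DATI_GIUDIZIARI": "judicial",
    "DATI_RELIGIONE": "religion",
    "DATI_ETNIA": "ethnicity",
}


def normalize_tag(tag: str) -> str:
    """Strip whitespace and optional brackets: "[DATI_SALUTE]" -> "DATI_SALUTE"."""
    return tag.strip().strip("[]")


def count_categories(labels: Iterable[Tuple[str, int, int]]) -> Counter:
    """Count labeled segments per category, rejecting unknown tags."""
    counts: Counter = Counter()
    for tag, _start, _end in labels:
        name = normalize_tag(tag)
        if name not in CATEGORY_DESCRIPTIONS:
            raise KeyError(f"unknown category: {tag}")
        counts[name] += 1
    return counts


# Hypothetical labels for a single entry, for illustration only.
example_counts = count_categories(
    [("[DATI_SALUTE]", 0, 5), ("DATI_SALUTE", 10, 12), ("[DATI_ETNIA]", 20, 30)]
)
```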

Document Types and Counts

| Data category | Doc title | Doc count | Avg char count |
|---|---|---|---|
| Mix (Sensitive) | Email | 233 | 1072.91 |
| | Newspaper article | 148 | 1004.84 |
| | Curriculum Vitae | 109 | 1241.66 |
| | TOT | 490 | |
| Health & Sexuality | Psychiatric report | 83 | 1638.61 |
| | Medical prescription | 71 | 1372.89 |
| | Medical records | 62 | 1476.06 |
| | Psychological evaluation | 53 | 1899.11 |
| | Certification of invalidity | 20 | 1419.95 |
| | Biopsy results | 21 | 1123.29 |
| | Eye test report | 15 | 1455.33 |
| | Surgery report | 15 | 1475.53 |
| | Blood tests | 10 | 1437.5 |
| | Certificate of civil union | 20 | 1116.95 |
| | TOT | 370 | |
| Judicial | Denunciation report | 26 | 1225.58 |
| | Police identikit | 25 | 1097.16 |
| | Criminal record | 24 | 1305.08 |
| | Arrest report | 19 | 1234.16 |
| | Notice of investigation | 24 | 1806.88 |
| | Criminal judgement | 22 | 1403.82 |
| | Notice of conclusion of preliminary investigations | 19 | 1610.89 |
| | Certificate of pending charges | 14 | 1268.14 |
| | Precautionary measures | 18 | 1281.28 |
| | TOT | 191 | |
| Politic | Political endorsement | 38 | 735.39 |
| | Union card | 30 | 1066.77 |
| | Party card | 28 | 1067.14 |
| | TOT | 96 | |
| Philosophical | Philosophical endorsement | 68 | 1028.49 |
| | Baptismal certificate | 32 | 783.47 |
| | Certificate of participation to religious group | 32 | 682.59 |
| | TOT | 132 | |
| Ethnic | DNA analysis report | 37 | 1804.86 |
| | Ancestry analysis report | 31 | 1824.16 |
| | Birth certificate | 36 | 435.61 |
| | Genealogical tree report | 30 | 1444.67 |
| | TOT | 134 | |
| Other (Non sensitive) | Scientific paper | 106 | 3038.55 |
| | Advertising flyer for event | 92 | 992.42 |
| | Scientific publications report | 88 | 2576.69 |
| | Marriage certificate | 66 | 829.39 |
| | Advertising flyer | 18 | 1035.83 |
| | Company invoice | 17 | 1203.71 |
| | Services and products catalogue | 15 | 2122.2 |
| | Financial report by corporate | 14 | 2215.5 |
| | Commercial report | 13 | 2128.92 |
| | City travel guide | 12 | 1598.75 |
| | Tax declaration | 12 | 1476.58 |
| | Cooking recipe | 11 | 1168.64 |
| | Corporate memo | 9 | 1277.67 |
| | Company balance sheet | 8 | 1443.12 |
| | Book review | 7 | 767 |
| | Wikipedia extract | 150 | 2430.3 |
| | TOT | 638 | |
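
The per-category totals above can be combined to gauge class balance. The sketch below sums the TOT rows and computes a document-weighted mean character count for one category. It assumes the table lists every document in the dataset, which the page does not state explicitly; the variable and function names are illustrative.

```python
# Per-category document totals, copied from the TOT rows of the table.
SENSITIVE_TOTALS = {
    "Mix (Sensitive)": 490,
    "Health & Sexuality": 370,
    "Judicial": 191,
    "Politic": 96,
    "Philosophical": 132,
    "Ethnic": 134,
}
NON_SENSITIVE_TOTAL = 638  # "Other (Non sensitive)"

sensitive_docs = sum(SENSITIVE_TOTALS.values())
all_docs = sensitive_docs + NON_SENSITIVE_TOTAL
sensitive_share = sensitive_docs / all_docs


def weighted_mean_chars(rows):
    """Document-weighted mean character count for (doc_count, avg_chars) rows."""
    total_docs = sum(count for count, _ in rows)
    return sum(count * avg for count, avg in rows) / total_docs


# Rows of the "Politic" category: (doc count, avg char count).
politic_rows = [(38, 735.39), (30, 1066.77), (28, 1067.14)]
politic_mean = weighted_mean_chars(politic_rows)
```

Under that assumption, the table covers 1413 sensitive and 638 non-sensitive documents, so roughly 69% of the 2051 documents carry at least one sensitive category.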


Topics

Sensitive Data Detection
Data Privacy Protection

Source

Organization: github

Created: 11/24/2023
