Tags: Open Source Community, Sensitive Data Detection, Data Privacy Protection

Sensitive data detection dataset

This dataset for sensitive data detection was generated by the OpenLLaMa model and covers multiple sensitive data categories: health, politics, sexual orientation, judicial, religion, and ethnicity. Each entry contains a text and its sensitivity labels; each label specifies the category and the location of a piece of sensitive information.

Source: github
Created: Nov 24, 2023
Updated: Dec 2, 2023

Dataset Overview

Dataset Name

Sensitive data detection

Dataset Source

The dataset supports the paper "Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data", which has been submitted to the IRCDL 2024 Conference and is currently under review.

Generation Model

The dataset was generated by the OpenLLaMa model and is freely distributed under the Apache 2.0 license.

Format

The dataset consists of a series of entries, each containing a (text, label) pair, where:

  • text: the generated document content.
  • label: a list of labels, each specifying a sensitive data category and the start‑ and end‑character indices of the sensitive segment.
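
A minimal sketch of how an entry of this shape might be consumed. The in-memory layout is an assumption: the description only states that each label carries a category plus start and end character indices, so the tuple form `(category, start, end)`, the end-exclusive convention, the helper name `extract_spans`, and the sample text are all hypothetical and should be checked against the actual files.

```python
from typing import List, Tuple

# Hypothetical in-memory form of one label: (category, start, end).
# Assumption: `end` is exclusive, as in Python slicing; verify against the real data.
Label = Tuple[str, int, int]


def extract_spans(text: str, labels: List[Label]) -> List[Tuple[str, str]]:
    """Return (category, substring) pairs for every labelled segment."""
    spans = []
    for category, start, end in labels:
        # Reject indices that fall outside the text or are reversed.
        if not (0 <= start <= end <= len(text)):
            raise ValueError(f"{category}: indices {start}:{end} out of range")
        spans.append((category, text[start:end]))
    return spans


# Made-up example entry, not taken from the dataset.
text = "The patient was diagnosed with asthma in 2019."
start = text.index("asthma")
labels = [("[DATI_SALUTE]", start, start + len("asthma"))]

for category, segment in extract_spans(text, labels):
    print(category, repr(segment))
```

If the real files turn out to use inclusive end indices, the slice would need to become `text[start:end + 1]`.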

Label Categories

  • [DATI_SALUTE]: Health‑related information.
  • [DATI_POLITICA]: Political information.
  • [DATI_SESSUALITA]: Sexual information.
  • [DATI_GIUDIZIARI]: Judicial information.
  • [DATI_RELIGIONE]: Religious information.
  • [DATI_ETNIA]: Ethnicity‑related information.
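
The tags above can be kept in a small lookup table, for example to validate labels or to count how often each category occurs in an entry. This is only a sketch: whether the stored tags include the square brackets, and the `(category, start, end)` tuple layout, are assumptions, and `count_categories` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Tuple

# Tags as listed above, mapped to short English names.
# Assumption: tags are stored with the surrounding square brackets.
CATEGORIES = {
    "[DATI_SALUTE]": "health",
    "[DATI_POLITICA]": "political",
    "[DATI_SESSUALITA]": "sexual",
    "[DATI_GIUDIZIARI]": "judicial",
    "[DATI_RELIGIONE]": "religious",
    "[DATI_ETNIA]": "ethnicity",
}


def count_categories(labels: Iterable[Tuple[str, int, int]]) -> Counter:
    """Count labels per English category name, rejecting unknown tags."""
    counts: Counter = Counter()
    for category, _start, _end in labels:
        if category not in CATEGORIES:
            raise KeyError(f"unknown category tag: {category}")
        counts[CATEGORIES[category]] += 1
    return counts


# Made-up labels for illustration only.
example = [("[DATI_SALUTE]", 0, 5), ("[DATI_ETNIA]", 10, 20), ("[DATI_SALUTE]", 30, 42)]
print(count_categories(example))
```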

Document Types and Counts

| Data category | Doc title | Doc count | Avg char count |
|---|---|---|---|
| Mix (Sensitive) | Email | 233 | 1072.91 |
| | Newspaper article | 148 | 1004.84 |
| | Curriculum Vitae | 109 | 1241.66 |
| | TOT | 490 | |
| Health & Sexuality | Psychiatric report | 83 | 1638.61 |
| | Medical prescription | 71 | 1372.89 |
| | Medical records | 62 | 1476.06 |
| | Psychological evaluation | 53 | 1899.11 |
| | Certification of invalidity | 20 | 1419.95 |
| | Biopsy results | 21 | 1123.29 |
| | Eye test report | 15 | 1455.33 |
| | Surgery report | 15 | 1475.53 |
| | Blood tests | 10 | 1437.5 |
| | Certificate of civil union | 20 | 1116.95 |
| | TOT | 370 | |
| Judicial | Denunciation report | 26 | 1225.58 |
| | Police identikit | 25 | 1097.16 |
| | Criminal record | 24 | 1305.08 |
| | Arrest report | 19 | 1234.16 |
| | Notice of investigation | 24 | 1806.88 |
| | Criminal judgement | 22 | 1403.82 |
| | Notice of conclusion of preliminary investigations | 19 | 1610.89 |
| | Certificate of pending charges | 14 | 1268.14 |
| | Precautionary measures | 18 | 1281.28 |
| | TOT | 191 | |
| Politic | Political endorsement | 38 | 735.39 |
| | Union card | 30 | 1066.77 |
| | Party card | 28 | 1067.14 |
| | TOT | 96 | |
| Philosophical | Philosophical endorsement | 68 | 1028.49 |
| | Baptismal certificate | 32 | 783.47 |
| | Certificate of participation to religious group | 32 | 682.59 |
| | TOT | 132 | |
| Ethnic | DNA analysis report | 37 | 1804.86 |
| | Ancestry analysis report | 31 | 1824.16 |
| | Birth certificate | 36 | 435.61 |
| | Genealogical tree report | 30 | 1444.67 |
| | TOT | 134 | |
| Other (Non sensitive) | Scientific paper | 106 | 3038.55 |
| | Advertising flyer for event | 92 | 992.42 |
| | Scientific publications report | 88 | 2576.69 |
| | Marriage certificate | 66 | 829.39 |
| | Advertising flyer | 18 | 1035.83 |
| | Company invoice | 17 | 1203.71 |
| | Services and products catalogue | 15 | 2122.2 |
| | Financial report by corporate | 14 | 2215.5 |
| | Commercial report | 13 | 2128.92 |
| | City travel guide | 12 | 1598.75 |
| | Tax declaration | 12 | 1476.58 |
| | Cooking recipe | 11 | 1168.64 |
| | Corporate memo | 9 | 1277.67 |
| | Company balance sheet | 8 | 1443.12 |
| | Book review | 7 | 767 |
| | Wikipedia extract | 150 | 2430.3 |
| | TOT | 638 | |
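
As a quick sanity check, the TOT rows above can be combined to see how the documents split between sensitive and non-sensitive categories. The figures are copied from the table; the dictionary and variable names are illustrative only.

```python
# Per-category document totals, copied from the TOT rows of the table above.
totals = {
    "Mix (Sensitive)": 490,
    "Health & Sexuality": 370,
    "Judicial": 191,
    "Politic": 96,
    "Philosophical": 132,
    "Ethnic": 134,
    "Other (Non sensitive)": 638,
}

NON_SENSITIVE = "Other (Non sensitive)"

# Documents in sensitive categories versus the whole collection.
sensitive = sum(n for name, n in totals.items() if name != NON_SENSITIVE)
total = sum(totals.values())

print(f"sensitive documents:     {sensitive}")
print(f"non-sensitive documents: {totals[NON_SENSITIVE]}")
print(f"all documents:           {total}")
print(f"sensitive share:         {sensitive / total:.1%}")
```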
