Sensitive data detection dataset
This dataset supports sensitive data detection. It was generated with the OpenLLaMA model and covers several categories of sensitive data: health, politics, sexual orientation, judicial matters, religion, and ethnicity. Each entry contains a text and its sensitivity labels; each label specifies the category and the location of a sensitive segment.
Description
Dataset Overview
Dataset Name
Sensitive data detection
Dataset Source
The dataset accompanies the paper "Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data", which is currently under review at the IRCDL 2024 conference.
Generation Model
The dataset was generated by the OpenLLaMA model [1] and is freely distributed under the Apache 2.0 license.
Format
The dataset consists of a series of entries, each containing a (text, label) pair, where:
- text: the generated document content.
- label: a list of labels, each specifying a sensitive data category and the start and end character indices of the sensitive segment.
Label Categories
- [DATI_SALUTE]: Health‑related information.
- [DATI_POLITICA]: Political information.
- [DATI_SESSUALITA]: Sexual information.
- [DATI_GIUDIZIARI]: Judicial information.
- [DATI_RELIGIONE]: Religious information.
- [DATI_ETNIA]: Ethnicity‑related information.
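The (text, label) format described above can be consumed along the following lines. This is a minimal sketch, not the dataset's official loader: the `Span` type, its field names, the end-exclusive index convention, and the example sentence are all assumptions made for illustration, since this card does not specify how labels are serialized. Only the six category tags come from the list above.

```python
from typing import List, NamedTuple, Tuple

# Category tags as listed in this dataset card.
CATEGORIES = {
    "DATI_SALUTE", "DATI_POLITICA", "DATI_SESSUALITA",
    "DATI_GIUDIZIARI", "DATI_RELIGIONE", "DATI_ETNIA",
}


class Span(NamedTuple):
    """Hypothetical in-memory form of one label (not the on-disk format)."""
    category: str
    start: int  # index of the first character of the segment
    end: int    # assumed exclusive, as in Python slicing


def extract_spans(text: str, labels: List[Span]) -> List[Tuple[str, str]]:
    """Return (category, substring) pairs after validating each label."""
    result = []
    for span in labels:
        if span.category not in CATEGORIES:
            raise ValueError(f"unknown category: {span.category}")
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError(f"span out of range: {span}")
        result.append((span.category, text[span.start:span.end]))
    return result


# Made-up entry for illustration; it is not taken from the dataset.
text = "The patient was diagnosed with type 2 diabetes."
segment = "type 2 diabetes"
start = text.find(segment)
labels = [Span("DATI_SALUTE", start, start + len(segment))]

for category, snippet in extract_spans(text, labels):
    print(category, "->", snippet)
```

If the released files use inclusive end indices or a different field order, only the `Span` construction and the slice in `extract_spans` would need to change.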
Document Types and Counts
| Data category | Document title | Document count | Avg. character count |
|---|---|---|---|
| Mix (Sensitive) | | 233 | 1072.91 |
| | Newspaper article | 148 | 1004.84 |
| | Curriculum Vitae | 109 | 1241.66 |
| | Total | 490 | |
| Health & Sexuality | Psychiatric report | 83 | 1638.61 |
| | Medical prescription | 71 | 1372.89 |
| | Medical records | 62 | 1476.06 |
| | Psychological evaluation | 53 | 1899.11 |
| | Certification of invalidity | 20 | 1419.95 |
| | Biopsy results | 21 | 1123.29 |
| | Eye test report | 15 | 1455.33 |
| | Surgery report | 15 | 1475.53 |
| | Blood tests | 10 | 1437.5 |
| | Certificate of civil union | 20 | 1116.95 |
| | Total | 370 | |
| Judicial | Denunciation report | 26 | 1225.58 |
| | Police identikit | 25 | 1097.16 |
| | Criminal record | 24 | 1305.08 |
| | Arrest report | 19 | 1234.16 |
| | Notice of investigation | 24 | 1806.88 |
| | Criminal judgement | 22 | 1403.82 |
| | Notice of conclusion of preliminary investigations | 19 | 1610.89 |
| | Certificate of pending charges | 14 | 1268.14 |
| | Precautionary measures | 18 | 1281.28 |
| | Total | 191 | |
| Politic | Political endorsement | 38 | 735.39 |
| | Union card | 30 | 1066.77 |
| | Party card | 28 | 1067.14 |
| | Total | 96 | |
| Philosophical | Philosophical endorsement | 68 | 1028.49 |
| | Baptismal certificate | 32 | 783.47 |
| | Certificate of participation to religious group | 32 | 682.59 |
| | Total | 132 | |
| Ethnic | DNA analysis report | 37 | 1804.86 |
| | Ancestry analysis report | 31 | 1824.16 |
| | Birth certificate | 36 | 435.61 |
| | Genealogical tree report | 30 | 1444.67 |
| | Total | 134 | |
| Other (Non sensitive) | Scientific paper | 106 | 3038.55 |
| | Advertising flyer for event | 92 | 992.42 |
| | Scientific publications report | 88 | 2576.69 |
| | Marriage certificate | 66 | 829.39 |
| | Advertising flyer | 18 | 1035.83 |
| | Company invoice | 17 | 1203.71 |
| | Services and products catalogue | 15 | 2122.2 |
| | Financial report by corporate | 14 | 2215.5 |
| | Commercial report | 13 | 2128.92 |
| | City travel guide | 12 | 1598.75 |
| | Tax declaration | 12 | 1476.58 |
| | Cooking recipe | 11 | 1168.64 |
| | Corporate memo | 9 | 1277.67 |
| | Company balance sheet | 8 | 1443.12 |
| | Book review | 7 | 767 |
| | Wikipedia extract | 150 | 2430.3 |
| | Total | 638 | |
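To get a sense of class balance, the per-category total rows of the table can be summed as in the short sketch below. The counts are copied from the table; treating every category except "Other (Non sensitive)" as sensitive is an assumption read off the category names, not something this card states explicitly.

```python
# Per-category document totals copied from the total rows of the table.
totals = {
    "Mix (Sensitive)": 490,
    "Health & Sexuality": 370,
    "Judicial": 191,
    "Politic": 96,
    "Philosophical": 132,
    "Ethnic": 134,
    "Other (Non sensitive)": 638,
}

grand_total = sum(totals.values())

# Assumption: only the "Other (Non sensitive)" category is non-sensitive.
non_sensitive = totals["Other (Non sensitive)"]
sensitive = grand_total - non_sensitive

# Print each category with its count and its share of all documents.
for name, count in totals.items():
    print(f"{name:<24}{count:>6}{count / grand_total:>8.1%}")

print(f"{'All documents':<24}{grand_total:>6}")
print(f"{'Sensitive':<24}{sensitive:>6}{sensitive / grand_total:>8.1%}")
print(f"{'Non sensitive':<24}{non_sensitive:>6}{non_sensitive / grand_total:>8.1%}")
```

Under that assumption the table describes 2051 documents, of which 1413 fall in sensitive categories and 638 do not.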
References
- [1] Geng, Xinyang and Liu, Hao, OpenLLaMA: An Open Reproduction of LLaMA, https://github.com/openlm-research/open_llama
Source
Organization: github
Created: 11/24/2023