nvidia/Aegis-AI-Content-Safety-Dataset-1.0
Aegis AI Content Safety Dataset 1.0 is an open‑source content safety dataset (CC‑BY‑4.0) that follows Nvidia's content safety taxonomy, covering 13 key risk categories. It includes approximately 11,000 human‑annotated interaction records between humans and LLMs, split into 10,798 training samples and 1,199 test samples. The data originate from Anthropic HH‑RLHF and Mistral‑7B‑v0.1, annotated by 12 annotators and 2 data quality assurance personnel. The dataset is intended for building content‑moderation safeguards and aligning LLMs to generate safe responses, but is not suitable for training dialogue agents. Its creation involved strict QA and annotator training to ensure diversity and accuracy.
Description
🛡️ Aegis AI Content Safety Dataset 1.0
Dataset Details
Dataset Description
Aegis AI Content Safety Dataset is an open‑source content safety dataset (CC‑BY‑4.0) containing about 11,000 human‑annotated human‑LLM interaction records, split into 10,798 training samples and 1,199 test samples. The dataset follows Nvidia's content safety taxonomy, covering 13 key risk categories.
The dataset was built using the Hugging Face version of Anthropic HH‑RLHF for harmlessness human‑preference data, extracting only prompts, and generating responses with Mistral‑7B‑v0.1. Four formats are provided: user‑only prompt, system‑prompt + user prompt, single‑turn user prompt + Mistral response, and multi‑turn user prompt + Mistral response.
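For readers who want to explore the splits and record formats described above, a minimal sketch using the Hugging Face `datasets` library is shown below. The dataset ID comes from this card's title; the column names are illustrative, since the exact schema of the released version may differ.

```python
# Minimal sketch: load the dataset and inspect one record.
# Access may require logging in to Hugging Face and accepting the dataset terms.
from datasets import load_dataset

ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0")

print(ds)             # expected splits: train (~10,798 samples) and test (~1,199 samples)
sample = ds["train"][0]
print(sample.keys())  # column names (prompt/response text, interaction format, labels) may vary
```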
Samples are annotated according to the following taxonomy:
| Nvidia Content Safety Taxonomy |
|---|
| Hate/Identity Hate |
| Sexual Content |
| Violence |
| Suicide and Self‑Harm |
| Threats |
| Sexual (Minor) |
| Guns/Illegal Weapons |
| Regulated/Controlled Substances |
| Criminal Planning/Confession |
| PII |
| Harassment |
| Profanity |
| Other |
| Needs Caution |
- Curators: Shaona Ghosh, Nvidia
- Paper: AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Language: English (may contain a small number of samples in other languages)
- License: CC‑BY‑4.0
Dataset Source
- Used the Hugging Face version of Anthropic HH‑RLHF for harmlessness human‑preference data.
- Extracted only prompts and generated responses with Mistral‑7B‑v0.1.
- Annotation was performed by twelve annotators and two data quality assurance personnel from Nvidia's data team.
Intended Uses
Direct Uses
- Build content‑moderation safeguards around LLMs.
- Use SFT or DPO methods to make LLMs generate safe responses.
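As a rough illustration of the second use above, the sketch below converts annotated prompt/response records into chat-format SFT examples, keeping only responses labeled safe. The field names (`prompt`, `response`, `label`) are assumptions for illustration and may not match the released schema.

```python
# Illustrative sketch only: field names are hypothetical, not the released schema.
def to_sft_example(record):
    """Wrap one annotated interaction as a chat-format SFT example."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

def build_sft_corpus(records):
    # Keep only interactions whose response was annotated as safe, so the
    # fine-tuned model is shown acceptable completions only.
    return [to_sft_example(r) for r in records if r.get("label") == "safe"]
```

A DPO-style variant would instead pair a safe response with an unsafe one for the same prompt, which requires first grouping records by prompt.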
Out‑of‑Scope Uses
The data contain potentially offensive or unsettling content, including discriminatory language and discussions of abuse, violence, self‑harm, exploitation, and other disturbing topics. Users should interact with the data only according to their personal risk tolerance. The dataset is intended for research purposes, especially research that can reduce model harms. The viewpoints expressed do not represent Nvidia or its employees. The data are not suitable for training dialogue agents, as this could lead to harmful model behavior.
Dataset Creation
Annotation Process
Quality assurance (QA) was maintained by the project leads. Two to three times per week, a lead randomly selected 15 out of every 100 questions and had three annotators re-evaluate them. Roughly 15% of the data was re-reviewed in this way, and typically at least 20%–30% of the data received some additional review to ensure quality. Corrections were sent back to each annotator with brief explanations of why certain changes were made according to the project guidelines. The data were typically divided into batches of 2,000–4,000 text-based prompts and distributed to annotators in 3–5 batches. Between batches, the lead (or a designated substitute) gave annotators a full day to self-correct their classification work. Numerous training sessions were held throughout the project to share best practices for self-correction, including keyword filtering, consulting a regularly updated FAQ table with example questions, and randomly re-evaluating completed questions. Annotators were instructed to self-correct only their own work and to avoid viewing any other annotator's labels. Two leads were always on standby to ensure a consistent understanding of the material. Mandatory virtual group trainings were held every two weeks or as needed, led by a project lead and often using examples of common divergences as learning opportunities.
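The 15-in-100 spot check described above amounts to simple random sampling within each batch. A minimal sketch of that selection step is shown below; the batch size and seed are hypothetical.

```python
import random

def select_qa_sample(question_ids, per_hundred=15, seed=None):
    """Randomly pick roughly 15 of every 100 questions for re-annotation by three annotators."""
    rng = random.Random(seed)
    k = max(1, round(len(question_ids) * per_hundred / 100))
    return rng.sample(question_ids, k)

# Example: a 2,000-prompt batch yields a 300-question QA subset (15%).
qa_subset = select_qa_sample(list(range(2000)), seed=0)
print(len(qa_subset))  # 300
```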
Who Were the Annotators?
During the three‑month content‑moderation safeguard project, we averaged 12 annotators. Four had engineering backgrounds, specializing in data analysis, collection, gaming, and robotics. Eight had creative‑writing backgrounds, specializing in linguistics, R&D, and other creative arts such as photography and film. All annotators underwent extensive training and were capable of using large language models (LLMs) as well as other generative AI such as image retrieval or multi‑turn dialogue evaluation. All 12 annotators resided in the United States and represented diverse racial, religious, age, and socioeconomic backgrounds.
Personal and Sensitive Information
The dataset contains LLM responses generated by Mistral‑7B‑v0.1. We have carefully inspected the data and removed any content that might contain personal information. However, it is possible that some personal information remains. If you discover any content you believe should not be public, please notify us immediately.
Bias, Risks, and Limitations
- Safety and Review: The dataset is intended for building content‑moderation systems or aligning LLMs to generate safe responses. Due to the nature of the work, the dataset contains critical unsafe content and annotations for that content. Extreme caution should be exercised when referencing and using this dataset.
- Legal Compliance: Users are responsible for ensuring appropriate use. The dataset should not be used in ways that conflict with legal and ethical standards.
- Non‑Identifiability: Users agree not to attempt to identify individuals present in the dataset.
Ethical Statement
The creation of the Aegis AI Content Safety Dataset followed ethical data classification practices, using the open‑source annotation tool Label Studio, commonly employed in Nvidia internal projects. The tool’s design allows large amounts of data to be analyzed by a single annotator without seeing a peer’s work, mitigating annotator bias and providing each individual with varied prompts to avoid repeated task patterns.
Given the serious nature of the project, annotators were required to volunteer, based on their skill level, availability, and willingness to be exposed to potentially toxic content. Prior to beginning work, all participants signed an “Adult Content Confirmation” consistent with the organization’s existing anti‑harassment policies and code of conduct. This ensured annotators understood the nature of the work and the resources available should the material affect their mental health. Project leads held regular one‑on‑one meetings with each annotator to confirm ongoing comfort with the material and continued participation.
Citation
BibTeX:
@article{ghosh2024aegis,
  title={AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts},
  author={Ghosh, Shaona and Varshney, Prasoon and Galinkin, Erick and Parisien, Christopher},
  journal={arXiv preprint arXiv:2404.05993},
  year={2024}
}
Dataset Card Authors
Shaona Ghosh, shaonag@nvidia.com
Dataset Card Contact
shaonag@nvidia.com