nvidia/Aegis-AI-Content-Safety-Dataset-1.0
Aegis AI Content Safety Dataset 1.0 is an open‑source content safety dataset (CC‑BY‑4.0) that follows Nvidia's content safety taxonomy, covering 13 key risk categories. It includes approximately 11,000 human‑annotated interaction records between humans and LLMs, split into 10,798 training samples and 1,199 test samples. The data originate from Anthropic HH‑RLHF and Mistral‑7B‑v0.1, annotated by 12 annotators and 2 data quality assurance personnel. The dataset is intended for building content‑moderation safeguards and aligning LLMs to generate safe responses, but is not suitable for training dialogue agents. Its creation involved strict QA and annotator training to ensure diversity and accuracy.
Description
🛡️ Aegis AI Content Safety Dataset 1.0
Dataset Details
Dataset Description
Aegis AI Content Safety Dataset is an open‑source content safety dataset (CC‑BY‑4.0) containing about 11,000 human‑annotated human‑LLM interaction records, split into 10,798 training samples and 1,199 test samples. The dataset follows Nvidia's content safety taxonomy, covering 13 key risk categories.
The dataset was built using the Hugging Face version of Anthropic HH‑RLHF for harmlessness human‑preference data, extracting only prompts, and generating responses with Mistral‑7B‑v0.1. Four formats are provided: user‑only prompt, system‑prompt + user prompt, single‑turn user prompt + Mistral response, and multi‑turn user prompt + Mistral response.
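For readers who want to explore the splits and record formats described above, a minimal sketch using the Hugging Face `datasets` library is shown below. The dataset ID comes from this card's title; the column names are illustrative, since the exact schema of the released version may differ.

```python
# Minimal sketch: load the dataset and inspect one record.
# Access may require logging in to Hugging Face and accepting the dataset terms.
from datasets import load_dataset

ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0")

print(ds)             # expected splits: train (~10,798 samples) and test (~1,199 samples)
sample = ds["train"][0]
print(sample.keys())  # column names (prompt/response text, interaction format, labels) may vary
```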
Samples are annotated according to the following taxonomy:
| Nvidia Content Safety Taxonomy |
|---|
| Hate/Identity Hate |
| Sexual Content |
| Violence |
| Suicide and Self‑Harm |
| Threats |
| Sexual (Minor) |
| Guns/Illegal Weapons |
| Regulated/Controlled Substances |
| Criminal Planning/Confession |
| PII |
| Harassment |
| Profanity |
| Other |
| Needs Caution |
- Curators: Shaona Ghosh, Nvidia
- Paper: AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Language: English (may contain a small number of samples in other languages)
- License: CC‑BY‑4.0
Dataset Source
- Used the Hugging Face version of Anthropic HH‑RLHF for harmlessness human‑preference data.
- Extracted only prompts and generated responses with Mistral‑7B‑v0.1.
- Annotation was performed by twelve annotators and two data quality assurance personnel from Nvidia's data team.
Intended Uses
Direct Uses
- Build content‑moderation safeguards around LLMs.
- Use SFT or DPO methods to make LLMs generate safe responses.
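As a rough illustration of the second use above, the sketch below converts annotated prompt/response records into chat-format SFT examples, keeping only responses labeled safe. The field names (`prompt`, `response`, `label`) are assumptions for illustration and may not match the released schema.

```python
# Illustrative sketch only: field names are hypothetical, not the released schema.
def to_sft_example(record):
    """Wrap one annotated interaction as a chat-format SFT example."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

def build_sft_corpus(records):
    # Keep only interactions whose response was annotated as safe, so the
    # fine-tuned model is shown acceptable completions only.
    return [to_sft_example(r) for r in records if r.get("label") == "safe"]
```

A DPO-style variant would instead pair a safe response with an unsafe one for the same prompt, which requires first grouping records by prompt.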
Out‑of‑Scope Uses
The data contain potentially offensive or unsettling content, including discriminatory language and discussions of abuse, violence, self‑harm, exploitation, and other disturbing topics. Users should interact with the data only according to their personal risk tolerance. The dataset is intended for research purposes, especially research that can reduce model harms. The viewpoints expressed do not represent Nvidia or its employees. The data are not suitable for training dialogue agents, as this could lead to harmful model behavior.
Dataset Creation
Annotation Process
Quality assurance (QA) was maintained by the project leads. Two to three times per week, a lead randomly selected 15 out of every 100 questions and had three annotators re-evaluate them. Roughly 15% of the data was re-reviewed in this way, and typically at least 20%–30% of the data received some additional review to ensure quality. Corrections were sent back to each annotator with brief explanations of why certain changes were made according to the project guidelines. The data were typically divided into batches of 2,000–4,000 text-based prompts and distributed to annotators in 3–5 batches. Between batches, the lead (or a designated substitute) gave annotators a full day to self-correct their classification work. Numerous training sessions were held throughout the project to share best practices for self-correction, including keyword filtering, consulting a regularly updated FAQ table with example questions, and randomly re-evaluating completed questions. Annotators were instructed to self-correct only their own work and to avoid viewing any other annotator's labels. Two leads were always on standby to ensure a consistent understanding of the material. Mandatory virtual group trainings were held every two weeks or as needed, led by a project lead and often using examples of common divergences as learning opportunities.
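The 15-in-100 spot check described above amounts to simple random sampling within each batch. A minimal sketch of that selection step is shown below; the batch size and seed are hypothetical.

```python
import random

def select_qa_sample(question_ids, per_hundred=15, seed=None):
    """Randomly pick roughly 15 of every 100 questions for re-annotation by three annotators."""
    rng = random.Random(seed)
    k = max(1, round(len(question_ids) * per_hundred / 100))
    return rng.sample(question_ids, k)

# Example: a 2,000-prompt batch yields a 300-question QA subset (15%).
qa_subset = select_qa_sample(list(range(2000)), seed=0)
print(len(qa_subset))  # 300
```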
Who Were the Annotators?
During the three‑month content‑moderation safeguard project, we averaged 12 annotators. Four had engineering backgrounds, specializing in data analysis, collection, gaming, and robotics. Eight had creative‑writing backgrounds, specializing in linguistics, R&D, and other creative arts such as photography and film. All annotators underwent extensive training and were capable of using large language models (LLMs) as well as other generative AI such as image retrieval or multi‑turn dialogue evaluation. All 12 annotators resided in the United States and represented diverse racial, religious, age, and socioeconomic backgrounds.
Personal and Sensitive Information
The dataset contains LLM responses generated by Mistral‑7B‑v0.1. We have carefully inspected the data and removed any content that might contain personal information. However, it is possible that some personal information remains. If you discover any content you believe should not be public, please notify us immediately.
Bias, Risks, and Limitations
- Safety and Review: The dataset is intended for building content‑moderation systems or aligning LLMs to generate safe responses. Due to the nature of the work, the dataset contains critical unsafe content and annotations for that content. Extreme caution should be exercised when referencing and using this dataset.
- Legal Compliance: Users are responsible for ensuring appropriate use. The dataset should not be used in ways that conflict with legal and ethical standards.
- Non‑Identifiability: Users agree not to attempt to identify individuals present in the dataset.
Ethical Statement
The creation of the Aegis AI Content Safety Dataset followed ethical data classification practices, using the open‑source annotation tool Label Studio, commonly employed in Nvidia internal projects. The tool’s design allows large amounts of data to be analyzed by a single annotator without seeing a peer’s work, mitigating annotator bias and providing each individual with varied prompts to avoid repeated task patterns.
Given the serious nature of the project, annotators were required to volunteer, based on their skill level, availability, and willingness to be exposed to potentially toxic content. Prior to beginning work, all participants signed an “Adult Content Confirmation” consistent with the organization’s existing anti‑harassment policies and code of conduct. This ensured annotators understood the nature of the work and the resources available should the material affect their mental health. Project leads held regular one‑on‑one meetings with each annotator to confirm ongoing comfort with the material and continued participation.
Citation
BibTeX:
@article{ghosh2024aegis,
  title={AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts},
  author={Ghosh, Shaona and Varshney, Prasoon and Galinkin, Erick and Parisien, Christopher},
  journal={arXiv preprint arXiv:2404.05993},
  year={2024}
}
Dataset Card Authors
Shaona Ghosh, shaonag@nvidia.com
Dataset Card Contact
shaonag@nvidia.com