ToxicityPrompts/RealToxicityPrompts
The Real Toxicity Prompts dataset contains 100K sentence fragments extracted from the web, intended to help researchers address the risk of toxic degeneration in neural language models. Each instance includes a prompt and its metadata, with toxicity scores generated via the Perspective API. The dataset is built from the OpenWebText Corpus, a collection of English web pages gathered from URLs shared on Reddit, and was sampled in strata across toxicity ranges. Language: English. License: Apache 2.0.
Description
Dataset Overview
Dataset Name
- Name: Real Toxicity Prompts
Basic Information
- Language: English
- License: Apache‑2.0
- Multilinguality: Monolingual
- Size: 100 K < n < 1 M
- Source: Original data
- Task Category: Text generation
- Labels: Toxicity, not suitable for all audiences
Dataset Description
- Summary: RealToxicityPrompts is a dataset of 100K sentence fragments designed to help researchers address the risk of neural toxic degeneration in language models.
Structure
- Data Instances: Each instance contains a prompt and its metadata: the source filename, start/end offsets, a flag marking challenging prompts, the prompt text with its toxicity scores (e.g., profanity, sexually explicit content, identity attack), and the continuation text with its corresponding toxicity scores.
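The instance layout described above can be sketched as a small Python parser. This is an illustrative sketch, not the dataset's official loader: the field names (`filename`, `begin`, `end`, `challenging`, `prompt`, `continuation`) and the one-JSON-object-per-line layout are assumptions based on the description, and the sample record is made up.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ScoredText:
    """A text span plus its per-attribute toxicity scores."""
    text: str
    scores: Dict[str, Optional[float]]  # a score may be missing (None)


@dataclass
class Instance:
    """One record: source location, challenge flag, prompt, continuation."""
    filename: str
    begin: int
    end: int
    challenging: bool
    prompt: ScoredText
    continuation: ScoredText


def parse_scored_text(obj: dict) -> ScoredText:
    # Every key except "text" is treated as a toxicity attribute score.
    scores = {key: value for key, value in obj.items() if key != "text"}
    return ScoredText(text=obj["text"], scores=scores)


def parse_instance(line: str) -> Instance:
    # Assumed layout: one JSON object per line.
    raw = json.loads(line)
    return Instance(
        filename=raw["filename"],
        begin=int(raw["begin"]),
        end=int(raw["end"]),
        challenging=bool(raw["challenging"]),
        prompt=parse_scored_text(raw["prompt"]),
        continuation=parse_scored_text(raw["continuation"]),
    )


# Made-up record for illustration only; the values are not from the dataset.
example_line = json.dumps({
    "filename": "example.txt",
    "begin": 0,
    "end": 42,
    "challenging": False,
    "prompt": {"text": "The weather today is", "toxicity": 0.02, "profanity": 0.01},
    "continuation": {"text": " surprisingly pleasant.", "toxicity": 0.01, "profanity": None},
})

instance = parse_instance(example_line)
print(instance.prompt.text, instance.prompt.scores["toxicity"])
```

Keeping the scores in a dictionary rather than as fixed attributes means the sketch does not depend on the exact set of toxicity attributes present in a given record.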
Creation
- Selection Rationale: Sentences were selected from the OpenWebText Corpus and scored with the Perspective API so that the resulting prompts cover the full spectrum of toxicity levels.
- License Information: The dataset is released under the Apache 2.0 license.
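The selection across toxicity ranges mentioned under Selection Rationale can be illustrated with a minimal stratified-sampling sketch. It assumes each sentence already has a toxicity score in [0, 1] and uses equal-width bins; the function names, the default of four bins, and the per-bin quota are illustrative parameters, not the authors' released code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredSentence = Tuple[str, float]  # (text, toxicity score in [0, 1])


def bin_index(score: float, num_bins: int) -> int:
    """Map a score in [0, 1] to an equal-width bin; 1.0 falls in the last bin."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return min(int(score * num_bins), num_bins - 1)


def stratified_sample(
    scored: List[ScoredSentence],
    per_bin: int,
    num_bins: int = 4,  # assumed bin count for this sketch
    seed: int = 0,
) -> List[ScoredSentence]:
    """Draw up to `per_bin` sentences uniformly at random from each bin."""
    rng = random.Random(seed)
    bins: Dict[int, List[ScoredSentence]] = defaultdict(list)
    for text, score in scored:
        bins[bin_index(score, num_bins)].append((text, score))

    sample: List[ScoredSentence] = []
    for idx in range(num_bins):
        pool = bins[idx]
        # A sparse bin contributes everything it has rather than failing.
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample


# Synthetic scores for illustration only.
synthetic = [(f"sentence {i}", i / 100) for i in range(101)]
picked = stratified_sample(synthetic, per_bin=5)
print(len(picked))
```

Sampling per bin, rather than from the whole pool, keeps rare high-toxicity sentences represented even though most web text scores low.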
Citation
@article{gehman2020realtoxicityprompts,
title={Realtoxicityprompts: Evaluating neural toxic degeneration in language models},
author={Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
journal={arXiv preprint arXiv:2009.11462},
year={2020}
}
Source
Organization: hugging_face
Created: Unknown