Dataset asset, Open Source Community, Language Models, Toxicity Evaluation
ToxicityPrompts/RealToxicityPrompts
The RealToxicityPrompts dataset contains 100k sentence fragments extracted from the web, intended to help researchers address neural toxic degeneration in language models. Each instance includes a prompt and its metadata, with toxicity scores generated by the Perspective API. The dataset is built from the OpenWebText Corpus, which consists of English web pages extracted from URLs shared on Reddit, and was sampled in a stratified manner across toxicity ranges. Language: English. License: Apache 2.0.
Source
hugging_face
Created
Nov 28, 2025
Updated
May 8, 2024
Dataset Overview
Dataset Name
- Name: Real Toxicity Prompts
Basic Information
- Language: English
- License: Apache‑2.0
- Multilinguality: Monolingual
- Size: 100K < n < 1M
- Source: Original data
- Task Category: Text generation
- Labels: Toxicity, not suitable for all audiences
Dataset Description
- Summary: RealToxicityPrompts is a dataset of 100k sentence fragments from the web, designed to help researchers address the risk of neural toxic degeneration in language models.
- Language: English
Structure
- Data Instances: Each instance contains a prompt and its metadata: the source filename, start/end offsets, a flag marking challenging prompts, and the prompt text with its toxicity scores (e.g., profanity, sexually explicit content, identity attack). It also contains the continuation text and its corresponding toxicity scores.
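As a rough illustration of the record layout described above, the sketch below builds one hypothetical instance as a plain Python dictionary and picks out its highest-scoring attribute. The key names (`filename`, `begin`, `end`, `challenging`, `prompt`, `continuation`) and every value are assumptions made for illustration; check them against the actual dataset before relying on them. The helper `max_score` is likewise hypothetical and not part of any dataset tooling.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical record: key names are assumed from the description above,
# and all values are invented for illustration only.
record: Dict[str, Any] = {
    "filename": "example.txt",
    "begin": 0,
    "end": 42,
    "challenging": False,
    "prompt": {
        "text": "An example prompt",
        "toxicity": 0.12,
        "profanity": 0.05,
        "identity_attack": 0.02,
    },
    "continuation": {
        "text": " and its continuation.",
        "toxicity": 0.30,
        "profanity": None,  # assume a score may be missing
        "identity_attack": 0.01,
    },
}


def max_score(part: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (attribute, score) for the highest numeric score, or None."""
    scores = {
        name: value
        for name, value in part.items()
        # Skip the text field, missing scores, and any boolean values.
        if name != "text"
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best]


print(max_score(record["prompt"]))
print(max_score(record["continuation"]))
```

Keeping the prompt and continuation as two nested dictionaries with the same shape means one helper can score either half of an instance.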
Creation
- Selection Rationale: Sentences were selected from the OpenWebText Corpus and scored for toxicity with the Perspective API, so that the resulting prompts span the full range of toxicity levels.
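The idea of sampling prompts across a spectrum of toxicity levels can be sketched as simple stratified sampling over score bins. This is a minimal sketch, not the authors' pipeline: the equal-width bins, the bin count, the per-bin quota, and the function name `stratified_sample` are all assumptions chosen for illustration, and the scores are taken as given numbers in [0, 1] instead of being fetched from the Perspective API.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (sentence text, toxicity score in [0, 1])


def stratified_sample(
    items: List[Scored], n_bins: int = 4, per_bin: int = 2, seed: int = 0
) -> List[Scored]:
    """Draw up to `per_bin` items from each equal-width score bin."""
    rng = random.Random(seed)  # fixed seed for repeatable draws
    bins: Dict[int, List[Scored]] = defaultdict(list)
    for text, score in items:
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((text, score))

    sample: List[Scored] = []
    for idx in range(n_bins):
        bucket = bins.get(idx, [])
        # Take fewer items when a bin holds less than the quota.
        sample.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sample


# Toy input with invented scores spread evenly over [0, 1).
toy_items = [(f"sentence {i}", i / 20) for i in range(20)]
for text, score in stratified_sample(toy_items):
    print(f"{score:.2f}  {text}")
```

Sampling a fixed quota per bin keeps rare high-toxicity sentences from being swamped by the far more common low-toxicity ones.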
- License Information: The dataset is released under the Apache 2.0 license.
Citation
@article{gehman2020realtoxicityprompts,
title={RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models},
author={Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
journal={arXiv preprint arXiv:2009.11462},
year={2020}
}