Dataset asset | Open Source Community | Language Models | Toxicity Evaluation

ToxicityPrompts/RealToxicityPrompts

The RealToxicityPrompts dataset contains 100k sentence fragments extracted from the web, intended to help researchers address neural toxic degeneration in language models. Each instance includes a prompt and its metadata, with toxicity scores generated via the Perspective API. The dataset is built from the OpenWebText Corpus, a collection of English web pages gathered from URLs shared on Reddit, and was sampled in a stratified manner across toxicity ranges. Language: English. License: Apache 2.0.

Source
hugging_face
Created
Nov 28, 2025
Updated
May 8, 2024
Signals
655 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • Name: Real Toxicity Prompts

Basic Information

  • Language: English
  • License: Apache‑2.0
  • Multilinguality: Monolingual
  • Size: 100 K < n < 1 M
  • Source: Original data
  • Task Category: Text generation
  • Labels: Toxicity, not suitable for all audiences

Dataset Description

  • Summary: RealToxicityPrompts is a dataset of 100k sentence fragments from the web, designed to help researchers address the risk of neural toxic degeneration in language models.
  • Language: English

Structure

  • Data Instances: Each instance contains a prompt and its metadata: the source filename, the begin and end offsets of the fragment, a flag marking challenging prompts, the prompt text with its toxicity scores (e.g., profanity, sexually explicit content, identity attack), and the continuation text with its corresponding toxicity scores.
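
A minimal sketch of how one such instance might be parsed, assuming records are stored as JSON lines with the prompt and continuation as nested objects. The field names (`begin`, `end`, `challenging`) follow the description above but are assumptions here; the `ScoredText` and `Instance` classes, the file name, and all values are hypothetical and invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical record. Field names mirror the description above;
# every value is made up for illustration only.
RAW_LINE = json.dumps({
    "filename": "example.txt",
    "begin": 340,
    "end": 564,
    "challenging": False,
    "prompt": {
        "text": "An illustrative prompt",
        "toxicity": 0.30,
        "profanity": 0.17,
        "identity_attack": 0.26,
    },
    "continuation": {
        "text": " and its illustrative continuation.",
        "toxicity": 0.07,
        "profanity": 0.02,
        "identity_attack": 0.05,
    },
})


@dataclass
class ScoredText:
    text: str
    # Attribute name -> score; None allows for a missing score.
    scores: Dict[str, Optional[float]]


@dataclass
class Instance:
    filename: str
    begin: int
    end: int
    challenging: bool
    prompt: ScoredText
    continuation: ScoredText


def parse_scored(obj: dict) -> ScoredText:
    """Split a nested object into its text and its per-attribute scores."""
    scores = {key: value for key, value in obj.items() if key != "text"}
    return ScoredText(text=obj["text"], scores=scores)


def parse_instance(line: str) -> Instance:
    """Parse one JSON line into an Instance."""
    record = json.loads(line)
    return Instance(
        filename=record["filename"],
        begin=record["begin"],
        end=record["end"],
        challenging=record["challenging"],
        prompt=parse_scored(record["prompt"]),
        continuation=parse_scored(record["continuation"]),
    )


inst = parse_instance(RAW_LINE)
print(inst.prompt.text, inst.prompt.scores["toxicity"])
```

Keeping the scores in a dictionary rather than fixed fields means the sketch does not depend on the exact set of toxicity attributes present in a record.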

Creation

  • Selection Rationale: Sentences were selected from the OpenWebText Corpus and scored for toxicity with the Perspective API, so that the resulting prompts span a range of toxicity levels.
  • License Information: The dataset is released under the Apache 2.0 license.
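
The selection step described above can be sketched as stratified sampling over toxicity scores. This is an illustration only, not the authors' pipeline: it assumes scores in [0, 1] have already been computed (no Perspective API call is made), and the four equal-width bins, the per-bin quota, and the function names `bin_index` and `stratified_sample` are assumptions chosen for the example.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Scored = Tuple[str, float]  # (sentence, toxicity score in [0, 1])


def bin_index(toxicity: float, n_bins: int = 4) -> int:
    """Map a score to an equal-width bin; a score of 1.0 goes in the last bin."""
    return min(int(toxicity * n_bins), n_bins - 1)


def stratified_sample(
    scored: List[Scored], per_bin: int, n_bins: int = 4, seed: int = 0
) -> List[Scored]:
    """Draw up to `per_bin` sentences from each toxicity bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sentence, toxicity in scored:
        bins[bin_index(toxicity, n_bins)].append((sentence, toxicity))

    sample: List[Scored] = []
    for idx in range(n_bins):
        pool = bins[idx]
        # A sparse bin contributes everything it has.
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample


# Synthetic corpus with scores spread evenly over [0, 1].
corpus = [(f"sentence {i}", i / 99) for i in range(100)]
sample = stratified_sample(corpus, per_bin=5)
print(len(sample))
```

Sampling the same number of sentences from each bin keeps highly toxic text, which is rare in web data, from being swamped by the much larger pool of non-toxic sentences.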

Citation

@article{gehman2020realtoxicityprompts,
  title={{RealToxicityPrompts}: Evaluating Neural Toxic Degeneration in Language Models},
  author={Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
  journal={arXiv preprint arXiv:2009.11462},
  year={2020}
}