The RealToxicityPrompts dataset contains 100k sentence fragments extracted from the web, intended to help researchers address toxic degeneration in neural language models. Each instance includes a prompt and its metadata, with toxicity scores generated by the Perspective API. The dataset is built from the OpenWebText Corpus, a collection of English web pages gathered from URLs shared on Reddit, and was drawn by stratified sampling across toxicity ranges. Language: English. License: Apache 2.0.
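To make "stratified sampling across toxicity ranges" concrete, here is a minimal sketch in plain Python. It is not the dataset authors' pipeline: the function name `stratified_sample`, the record fields `text` and `toxicity`, the use of four equal-width bins, and the per-bin count are all assumptions chosen for illustration, and the toy scores stand in for real Perspective API output.

```python
import random
from collections import defaultdict


def stratified_sample(records, n_bins=4, per_bin=2, seed=0):
    """Draw up to `per_bin` records from each equal-width toxicity bin on [0, 1].

    Hypothetical helper: bin count and sample size are illustrative assumptions.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for rec in records:
        # Clamp the index so a score of exactly 1.0 lands in the last bin.
        idx = min(int(rec["toxicity"] * n_bins), n_bins - 1)
        bins[idx].append(rec)

    sample = []
    for idx in range(n_bins):
        pool = bins[idx]
        # Take the same number from every bin, or the whole bin if it is small.
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample


# Toy records with made-up scores spread evenly over [0, 1].
records = [{"text": f"fragment {i}", "toxicity": i / 20} for i in range(21)]
sample = stratified_sample(records)
```

Sampling an equal number from each bin keeps highly toxic fragments, which are rare on the open web, as well represented as benign ones; a uniform random draw would be dominated by low-toxicity text.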