google/civil_comments
The dataset comprises publicly available comments from the Civil Comments platform, a commenting plugin for independent news websites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data include the comment text and some associated metadata (e.g., article ID, timestamp, and the commenter-generated "civil" label), but omit user IDs. Jigsaw extended the dataset with toxicity and identity-mention labels. This dataset is an exact replica of the data used in Jigsaw's "Unintended Bias in Toxicity Classification" challenge on Kaggle. Both the dataset and the underlying comment texts are released under the CC0 license.
Description
Dataset Overview
Basic Information
- Name: Civil Comments
- License: CC0-1.0
- Tags: toxic-comment-classification
- Task Category: text-classification
- Task ID: multi-label-classification
Dataset Structure
Data Features
- text: string
- toxicity: float32
- severe_toxicity: float32
- obscene: float32
- threat: float32
- insult: float32
- identity_attack: float32
- sexual_explicit: float32
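Because every label above is a float32 score rather than a class ID, a multi-label classifier usually needs the scores converted to 0/1 targets first. The following is a minimal sketch of that step, not part of the dataset itself: the 0.5 threshold, the helper name binarize_labels, and the example record are all assumptions made for illustration.

```python
from typing import Dict, Mapping

# The seven float32 label columns listed under Data Features.
LABEL_COLUMNS = (
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
)


def binarize_labels(record: Mapping[str, object], threshold: float = 0.5) -> Dict[str, int]:
    """Map each float label in a record to 1 if it is >= threshold, else 0.

    The threshold of 0.5 is an assumed default, not something the
    dataset card prescribes; adjust it to suit the task.
    """
    return {name: int(float(record[name]) >= threshold) for name in LABEL_COLUMNS}


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "text": "an invented comment",
    "toxicity": 0.8,
    "severe_toxicity": 0.0,
    "obscene": 0.1,
    "threat": 0.0,
    "insult": 0.6,
    "identity_attack": 0.0,
    "sexual_explicit": 0.0,
}

targets = binarize_labels(example)
print(targets)
```

Real records with these columns could be obtained with the Hugging Face datasets library, for example via load_dataset("google/civil_comments"), and passed through the same helper one row at a time.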
Data Splits
- Training set: 1,804,874 samples
- Validation set: 97,320 samples
- Test set: 97,320 samples
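As a quick sanity check on the split sizes listed above, the sketch below computes the total number of samples and each split's share of it. The counts are copied from the list; the variable names are illustrative.

```python
# Split sizes as listed under Data Splits.
split_sizes = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}

total = sum(split_sizes.values())

# Fraction of all samples that falls into each split.
split_fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total samples: {total:,}")
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.2%}")
```

The training split holds roughly nine tenths of the data, with the remainder divided equally between validation and test.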
Dataset Creation
License Information
- License: CC0 1.0
Citation Information
@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Source
Organization: hugging_face