google/civil_comments

The dataset comprises publicly available comments from the Civil Comments platform, a commenting plugin for independent news websites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data include the comment text and some associated metadata (e.g., article ID, timestamp, and the commenter‑generated “civil” label), but omit user IDs. Jigsaw extended the dataset with toxicity and identity‑mention labels. This dataset is an exact replica of the data used in Jigsaw’s “Unintended Bias in Toxicity Classification” challenge on Kaggle. Both the dataset and the underlying comment texts are released under the CC0 license.

Updated 1/25/2024

hugging_face

Description

Dataset Overview

Basic Information

Name: Civil Comments
License: CC0-1.0
Tags: toxic-comment-classification
Task Category: text-classification
Task ID: multi-label-classification

Dataset Structure

Data Features

text: string type
toxicity: float32
severe_toxicity: float32
obscene: float32
threat: float32
insult: float32
identity_attack: float32
sexual_explicit: float32
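
Because every label is a float32 score rather than a class index, multi-label classification usually needs a binarization step first. The following is a minimal sketch of that step, not part of the dataset itself. It assumes scores lie in [0, 1] and uses a cutoff of 0.5, which is an assumption chosen for illustration. The helper name to_multilabel and the example record are hypothetical.

```python
from typing import Dict, List

# Label columns as listed in the feature schema above.
LABEL_COLUMNS: List[str] = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def to_multilabel(record: Dict[str, object], threshold: float = 0.5) -> Dict[str, int]:
    """Map each float score to 0/1; a score >= threshold counts as positive.

    The default threshold of 0.5 is an assumption, not something the
    dataset card prescribes.
    """
    return {name: int(float(record[name]) >= threshold) for name in LABEL_COLUMNS}


# Hypothetical record, invented for illustration only (not taken from the dataset).
example = {
    "text": "placeholder comment text",
    "toxicity": 0.8, "severe_toxicity": 0.0, "obscene": 0.1, "threat": 0.0,
    "insult": 0.6, "identity_attack": 0.0, "sexual_explicit": 0.0,
}

labels = to_multilabel(example)
active = [name for name, flag in labels.items() if flag]
print(active)  # ['toxicity', 'insult']
```

With the Hugging Face datasets library installed, records of this shape would typically come from load_dataset("google/civil_comments"). Raising the threshold trades recall for precision on every label at once, so a per-label cutoff may be worth considering.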

Data Splits

Training set: 1,804,874 samples
Validation set: 97,320 samples
Test set: 97,320 samples
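
As a quick check on the split sizes above, the sketch below computes the total sample count and each split's share. The counts are copied from the list above; the names SPLIT_SIZES and split_fractions are hypothetical.

```python
from typing import Dict

# Split sizes as stated in the "Data Splits" list.
SPLIT_SIZES: Dict[str, int] = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total: {total:,} samples")
for name, frac in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]:,} samples ({frac:.1%})")
```

The training split holds roughly nine tenths of the data, with the remainder divided evenly between validation and test.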

Dataset Creation

License Information

License: CC0 1.0

Citation Information

@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}


Topics

Online Comment Analysis

Text Mining

Source

Organization: hugging_face

Created: Unknown
