
google/civil_comments

The dataset comprises public comments archived from Civil Comments, a commenting plugin for independent news websites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data includes the comment text and some associated metadata (e.g., article ID, timestamp, and commenter-generated "civility" labels) but omits user IDs. Jigsaw extended the dataset with labels for toxicity and identity mentions. This dataset is an exact replica of the data released for Jigsaw's "Unintended Bias in Toxicity Classification" challenge on Kaggle. Both the dataset and the underlying comment text are released under the CC0 license.

Updated 1/25/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Name: Civil Comments
  • License: CC0-1.0
  • Tags: toxic-comment-classification
  • Task Category: text-classification
  • Task ID: multi-label-classification

Dataset Structure

Data Features

  • text: string
  • toxicity: float32
  • severe_toxicity: float32
  • obscene: float32
  • threat: float32
  • insult: float32
  • identity_attack: float32
  • sexual_explicit: float32
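
Because every label is a float32 score and the task is multi-label classification, a common preprocessing step is to turn the scores into 0/1 targets. The sketch below is one way to do that in plain Python. The 0.5 threshold, the helper name binarize_labels, and the example row with its score values are illustrative assumptions; they are not part of the dataset card.

```python
from typing import Dict, List, Union

# Label columns listed under "Data Features" (everything except "text").
LABEL_COLUMNS: List[str] = [
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
]


def binarize_labels(
    row: Dict[str, Union[str, float]], threshold: float = 0.5
) -> List[int]:
    """Return one 0/1 target per label column.

    The threshold is an assumption for illustration; choose a value
    that fits the downstream task.
    """
    return [1 if float(row[name]) >= threshold else 0 for name in LABEL_COLUMNS]


# Hypothetical row: the text and scores below are made up for this example.
example_row = {
    "text": "placeholder comment text",
    "toxicity": 0.8,
    "severe_toxicity": 0.1,
    "obscene": 0.0,
    "threat": 0.0,
    "insult": 0.6,
    "identity_attack": 0.0,
    "sexual_explicit": 0.0,
}

targets = binarize_labels(example_row)
print(dict(zip(LABEL_COLUMNS, targets)))
```

Keeping the column order in a single list means the position of each target stays aligned with its label name across training and evaluation.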

Data Splits

  • Training set: 1,804,874 samples
  • Validation set: 97,320 samples
  • Test set: 97,320 samples
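
The split sizes above can be turned into proportions with a few lines of arithmetic, which is handy as a sanity check after downloading. The sketch below also includes an optional helper that compares these counts against a downloaded copy. It assumes the Hugging Face datasets library is installed, that the repository id is google/civil_comments as in the title of this page, and that the splits are named train, validation, and test. The helper is defined but not called here, since it needs network access.

```python
from typing import Dict

# Row counts copied from the "Data Splits" list above.
EXPECTED_ROWS: Dict[str, int] = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def check_downloaded_sizes(repo_id: str = "google/civil_comments") -> Dict[str, bool]:
    """Compare the listed row counts with a downloaded copy.

    Assumes the `datasets` library and the split names used in
    EXPECTED_ROWS; imported lazily so the rest of the sketch runs offline.
    """
    from datasets import load_dataset

    return {
        name: load_dataset(repo_id, split=name).num_rows == expected
        for name, expected in EXPECTED_ROWS.items()
    }


total_rows = sum(EXPECTED_ROWS.values())
fractions = split_fractions(EXPECTED_ROWS)
for name, count in EXPECTED_ROWS.items():
    print(f"{name}: {count:,} rows ({fractions[name]:.2%} of {total_rows:,})")
```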

Dataset Creation

License Information

  • License: CC0 1.0

Citation Information

@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}

Topics

Online Comment Analysis
Text Mining

Source

Organization: hugging_face

Created: Unknown
