Dataset asset, Open Source Community, Text Mining, Online Comment Analysis

google/civil_comments

The dataset comprises public comments from Civil Comments, a commenting plugin for independent news sites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a lasting open archive for future research. The original data include the comment text and some associated metadata (e.g., article IDs, timestamps, and commenter-generated “civility” labels), but omit user IDs. Jigsaw extended the dataset with additional labels for toxicity and identity mentions. This dataset is an exact replica of the data released for Jigsaw’s “Unintended Bias in Toxicity Classification” challenge on Kaggle. Both the dataset and the underlying comment text are released under the CC0 license.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 25, 2024

Dataset Overview

Basic Information

  • Name: Civil Comments
  • License: CC0-1.0
  • Tags: toxic-comment-classification
  • Task Category: text-classification
  • Task ID: multi-label-classification

Dataset Structure

Data Features

  • text: string
  • toxicity: float32
  • severe_toxicity: float32
  • obscene: float32
  • threat: float32
  • insult: float32
  • identity_attack: float32
  • sexual_explicit: float32
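
Because the seven label columns above are float32 scores rather than 0/1 flags, a multi-label classifier usually needs them binarized first. The following is a minimal sketch, not the dataset authors' procedure: the 0.5 threshold, the helper names `binarize_labels` and `load_train_split`, and the sample record are all assumptions made for illustration. The loader additionally assumes the Hugging Face `datasets` package is installed and is not called here, since it needs network access.

```python
from typing import Dict, List

# Label columns exactly as listed under "Data Features" (all float32 scores).
LABEL_COLUMNS: List[str] = [
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
]


def binarize_labels(record: Dict[str, object], threshold: float = 0.5) -> List[int]:
    """Turn one record's float scores into a 0/1 multi-label vector.

    The 0.5 default is an assumed cut-off for illustration; pick a
    threshold that suits your own task.
    """
    return [int(float(record[name]) >= threshold) for name in LABEL_COLUMNS]


def load_train_split():
    """Hypothetical helper: fetch the training split from the Hugging Face Hub.

    Assumes the `datasets` package is installed; imported lazily so the
    rest of this sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset("google/civil_comments", split="train")


# A made-up record with the same fields as the dataset (not real data).
example = {
    "text": "placeholder comment",
    "toxicity": 0.8,
    "severe_toxicity": 0.0,
    "obscene": 0.1,
    "threat": 0.0,
    "insult": 0.7,
    "identity_attack": 0.0,
    "sexual_explicit": 0.0,
}

print(binarize_labels(example))  # -> [1, 0, 0, 0, 1, 0, 0]
```

Keeping the column order in a single list means the position of each bit in the output vector stays stable across training and evaluation code.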

Data Splits

  • Training set: 1,804,874 samples
  • Validation set: 97,320 samples
  • Test set: 97,320 samples
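
As a quick sanity check on the split sizes listed above, the sketch below adds them up and reports each split's share of the total. The counts are copied from the list; the variable names are illustrative only.

```python
# Split sizes as listed under "Data Splits".
SPLIT_SIZES = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}

total = sum(SPLIT_SIZES.values())

# Fraction of all samples that falls into each split.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total samples: {total:,}")  # -> total samples: 1,999,514
for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>9,}  {shares[name]:.2%}")
```

The validation and test splits are the same size, so any metric gap between them reflects sampling differences rather than a difference in sample count.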

Dataset Creation

License Information

  • License: CC0 1.0

Citation Information

@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
