google/civil_comments
The dataset comprises publicly available comments from the Civil Comments platform, a commenting plugin for independent news websites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data include the comment text and some associated metadata (e.g., article ID, timestamp, and the commenter-generated "civil" label), but omit user IDs. Jigsaw extended the dataset with toxicity and identity-mention labels. This dataset is an exact replica of the data used in Jigsaw's "Unintended Bias in Toxicity Classification" challenge on Kaggle. Both the dataset and the underlying comment texts are released under the CC0 license.
Dataset Overview
Basic Information
- Name: Civil Comments
- License: CC0-1.0
- Tags: toxic-comment-classification
- Task Category: text-classification
- Task ID: multi-label-classification
Dataset Structure
Data Features
- text: string type
- toxicity: float32
- severe_toxicity: float32
- obscene: float32
- threat: float32
- insult: float32
- identity_attack: float32
- sexual_explicit: float32
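Because every label is a float score rather than a class index, multi-label classification usually needs a binarization step first. The sketch below shows one way to do it. The 0.5 cutoff, the `binarize_labels` helper, and the example record are assumptions made for illustration and are not part of the dataset card.

```python
from typing import Dict, Mapping

# Label columns as listed in the feature schema above.
LABEL_COLUMNS = (
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
)


def binarize_labels(
    record: Mapping[str, object], threshold: float = 0.5
) -> Dict[str, int]:
    """Map each float score to 1 if it is >= threshold, else 0.

    The default threshold of 0.5 is an assumption; choose a cutoff that
    suits the task at hand.
    """
    return {
        name: int(float(record[name]) >= threshold) for name in LABEL_COLUMNS
    }


# Hypothetical record, made up for illustration; not taken from the dataset.
example = {
    "text": "placeholder comment text",
    "toxicity": 0.8,
    "severe_toxicity": 0.1,
    "obscene": 0.6,
    "threat": 0.0,
    "insult": 0.7,
    "identity_attack": 0.2,
    "sexual_explicit": 0.0,
}

labels = binarize_labels(example)
print(labels)
```

Keeping the threshold as a parameter makes it easy to compare stricter or looser cutoffs without touching the rest of a pipeline.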
Data Splits
- Training set: 1,804,874 samples
- Validation set: 97,320 samples
- Test set: 97,320 samples
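As a quick sanity check on the figures above, the split proportions can be computed directly from the listed sample counts. This is a minimal sketch: the `SPLIT_SIZES` dictionary and its key names are illustrative, and only the counts themselves come from the listing.

```python
# Sample counts copied from the split listing above.
SPLIT_SIZES = {
    "train": 1_804_874,
    "validation": 97_320,
    "test": 97_320,
}

total = sum(SPLIT_SIZES.values())

# Fraction of all samples that falls into each split.
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size:,} samples ({fractions[name]:.1%})")
print(f"total: {total:,} samples")
```

The validation and test sets are the same size, so metrics computed on the two are based on equally large samples.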
Dataset Creation
License Information
- License: CC0 1.0
Citation Information
@article{DBLP:journals/corr/abs-1903-04561,
  author        = {Daniel Borkan and Lucas Dixon and Jeffrey Sorensen and Nithum Thain and Lucy Vasserman},
  title         = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification},
  journal       = {CoRR},
  volume        = {abs/1903.04561},
  year          = {2019},
  url           = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint        = {1903.04561},
  timestamp     = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}