Dataset asset · Open Source Community · Text Classification · Fake News Detection

andyP/fake_news_en_opensources

The Fake News Opensources dataset is a curated and cleaned version of the opensources fake‑news collection, containing 5 915 569 articles divided into 12 categories. It is suitable for text‑classification tasks such as topic classification and fact‑checking. The dataset is monolingual (English) and released under the Apache‑2.0 license. It includes fields such as id, type, domain, scraped_at, url, authors, title, and content.

Source: hugging_face
Created: Nov 28, 2025
Updated: Feb 12, 2024
Overview

Dataset description and usage context

Dataset Card: "Fake News Opensources"

Dataset Description

Dataset Summary

"Fake News Opensources" is an integrated and cleaned version of the opensources Fake News dataset. The original collection contained 8 529 090 articles across 12 categories: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, and unknown. The articles were scraped between the end of 2017 and early 2018 from 647 different sources and cover the period from the 2016 U.S. election onward. After extensive cleaning and deduplication, 5 915 569 records remain.

Supported Tasks and Leaderboards

  • Text Classification
  • Fact‑Checking

Language

English

Dataset Structure

Data Instances

An example record:

{
  "id": 4059480,
  "type": "political",
  "domain": "dailycaller.com",
  "scraped_at": "2017-11-27",
  "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
  "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple",
  "title": "The Daily Caller",
  "content": "New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance"
}
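A minimal sketch of turning a record like the one above into a typed object. The `Article` class and `parse_article` helper are hypothetical names introduced here for illustration and are not part of the dataset. The sketch assumes `scraped_at` begins with an ISO date (YYYY-MM-DD), as in the example, and that `authors` is one comma-separated string.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical typed view of one dataset record."""
    id: int
    type: str
    domain: str
    scraped_at: date
    url: str
    authors: list[str]
    title: str
    content: str


def parse_article(record: dict) -> Article:
    """Convert a raw record (a dict shaped like the example) into an Article."""
    # authors is a single comma-separated string; split it and drop empty pieces.
    raw_authors = record.get("authors") or ""
    authors = [name.strip() for name in raw_authors.split(",") if name.strip()]
    return Article(
        id=int(record["id"]),
        type=record["type"],
        domain=record["domain"],
        # Assumption: the first 10 characters are an ISO date; anything else raises ValueError.
        scraped_at=date.fromisoformat(record["scraped_at"][:10]),
        url=record["url"],
        authors=authors,
        title=record["title"],
        content=record["content"],
    )


# The example record from this card (content shortened).
EXAMPLE = {
    "id": 4059480,
    "type": "political",
    "domain": "dailycaller.com",
    "scraped_at": "2017-11-27",
    "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
    "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, "
               "Julia Mcclatchy, Admin, Matt Purple",
    "title": "The Daily Caller",
    "content": "New Hampshire is the state with the highest median income in the nation",
}

article = parse_article(EXAMPLE)
print(article.domain, article.scraped_at, len(article.authors))
```

Splitting `authors` up front makes it easier to spot placeholder entries such as "Admin" that are not real bylines.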

Data Fields

  • id: Unique article identifier
  • type: Label of the record (reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, unknown)
  • domain: Domain of the website the article was scraped from
  • scraped_at: Date on which the article was originally scraped
  • url: Original article URL
  • authors: Scraped author names, stored as a single comma‑separated string
  • title: Original article title
  • content: Full article body
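The fields above can be used to select records by label. The following is a rough sketch, not an official loader: `filter_by_type` and `stream_records` are hypothetical helpers, the split name `"train"` is an assumption, and `stream_records` needs the `datasets` package and network access. `KNOWN_TYPES` accepts both "junk science" and "junksci" because this card uses both spellings and does not say which one appears in the data.

```python
from typing import Iterable, Iterator

# Labels as listed in this card. "junksci" is the spelling used in the split
# table; accepting both spellings is an assumption, not something the card confirms.
KNOWN_TYPES = {
    "reliable", "unreliable", "political", "bias", "fake", "conspiracy",
    "rumor", "clickbait", "junk science", "junksci", "satire", "hate", "unknown",
}


def filter_by_type(records: Iterable[dict], wanted: set[str]) -> Iterator[dict]:
    """Yield only the records whose `type` label is in `wanted`."""
    unknown_labels = wanted - KNOWN_TYPES
    if unknown_labels:
        raise ValueError(f"labels not listed in the card: {sorted(unknown_labels)}")
    for record in records:
        if record.get("type") in wanted:
            yield record


def stream_records(split: str = "train") -> Iterator[dict]:
    """Stream records from the Hugging Face Hub instead of downloading everything.

    Assumptions: `pip install datasets`, network access, and a split named "train".
    """
    from datasets import load_dataset  # imported lazily so the helper above works offline

    dataset = load_dataset("andyP/fake_news_en_opensources", split=split, streaming=True)
    yield from dataset


# Offline usage with a few in-memory records:
sample = [{"type": "fake"}, {"type": "reliable"}, {"type": "satire"}]
selected = list(filter_by_type(sample, {"fake", "satire"}))
print([r["type"] for r in selected])
```

Streaming is chosen here because the corpus has several million records; `filter_by_type(stream_records(), {"fake"})` would then iterate lazily without materializing the full dataset.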

Data Split

Category      Records
reliable      1 807 323
political       968 205
bias            769 874
fake            762 178
conspiracy      494 184
rumor           375 963
unknown         230 532
clickbait       174 176
unreliable      104 537
satire           84 735
junksci          79 099
hate             64 763
Total         5 915 569
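The category counts are heavily imbalanced, which matters for classifier training. The sketch below copies the counts from the table above and derives each category's share along with inverse-frequency class weights. The weighting formula, total / (number of classes × class count), is the same one scikit-learn applies for `class_weight="balanced"`; using it here is an illustrative choice, not something the card prescribes.

```python
# Record counts per category, copied from the table in this card.
COUNTS = {
    "reliable": 1_807_323,
    "political": 968_205,
    "bias": 769_874,
    "fake": 762_178,
    "conspiracy": 494_184,
    "rumor": 375_963,
    "unknown": 230_532,
    "clickbait": 174_176,
    "unreliable": 104_537,
    "satire": 84_735,
    "junksci": 79_099,
    "hate": 64_763,
}

TOTAL = sum(COUNTS.values())

# Fraction of the corpus that each category represents.
shares = {label: count / TOTAL for label, count in COUNTS.items()}

# "Balanced" weights: rare categories get weights above 1, common ones below 1.
weights = {label: TOTAL / (len(COUNTS) * count) for label, count in COUNTS.items()}

for label in COUNTS:
    print(f"{label:<11} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

The counts sum to the stated total of 5 915 569, and the largest category (reliable) is roughly 28 times the size of the smallest (hate).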

Dataset Creation

Source Data

News articles were collected from a variety of websites.

Who produced the source language?

Authors of news articles and blog posts

Annotation

Who performed the annotation?

Journalists

Known Limitations

The dataset has not been manually screened, so some labels may be incorrect, and some URLs point to other pages on the same site rather than to the article itself. According to the dataset card, because the corpus is intended for training machine‑learning algorithms, these issues are not expected to pose a practical problem.
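For users who do want to exclude records whose URL is not an article page, a simple heuristic can flag the obvious cases. The sketch below is an assumption-laden illustration, not part of the dataset's own cleaning: `looks_like_listing_page` is a hypothetical helper that flags only site front pages and paths ending in `/page/<number>/`, the pattern seen in the example record's URL.

```python
import re
from urllib.parse import urlparse

# Paths ending in /page/<number> (with or without a trailing slash) are
# treated as paginated listings. This pattern is a guess based on the
# example record, not a rule documented by the dataset.
_PAGINATION = re.compile(r"/page/\d+/?$")


def looks_like_listing_page(url: str) -> bool:
    """Return True when the URL is probably a front page or a paginated listing."""
    path = urlparse(url).path
    if path in ("", "/"):
        return True  # bare domain: the site front page
    return bool(_PAGINATION.search(path))


print(looks_like_listing_page(
    "http://dailycaller.com/buzz/massachusettsunited-states/page/2/"
))
```

A check like this will miss many non-article pages (tag pages, author pages, search results), so it is best treated as a first-pass filter.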

Only about 80 % of the dataset has been cleaned and released so far. Once it is finalized, it will not be updated further, so it may become outdated for uses other than content‑based algorithms. Contributions are welcome.

License Information

The dataset is provided and distributed under the Apache‑2.0 license.

Citation Information

to be determined
