andyP/fake_news_en_opensources
The Fake News Opensources dataset is a curated and cleaned version of the opensources fake‑news collection, containing 5 915 569 articles divided into 12 categories. It is suitable for text‑classification tasks such as topic classification and fact‑checking. The dataset is monolingual (English) and released under the Apache‑2.0 license. It includes fields such as id, type, domain, scraped_at, url, authors, title, and content.
Dataset Card: "Fake News Opensources"
Dataset Description
Dataset Summary
"Fake News Opensources" is an integrated and cleaned version of the opensources Fake News dataset. It originally contained 8 529 090 articles across 12 categories: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, and unknown. Articles were scraped from the end of 2017 to early 2018 from 647 different sources, covering the period from the 2016 U.S. election onward. After extensive cleaning and deduplication, 5 915 569 records remain.
Supported Tasks and Leaderboards
- Text Classification
- Fact‑Checking
Language
English
Dataset Structure
Data Instances
An example record:

```json
{
  "id": 4059480,
  "type": "political",
  "domain": "dailycaller.com",
  "scraped_at": "2017-11-27",
  "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
  "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple",
  "title": "The Daily Caller",
  "content": "New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance"
}
```
Data Fields
- id: unique article identifier
- type: label of the record (reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, unknown)
- domain: domain of the source website
- scraped_at: original scrape date
- url: original article URL
- authors: list of scraped authors, comma-separated
- title: original article title
- content: full article body
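For text-classification use, the string type labels are typically mapped to integer class ids. A minimal sketch (the label ordering and function name here are arbitrary choices, not prescribed by the dataset):

```python
# The 12 type labels from the field description above, in an arbitrary order.
LABELS = [
    "reliable", "unreliable", "political", "bias", "fake", "conspiracy",
    "rumor", "clickbait", "junk science", "satire", "hate", "unknown",
]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}

def encode_type(example):
    """Add an integer class id derived from the string `type` label."""
    example["label"] = LABEL2ID[example["type"]]
    return example

row = encode_type({"type": "political", "content": "..."})
print(row["label"])  # integer id for "political" under this mapping
```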
Data Split
| Category | Records |
|---|---|
| reliable | 1 807 323 |
| political | 968 205 |
| bias | 769 874 |
| fake | 762 178 |
| conspiracy | 494 184 |
| rumor | 375 963 |
| unknown | 230 532 |
| clickbait | 174 176 |
| unreliable | 104 537 |
| satire | 84 735 |
| junksci | 79 099 |
| hate | 64 763 |
| Total | 5 915 569 |
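The split is heavily imbalanced. A quick sketch that verifies the stated total and computes each class's share from the table above (the variable names are illustrative):

```python
# Per-category record counts copied from the split table above.
COUNTS = {
    "reliable": 1_807_323, "political": 968_205, "bias": 769_874,
    "fake": 762_178, "conspiracy": 494_184, "rumor": 375_963,
    "unknown": 230_532, "clickbait": 174_176, "unreliable": 104_537,
    "satire": 84_735, "junksci": 79_099, "hate": 64_763,
}

total = sum(COUNTS.values())  # matches the stated total of 5 915 569
shares = {k: v / total for k, v in COUNTS.items()}

# The largest class ("reliable") covers roughly 30% of all records, so
# training on this data typically calls for class weighting or resampling.
print(f"total = {total}, reliable share = {shares['reliable']:.1%}")
```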
Dataset Creation
Source Data
News articles were collected from 647 different websites.
Who are the source language producers?
Writers of news articles and blog posts on the scraped sites.
Annotation
Who performed the annotation?
Journalists
Known Limitations
The dataset has not been manually screened, so some labels may be incorrect and some URLs may not point to the actual article but to other pages on the site. However, because the corpus is intended for training machine‑learning algorithms, these issues should not pose a practical problem.
When the dataset is finalized (currently only ~80 % cleaned and released), it will not be updated further, so it may become outdated for uses beyond content‑based algorithms. Contributions are welcome.
License Information
The dataset is provided and distributed under the Apache‑2.0 license.
Citation Information
to be determined