andyP/fake_news_en_opensources

The Fake News Opensources dataset is a curated and cleaned version of the opensources fake‑news collection, containing 5 915 569 articles divided into 12 categories. It is suitable for text‑classification tasks such as topic classification and fact‑checking. The dataset is monolingual (English) and released under the Apache‑2.0 license. It includes fields such as id, type, domain, scraped_at, url, authors, title, and content.

Updated 2/12/2024

Description

Dataset Card: "Fake News Opensources"

Dataset Description

Dataset Summary

"Fake News Opensources" is an integrated and cleaned version of the opensources Fake News dataset. It originally contained 8 529 090 articles across 12 categories: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, and unknown. Articles were scraped from the end of 2017 to early 2018 from 647 different sources, covering the period from the 2016 U.S. election onward. After extensive cleaning and deduplication, 5 915 569 records remain.

Supported Tasks and Leaderboards

  • Text Classification
  • Fact‑Checking

Language

English

Dataset Structure

Data Instances

An example record:

{
  "id": 4059480,
  "type": "political",
  "domain": "dailycaller.com",
  "scraped_at": "2017-11-27",
  "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
  "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple",
  "title": "The Daily Caller",
  "content": "New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance"
}
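
Records of this shape can be read lazily rather than downloading the whole corpus up front. The sketch below assumes the dataset is published on the Hugging Face Hub under the id in this page's title and exposes a `train` split; both are assumptions to check against the hub page. `stream_records` and `count_types` are hypothetical helper names, not part of the dataset or of any library.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator


def stream_records(repo_id: str = "andyP/fake_news_en_opensources",
                   split: str = "train") -> Iterator[dict]:
    """Yield records one at a time without materializing the full corpus.

    Assumption: the repo id and split name match the hub copy.
    Requires the third-party `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily on first use

    yield from load_dataset(repo_id, split=split, streaming=True)


def count_types(records: Iterable[dict], limit: int = 1000) -> Counter:
    """Tally the `type` label over the first `limit` records."""
    return Counter(rec["type"] for rec in islice(records, limit))
```

Under those assumptions, `count_types(stream_records(), limit=10_000)` would tally the labels of the first 10 000 streamed records. The head of a stream is not a random sample, so such a tally is only a quick sanity check.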

Data Fields

  • id: Unique article identifier
  • type: Label of the record (reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, unknown); junk science is listed as junksci in the category table
  • domain: Domain of the website the article was scraped from (for example, dailycaller.com)
  • scraped_at: Original scrape date
  • url: Original article URL
  • authors: Scraped author names as a single comma-separated string
  • title: Original article title
  • content: Full article body
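
As a rough illustration of these fields, the following Python sketch converts one raw record into a typed object and splits the comma-separated `authors` string into a list. The `Article` class and `parse_record` function are hypothetical names introduced here. The sample is the example record from this card with its `content` abbreviated, and the ISO date format of `scraped_at` is assumed from that single example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """Typed view of one record (hypothetical helper, not part of the dataset)."""
    id: int
    type: str
    domain: str
    scraped_at: date
    url: str
    authors: list[str]
    title: str
    content: str


def parse_record(raw: dict) -> Article:
    """Convert a raw record dict into an Article.

    Assumes `scraped_at` starts with an ISO date (YYYY-MM-DD); only the
    first 10 characters are parsed, so a trailing time part is ignored.
    """
    # Split the comma-separated author string and drop empty entries.
    authors = [a.strip() for a in (raw.get("authors") or "").split(",") if a.strip()]
    return Article(
        id=int(raw["id"]),
        type=raw["type"],
        domain=raw["domain"],
        scraped_at=date.fromisoformat(raw["scraped_at"][:10]),
        url=raw["url"],
        authors=authors,
        title=raw["title"],
        content=raw["content"],
    )


example = {
    "id": 4059480,
    "type": "political",
    "domain": "dailycaller.com",
    "scraped_at": "2017-11-27",
    "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
    "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, "
               "Julia Mcclatchy, Admin, Matt Purple",
    "title": "The Daily Caller",
    "content": "New Hampshire is the state with the highest median income in the nation",
}

article = parse_record(example)
print(article.domain, article.scraped_at, len(article.authors))
```

Keeping `authors` as a list makes it easier to spot scraping noise such as the generic "Admin" entry in this record.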

Data Split

Category       Records
reliable     1 807 323
political      968 205
bias           769 874
fake           762 178
conspiracy     494 184
rumor          375 963
unknown        230 532
clickbait      174 176
unreliable     104 537
satire          84 735
junksci         79 099
hate            64 763
Total        5 915 569
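
The per-category counts above are strongly imbalanced: reliable accounts for about 31 % of the records and hate for about 1 %. The sketch below recomputes the total, derives each category's share, and computes inverse-frequency class weights using the formula total / (number of classes × class count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`. Whether to reweight at all is a modeling choice; this is only one common option.

```python
# Record counts per category, copied from the table above.
counts = {
    "reliable": 1_807_323,
    "political": 968_205,
    "bias": 769_874,
    "fake": 762_178,
    "conspiracy": 494_184,
    "rumor": 375_963,
    "unknown": 230_532,
    "clickbait": 174_176,
    "unreliable": 104_537,
    "satire": 84_735,
    "junksci": 79_099,
    "hate": 64_763,
}

total = sum(counts.values())

# Share of the corpus held by each category.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: rare categories get weights above 1,
# frequent ones below 1.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<11}{n:>10,}  {shares[label]:6.1%}  weight={weights[label]:.2f}")
print(f"{'total':<11}{total:>10,}")
```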

Dataset Creation

Source Data

News articles were collected from a variety of websites.

Who produced the source language?

News articles, blogs

Annotation

Who performed the annotation?

Journalists

Known Limitations

The dataset has not been manually screened, so some labels may be incorrect, and some URLs may point to other pages on the site rather than to the actual article. Because the corpus is intended for training machine-learning algorithms, the card treats this noise as acceptable in practice, but users should expect some mislabeled records and non-article URLs.
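
One way to reduce the URL noise is to flag addresses that look like index or listing pages before training. The sketch below is a heuristic under stated assumptions: the path patterns are illustrative guesses, not rules derived from the dataset, and `looks_like_listing_page` is a hypothetical helper name. The example record's URL, which ends in `/page/2/`, is the kind of address it is meant to catch.

```python
import re
from urllib.parse import urlparse

# Path patterns that often indicate a listing page rather than a single
# article. These are illustrative assumptions and will miss many cases.
_LISTING_PATTERNS = [
    re.compile(r"/page/\d+/?$"),                   # paginated archives
    re.compile(r"^/(tag|tags|category|author)/"),  # taxonomy and author pages
]


def looks_like_listing_page(url: str) -> bool:
    """Return True if the URL path resembles a home or listing page."""
    path = urlparse(url).path
    if path in ("", "/"):
        return True  # bare domain: a home page, not an article
    return any(pattern.search(path) for pattern in _LISTING_PATTERNS)


print(looks_like_listing_page(
    "http://dailycaller.com/buzz/massachusettsunited-states/page/2/"
))
```

A filter like this trades recall for simplicity; records it flags are better reviewed or down-weighted than silently dropped, since the heuristic can also hit genuine articles.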

Only about 80 % of the dataset has been cleaned and released so far. Once it is finalized, it will not be updated further, so it may become outdated for uses beyond content-based algorithms. Contributions are welcome.

License Information

The dataset is provided and distributed under the Apache‑2.0 license.

Citation Information

to be determined

Topics

Fake News Detection
Text Classification

Source

Organization: hugging_face

Created: Unknown
