andyP/fake_news_en_opensources
The Fake News Opensources dataset is a curated and cleaned version of the opensources fake‑news collection, containing 5 915 569 articles divided into 12 categories. It is suitable for text‑classification tasks such as topic classification and fact‑checking. The dataset is monolingual (English) and released under the Apache‑2.0 license. It includes fields such as id, type, domain, scraped_at, url, authors, title, and content.
Description
Dataset Card: "Fake News Opensources"
Dataset Description
Dataset Summary
"Fake News Opensources" is an integrated and cleaned version of the opensources Fake News dataset. It originally contained 8 529 090 articles across 12 categories: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, and unknown. Articles were scraped from the end of 2017 to early 2018 from 647 different sources, covering the period from the 2016 U.S. election onward. After extensive cleaning and deduplication, 5 915 569 records remain.
Supported Tasks and Leaderboards
- Text Classification
- Fact‑Checking
Language
English
Dataset Structure
Data Instances
An example record:
{
"id": 4059480,
"type": "political",
"domain": "dailycaller.com",
"scraped_at": "2017-11-27",
"url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
"authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple",
"title": "The Daily Caller",
"content": "New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance"
}
Data Fields
- id: Unique article identifier
- type: Label of the record (reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, unknown)
- domain: Domain of the source website, as in the example record
- scraped_at: Original scrape date
- url: Original article URL
- authors: Comma‑separated list of scraped authors
- title: Original article title
- content: Full article body
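Because `authors` is stored as a single comma-separated string and `scraped_at` as text, a small normalization step can be useful before modeling. The following is a minimal sketch using the example record from the Data Instances section. The helper name `normalize_record` is hypothetical, and the code assumes every `scraped_at` value starts with a `YYYY-MM-DD` date as in that example, which may not hold for all rows.

```python
from datetime import date

# Example record copied from the "Data Instances" section.
record = {
    "id": 4059480,
    "type": "political",
    "domain": "dailycaller.com",
    "scraped_at": "2017-11-27",
    "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
    "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, "
               "Julia Mcclatchy, Admin, Matt Purple",
    "title": "The Daily Caller",
    "content": "New Hampshire is the state with the highest median income in "
               "the nation, according to the U.S. Census Bureau\u2019s report on "
               "income, poverty and health insurance",
}


def normalize_record(raw: dict) -> dict:
    """Return a copy with `authors` as a list and `scraped_at` as a date.

    Hypothetical helper, not part of the dataset or any library.
    """
    out = dict(raw)
    # Split the comma-separated author string; tolerate missing or empty values.
    authors_text = raw.get("authors") or ""
    out["authors"] = [a.strip() for a in authors_text.split(",") if a.strip()]
    # Assumption: the first 10 characters are an ISO date (YYYY-MM-DD).
    out["scraped_at"] = date.fromisoformat(raw["scraped_at"][:10])
    return out


normalized = normalize_record(record)
print(normalized["authors"])
print(normalized["scraped_at"].year)
```

Note that scraped author lists can contain non-person entries such as "Admin" in this record, so further filtering may be needed depending on the use case.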
Data Split
| Category | Records |
|---|---|
| reliable | 1 807 323 |
| political | 968 205 |
| bias | 769 874 |
| fake | 762 178 |
| conspiracy | 494 184 |
| rumor | 375 963 |
| unknown | 230 532 |
| clickbait | 174 176 |
| unreliable | 104 537 |
| satire | 84 735 |
| junksci | 79 099 |
| hate | 64 763 |
| Total | 5 915 569 |
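The table shows a strongly imbalanced label distribution, which matters when training a classifier on `type`. As a rough sketch, the counts above can be turned into class shares and inverse-frequency weights. The weighting formula `total / (n_classes * count)` is one common convention chosen here for illustration, not something the dataset prescribes.

```python
# Record counts per category, copied from the table above.
counts = {
    "reliable": 1_807_323,
    "political": 968_205,
    "bias": 769_874,
    "fake": 762_178,
    "conspiracy": 494_184,
    "rumor": 375_963,
    "unknown": 230_532,
    "clickbait": 174_176,
    "unreliable": 104_537,
    "satire": 84_735,
    "junksci": 79_099,
    "hate": 64_763,
}

total = sum(counts.values())

# Fraction of the corpus that each label represents.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights (assumed convention): rarer labels get larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:<11} share={shares[label]:.4f} weight={weights[label]:.3f}")
```

With these counts, "reliable" makes up roughly 31% of the records while "hate" is about 1%, so the largest class is around 28 times the size of the smallest.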
Dataset Creation
Source Data
News articles were collected from a variety of websites.
Who produced the source language?
News articles, blogs
Annotation
Who performed the annotation?
Journalists
Known Limitations
The dataset has not been manually screened, so some labels may be incorrect, and some URLs may point to other pages on the site rather than to the actual article. However, because the corpus is intended for training machine‑learning algorithms, these issues should not pose a practical problem.
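For users who do want to screen out URLs that are not article pages, a simple heuristic is one option. The sketch below flags site roots and paths ending in pagination, tag, category, or author segments. The function name and the pattern list are assumptions made for illustration. They are not derived from the dataset and will miss many cases.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for listing or index pages (an assumption, not a rule
# documented by the dataset): /page/<n>/, /tag/<x>/, /category/<x>/, /author/<x>/.
_LISTING_TAIL = re.compile(
    r"(^|/)(page/\d+|tag/[^/]+|category/[^/]+|author/[^/]+)/?$",
    re.IGNORECASE,
)


def looks_like_listing_page(url: str) -> bool:
    """Return True if the URL path looks like a site root or a listing page."""
    path = urlparse(url).path
    if path in ("", "/"):
        return True  # bare domain, not an individual article
    return bool(_LISTING_TAIL.search(path))


# The URL from the example record ends in /page/2/, so it is flagged.
example_url = "http://dailycaller.com/buzz/massachusettsunited-states/page/2/"
print(looks_like_listing_page(example_url))
```

A filter like this trades recall for simplicity, so it is best treated as a first pass before any manual review.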
Only about 80% of the dataset has been cleaned and released so far. Once finalized, it will not be updated further, so it may become outdated for uses other than content‑based algorithms. Contributions are welcome.
License Information
The dataset is provided and distributed under the Apache‑2.0 license.
Citation Information
to be determined
Source
Organization: hugging_face