andyP/fake_news_en_opensources
The Fake News Opensources dataset is a curated and cleaned version of the opensources fake‑news collection, containing 5 915 569 articles divided into 12 categories. It is suitable for text‑classification tasks such as topic classification and fact‑checking. The dataset is monolingual (English) and released under the Apache‑2.0 license. It includes fields such as id, type, domain, scraped_at, url, authors, title, and content.
Dataset Card: "Fake News Opensources"
Dataset Description
Dataset Summary
"Fake News Opensources" is an integrated and cleaned version of the opensources Fake News dataset. It originally contained 8 529 090 articles across 12 categories: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, and unknown. Articles were scraped from the end of 2017 to early 2018 from 647 different sources, covering the period from the 2016 U.S. election onward. After extensive cleaning and deduplication, 5 915 569 records remain.
Supported Tasks and Leaderboards
- Text Classification
- Fact‑Checking
Language
English
Dataset Structure
Data Instances
An example record:

```json
{
  "id": 4059480,
  "type": "political",
  "domain": "dailycaller.com",
  "scraped_at": "2017-11-27",
  "url": "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",
  "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple",
  "title": "The Daily Caller",
  "content": "New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance"
}
```
Data Fields
- id: unique article identifier
- type: label of the record (reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate, unknown)
- domain: domain of the source website
- scraped_at: original scrape date
- url: original article URL
- authors: list of scraped authors, comma-separated
- title: original article title
- content: full article body
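For text-classification use, the string type labels are typically mapped to integer class ids. A minimal sketch (the label ordering and function name here are arbitrary choices, not prescribed by the dataset):

```python
# The 12 type labels from the field description above, in an arbitrary order.
LABELS = [
    "reliable", "unreliable", "political", "bias", "fake", "conspiracy",
    "rumor", "clickbait", "junk science", "satire", "hate", "unknown",
]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}

def encode_type(example):
    """Add an integer class id derived from the string `type` label."""
    example["label"] = LABEL2ID[example["type"]]
    return example

row = encode_type({"type": "political", "content": "..."})
print(row["label"])  # integer id for "political" under this mapping
```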
Data Split
| Category | Records |
|---|---|
| reliable | 1 807 323 |
| political | 968 205 |
| bias | 769 874 |
| fake | 762 178 |
| conspiracy | 494 184 |
| rumor | 375 963 |
| unknown | 230 532 |
| clickbait | 174 176 |
| unreliable | 104 537 |
| satire | 84 735 |
| junksci | 79 099 |
| hate | 64 763 |
| Total | 5 915 569 |
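The split is heavily imbalanced. A quick sketch that verifies the stated total and computes each class's share from the table above (the variable names are illustrative):

```python
# Per-category record counts copied from the split table above.
COUNTS = {
    "reliable": 1_807_323, "political": 968_205, "bias": 769_874,
    "fake": 762_178, "conspiracy": 494_184, "rumor": 375_963,
    "unknown": 230_532, "clickbait": 174_176, "unreliable": 104_537,
    "satire": 84_735, "junksci": 79_099, "hate": 64_763,
}

total = sum(COUNTS.values())  # matches the stated total of 5 915 569
shares = {k: v / total for k, v in COUNTS.items()}

# The largest class ("reliable") covers roughly 30% of all records, so
# training on this data typically calls for class weighting or resampling.
print(f"total = {total}, reliable share = {shares['reliable']:.1%}")
```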
Dataset Creation
Source Data
News articles were collected from 647 different websites.
Who are the source language producers?
Writers of news articles and blog posts on the scraped sites.
Annotation
Who performed the annotation?
Journalists
Known Limitations
The dataset has not been manually screened, so some labels may be incorrect and some URLs may not point to the actual article but to other pages on the site. However, because the corpus is intended for training machine‑learning algorithms, these issues should not pose a practical problem.
When the dataset is finalized (currently only ~80 % cleaned and released), it will not be updated further, so it may become outdated for uses beyond content‑based algorithms. Contributions are welcome.
License Information
The dataset is provided and distributed under the Apache‑2.0 license.
Citation Information
to be determined