bvk/ENRON-spam
After the Enron scandal in the United States, the Federal Energy Regulatory Commission released a dataset of 600,000 emails from 158 employees. The dataset was later purchased and processed by MIT, with some attachments removed. Versions of the dataset remain available at the Library of Congress and at https://www.cs.cmu.edu/~./enron/. Researchers at the Institute of Informatics and Telecommunications of Greece created a commonly used subset for analyzing and testing spam filters, including several Naïve Bayes variants. This CSV file contains that subset: 33,716 emails, of which 17,171 are spam. The file includes a concatenated subject-and-body field and a separate column for the original filename.
Description
Enron Email Dataset
Overview
- Source: The dataset originates from 600,000 emails released by the U.S. Federal Energy Regulatory Commission, involving 158 employees. It was later purchased and processed by MIT, with some attachments deleted or edited.
- Versions: Versions of the dataset are available at the Library of Congress and https://www.cs.cmu.edu/~./enron/.
Subset
- Subset Source: Multiple subsets of the dataset can be found online, including on GitHub, HuggingFace, and Kaggle.
- Specific Subset: Researchers from the Institute of Informatics and Telecommunications of Greece described a commonly used subset in their paper [Metsis]. This subset draws on the mailboxes of six Enron employees with large email volumes and contains 33,716 emails, of which 17,171 are spam.
Data Content
- File Format: CSV file.
- Fields: Includes a concatenated subject‑and‑body field and a separate original filename column.
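A minimal sketch of reading a file with this layout, using only the Python standard library. The column names `text`, `label`, and `filename` are assumptions made for illustration, since the card does not list the actual headers; check the first row of the real CSV and adjust the constants. The in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical column names: the dataset card does not give the real headers.
TEXT_COL = "text"        # concatenated subject-and-body field
LABEL_COL = "label"      # assumed spam/ham label column
FILE_COL = "filename"    # original filename column

def read_emails(handle):
    """Yield (text, label, filename) tuples from an open CSV file handle."""
    for row in csv.DictReader(handle):
        yield row[TEXT_COL], row[LABEL_COL], row[FILE_COL]

def class_counts(rows):
    """Count how many rows carry each label."""
    return Counter(label for _, label, _ in rows)

# Tiny in-memory stand-in for the real CSV file.
sample = io.StringIO(
    "text,label,filename\n"
    '"meeting at 3pm, see agenda",ham,0001.txt\n'
    '"cheap meds, buy now",spam,0002.txt\n'
    '"win money now",spam,0003.txt\n'
)
rows = list(read_emails(sample))
counts = class_counts(rows)
print(counts)
```

For the real file, the handle would come from `open(path, newline="", encoding="utf-8")`. Very long email bodies may exceed the `csv` module's default field size, in which case `csv.field_size_limit` can be raised before reading.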
Research Purpose
- Research Direction: Used to analyze and test various spam filters, including multiple Naïve Bayes versions.
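To illustrate the kind of filter this subset is used to test, here is a rough sketch of one Naïve Bayes variant: a multinomial model over raw term counts with Laplace smoothing. It is not the implementation from [Metsis]; the whitespace tokenizer, the `alpha` parameter, and the toy training messages are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return text.lower().split()

def train(examples, alpha=1.0):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    doc_counts = Counter()               # number of documents per label
    word_counts = defaultdict(Counter)   # token counts per label
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    total_docs = sum(doc_counts.values())
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        "log_prior": {c: math.log(n / total_docs) for c, n in doc_counts.items()},
        "totals": {c: sum(word_counts[c].values()) for c in doc_counts},
    }

def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    alpha, vocab_size = model["alpha"], len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        denom = model["totals"][label] + alpha * vocab_size
        for tok in tokenize(text):
            if tok not in model["vocab"]:
                continue  # ignore tokens never seen in training
            score += math.log((model["word_counts"][label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
train_data = [
    ("cheap meds buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(train_data)
print(predict(model, "buy cheap meds"))
print(predict(model, "meeting on monday"))
```

Working in log space avoids numeric underflow on long emails, and the smoothing term keeps a token seen in only one class from driving the other class's probability to zero.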
References
- [Metsis] Metsis, V., Androutsopoulos, I., & Paliouras, G. "Spam Filtering with Naive Bayes - Which Naive Bayes?" Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
Source
Organization: hugging_face
Created: Unknown