Explore high-quality datasets for your AI and machine learning projects.
After the Enron scandal in the United States, the Federal Energy Regulatory Commission released a dataset of about 600,000 emails from 158 employees. The dataset was later purchased and processed by MIT, with some attachments removed. Different versions remain available from the Library of Congress and various websites. A commonly used subset was created by researchers at the Institute of Informatics and Telecommunications in Greece to analyze and test spam filters, including several Naïve Bayes variants. The current CSV file contains this subset: 33,716 emails, of which 17,171 are spam and the remaining 16,545 are legitimate (ham). Each row holds a single text field that concatenates the subject and body, plus a separate column with the original filename.
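As a rough illustration of the kind of Naïve Bayes spam filter this subset is typically used to evaluate, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, using only the Python standard library. The function names (`tokenize`, `train`, `predict`) and the four toy messages are invented for the example; they are not taken from the dataset or from the original researchers' filters, and a real experiment would read the subject-and-body text and labels from the CSV instead.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real filter would handle punctuation too.
    return text.lower().split()

def train(messages):
    """Fit counts from (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label with the higher log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy messages, standing in for rows of the CSV.
toy = [
    ("win a free prize now", "spam"),
    ("cheap meds free shipping", "spam"),
    ("meeting schedule for the gas desk", "ham"),
    ("please review the attached contract", "ham"),
]
model = train(toy)
print(predict(model, "free prize"))
print(predict(model, "review the contract"))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once messages are as long as real emails.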
This dataset consists of email messages split into a training set and a test set. Because it was prepared for a competition, only the training set is labeled; the test set is unlabeled. The training set contains 2,929 legitimate (ham) emails and 1,378 spam emails, 4,307 in total.
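The training set is noticeably imbalanced, which affects how accuracy figures should be read. The short sketch below works out the class proportions from the counts stated above; the variable names are illustrative only.

```python
# Label counts for the training set, as stated in the description.
ham, spam = 2929, 1378

total = ham + spam
spam_fraction = spam / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(ham, spam) / total

print(f"{total} training emails, {spam_fraction:.1%} spam")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.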