talby/spamassassin
The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. The messages are labeled as spam or ham and further divided into the groups hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Each record has the fields label, group, text, and raw; only a train split is provided.
Description
Dataset Overview
Dataset Name
SpamAssassin Public Email Corpus
Dataset Description
This is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems.
Dataset Structure
Data Instances
The text configuration normalizes all character sets to UTF‑8 and dumps the MIME tree as a list of JSON lists.
The unprocessed configuration leaves the message unparsed, preserving the full header and body in binary format.
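To make the idea of a MIME tree dumped as nested JSON lists more concrete, here is a minimal sketch using only the Python standard library. The node layout ([content type, payload]), the function name mime_tree, and the SAMPLE message are all assumptions made for illustration; the dataset's actual JSON layout is not documented here and may differ.

```python
import json
from email import message_from_bytes
from email.message import Message


def mime_tree(msg: Message) -> list:
    """Return [content_type, payload]; payload is text or a list of child nodes.

    Hypothetical layout for illustration, not the dataset's documented format.
    """
    if msg.is_multipart():
        return [msg.get_content_type(), [mime_tree(p) for p in msg.get_payload()]]
    raw = msg.get_payload(decode=True) or b""
    charset = msg.get_content_charset() or "utf-8"
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name in the header: fall back to UTF-8.
        text = raw.decode("utf-8", errors="replace")
    return [msg.get_content_type(), text]


# Made-up two-part message with a Latin-1 encoded plain-text part.
SAMPLE = (
    b"From: a@example.com\r\n"
    b"Subject: hi\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"\r\n"
    b"caf\xe9\r\n"
    b"--b1\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b"<p>hi</p>\r\n"
    b"--b1--\r\n"
)

tree = mime_tree(message_from_bytes(SAMPLE))
# ensure_ascii=False keeps the decoded text as real UTF-8 characters.
print(json.dumps(tree, ensure_ascii=False))
```

Every leaf is decoded to a Python string before serialization, so the JSON output is UTF‑8 regardless of the charset each part declared.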
Data Fields
label: marked as spam or ham
group: samples are classified by SpamAssassin into {hard_ham, spam_2, spam, easy_ham, easy_ham_2}
text: normalized email body text
raw: full binary header and body of the email
Data Split
Only the train split is provided.
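A minimal sketch of loading the train split and tallying records by label and group, assuming the Hugging Face datasets library is installed and that the repository id talby/spamassassin and the configuration names text and unprocessed match the description above. The helper names load_train_split and tally are hypothetical, and label may be stored as an integer class index rather than a string.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_train_split(config: str = "text"):
    """Load the single train split; needs the third-party `datasets` package and network access."""
    from datasets import load_dataset

    return load_dataset("talby/spamassassin", config, split="train")


def tally(records: Iterable[Mapping]) -> Counter:
    """Count records per (label, group) pair."""
    return Counter((r["label"], r["group"]) for r in records)
```

Calling tally(load_train_split()) would then give the size of each group; tally also accepts any iterable of dict-like records, which makes it easy to try on a small hand-made sample first.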
Dataset Creation
Selection Rationale
The dataset is intended to help verify whether modern NLP tools can solve legacy NLP problems.
Source Data
Initial Data Collection and Normalization
The upstream corpus description details the collection method. Text body reconstruction primarily uses Python's email.parser module and the ftfy library.
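As a rough illustration of how email.parser and ftfy can be combined to recover a text body, consider the sketch below. It is not the dataset author's actual pipeline: the function name body_text, the preference for plain text over HTML, and the fallback decoding are assumptions. ftfy is treated as optional so the sketch still runs without it.

```python
from email import policy
from email.parser import BytesParser

try:
    from ftfy import fix_text  # third-party; repairs mojibake and similar damage
except ImportError:
    def fix_text(text: str) -> str:
        # Fallback when ftfy is not installed: return the text unchanged.
        return text


def body_text(raw: bytes) -> str:
    """Parse a raw message and return its main text body as a str."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    # Assumption: prefer the plain-text part, fall back to HTML.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    try:
        content = part.get_content()
    except (LookupError, UnicodeError, KeyError):
        # Unknown or broken charset declaration: decode leniently as UTF-8.
        payload = part.get_payload(decode=True) or b""
        content = payload.decode("utf-8", errors="replace")
    return fix_text(content)


# Made-up single-part message for demonstration.
RAW = (
    b"From: a@example.com\r\n"
    b"Subject: test\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Hello world\r\n"
)
print(body_text(RAW))
```

Old spam corpora often contain malformed headers and undeclared or wrong charsets, which is why the sketch decodes leniently instead of raising.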
License
Unknown
Source
Organization: hugging_face
Created: Unknown