Dataset asset | Open Source Community | Natural Language Processing | Spam Filtering

talby/spamassassin

The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. Messages are labeled as spam or ham and further divided into the groups hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Records carry fields such as label, group, text, and raw; only a train split is provided.

Source
hugging_face
Created
Nov 28, 2025
Updated
Jul 11, 2023

Dataset Overview

Dataset Name

SpamAssassin Public Email Corpus

Dataset Description

This is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems.

Dataset Structure

Data Instances

  • The text configuration normalizes all character sets to UTF‑8 and dumps the MIME tree as a list of JSON lists.
  • The unprocessed configuration leaves each message unparsed, preserving the full header and body in binary format.
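
As an illustration of how the two configurations might be selected, the following is a minimal sketch that assumes the Hugging Face datasets library is installed and that the configuration names match those listed above. The helper names resolve_config and load_spamassassin are hypothetical and not part of the dataset.

```python
# Configuration names as listed on this card.
CONFIGS = ("text", "unprocessed")


def resolve_config(config: str) -> str:
    """Return the configuration name if it is one of the documented ones."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    return config


def load_spamassassin(config: str = "text"):
    """Load the train split (the only split provided) for one configuration.

    Hypothetical helper: requires the third-party `datasets` package and
    network access on first use.
    """
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("talby/spamassassin", resolve_config(config), split="train")
```

Calling load_spamassassin("text") would download the data on first use, so it needs network access; passing "unprocessed" instead selects the configuration that keeps the messages unparsed.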

Data Fields

  • label: whether the message is marked as spam or ham
  • group: samples are classified by SpamAssassin into {hard_ham, spam_2, spam, easy_ham, easy_ham_2}
  • text: normalized email body text
  • raw: full binary header and body of the email
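
To show how the label and group fields relate, here is a small sketch that tallies records per (label, group) pair. The sample rows are invented for illustration, and the labels are assumed to be plain strings; the hosted dataset may encode them differently (for example as integer class ids).

```python
from collections import Counter


def count_by_label_and_group(rows):
    """Count records per (label, group) pair in any iterable of dict-like rows."""
    return Counter((row["label"], row["group"]) for row in rows)


# Hypothetical rows that only mimic the documented field names.
sample_rows = [
    {"label": "ham", "group": "easy_ham", "text": "Meeting moved to 3pm."},
    {"label": "ham", "group": "hard_ham", "text": "Your monthly newsletter is here."},
    {"label": "spam", "group": "spam_2", "text": "You have won a prize!"},
    {"label": "spam", "group": "spam_2", "text": "Cheap loans, act now."},
]

counts = count_by_label_and_group(sample_rows)
for (label, group), n in sorted(counts.items()):
    print(f"{label:5s} {group:12s} {n}")
```

The same function should work on rows from a loaded dataset, since it only relies on the label and group keys.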

Data Split

Only the train split is provided.

Dataset Creation

Selection Rationale

The dataset is intended to help verify whether modern NLP tools can solve legacy NLP problems.

Source Data

Initial Data Collection and Normalization

The upstream corpus description details the collection method. Text body reconstruction primarily uses email.parser and ftfy.
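
The reconstruction step can be approximated as follows. This is a rough sketch of how email.parser and ftfy might be combined, not the dataset author's actual pipeline: the function name extract_text is hypothetical, only text/plain parts are kept, and ftfy is treated as optional so the example still runs with the standard library alone.

```python
from email import policy
from email.parser import BytesParser

try:
    import ftfy  # optional third-party package for repairing mis-decoded text
except ImportError:
    ftfy = None


def extract_text(raw: bytes) -> str:
    """Parse a raw message and return its text/plain parts as one string."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parts = []
    # walk() yields the message itself and, for multipart mail, every sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # get_content() decodes the payload using the declared charset.
            parts.append(part.get_content())
    body = "\n".join(parts)
    # Assumption: ftfy.fix_text is applied to the decoded body when available.
    return ftfy.fix_text(body) if ftfy is not None else body


# Invented example message.
raw_message = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: Hi\n"
    b'Content-Type: text/plain; charset="utf-8"\n'
    b"\n"
    b"Hello, world!\n"
)
body_text = extract_text(raw_message)
```

Real corpus messages often declare missing or incorrect charsets, so a production version would likely need error handling around decoding that this sketch omits.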

License

Unknown
