talby/spamassassin

The SpamAssassin public email corpus is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems. Messages are labeled as spam or ham and further grouped into hard_ham, spam_2, spam, easy_ham, and easy_ham_2. Records carry the fields label, group, text, and raw; only a train split is provided.

Updated 7/11/2023
hugging_face

Description

Dataset Overview

Dataset Name

SpamAssassin Public Email Corpus

Dataset Description

This is a collection of email messages assembled by members of the SpamAssassin project, suitable for testing spam‑filtering systems.

Dataset Structure

Data Instances

  • The text configuration normalizes all character sets to UTF‑8 and dumps the MIME tree as a list of JSON lists.
  • The unprocessed configuration leaves each message unparsed, preserving the full header and body as raw bytes.
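
As a rough illustration of choosing between the two configurations, the sketch below wraps the Hugging Face datasets loader. The repository id talby/spamassassin and the configuration names text and unprocessed are taken from this card; the helper name load_corpus is hypothetical, and the call needs the datasets package and network access.

```python
# Configuration names as listed on this card.
CONFIGS = ("text", "unprocessed")


def load_corpus(config: str = "text"):
    """Load one configuration of talby/spamassassin (hypothetical helper).

    Requires the `datasets` package and network access when called.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Only a train split is provided, so it is requested explicitly.
    return load_dataset("talby/spamassassin", config, split="train")
```

Calling load_corpus("unprocessed") would then return the unparsed messages, while the default returns the UTF‑8 normalized variant.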

Data Fields

  • label: whether the message is spam or ham
  • group: the SpamAssassin group the message is classified into, one of {hard_ham, spam_2, spam, easy_ham, easy_ham_2}
  • text: normalized email body text
  • raw: full header and body of the email as raw bytes
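
To make the field layout concrete, here is a minimal sketch that tallies and filters records by label and group. The records are invented for illustration and are not taken from the corpus; it also assumes label is stored as a plain string, whereas the hosted dataset may encode it as an integer class.

```python
from collections import Counter

# Toy records, invented for illustration; field names follow the card.
records = [
    {"label": "ham", "group": "easy_ham", "text": "Meeting moved to 3pm."},
    {"label": "ham", "group": "hard_ham", "text": "Limited-time newsletter offer inside."},
    {"label": "spam", "group": "spam", "text": "You have won a prize!"},
    {"label": "spam", "group": "spam_2", "text": "Cheap loans, act now."},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(rec[field] for rec in records)


def select_group(records, group):
    """Return only the records that belong to the given group."""
    return [rec for rec in records if rec["group"] == group]
```

For example, select_group(records, "hard_ham") isolates the ham messages that are hardest to tell apart from spam, which is often the most informative subset when evaluating a filter.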

Data Split

Only the train split is provided.
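
Because no test split ships with the dataset, evaluation needs a held-out portion carved from train. The following is one possible sketch using only the standard library; the function name stratified_split, the 20% default, and the fixed seed are assumptions made for illustration, not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, key="label", test_fraction=0.2, seed=0):
    """Split records into (train, test), keeping each `key` value's proportion."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)

    train, test = [], []
    # Sort bucket keys so the result does not depend on input ordering of classes.
    for _, items in sorted(buckets.items(), key=lambda kv: str(kv[0])):
        items = list(items)
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

Stratifying on label keeps the spam-to-ham ratio roughly equal in both parts; passing key="group" instead would preserve the balance of the five groups.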

Dataset Creation

Selection Rationale

The dataset is intended to help test whether modern NLP tools can solve legacy NLP problems such as spam filtering.

Source Data

Initial Data Collection and Normalization

The upstream corpus description details how the messages were collected. The text bodies are reconstructed mainly with Python's email.parser module and the ftfy library.
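
The dataset's actual reconstruction code is not shown here, so the following is only a sketch of how a text body might be rebuilt from raw bytes with email.parser, with ftfy applied when it is installed. The function name extract_text and the fallback decoding are assumptions, not the author's pipeline.

```python
from email import policy
from email.parser import BytesParser


def extract_text(raw: bytes) -> str:
    """Join the decoded text/* parts of a raw message (illustrative only)."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, attachments, images
        try:
            parts.append(part.get_content())
        except (LookupError, UnicodeDecodeError):
            # Unknown or mislabeled charset: fall back to lossy UTF-8 decoding.
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("utf-8", errors="replace"))
    text = "\n".join(parts)
    try:
        import ftfy  # optional third-party dependency

        text = ftfy.fix_text(text)  # repair mojibake and similar encoding damage
    except ImportError:
        pass
    return text
```

Applied to the raw field of the unprocessed configuration, a function like this would yield something comparable to, though not necessarily identical with, the text field.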

License

Unknown

Topics

Spam Filtering
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
