CSDMC2010 SPAM corpus

This dataset is a collection of email messages split into a training set and a test set. Because it was released for a competition, only the training set is labeled; the test set is unlabeled. The training set contains 2,929 legitimate (ham) emails and 1,378 spam emails, for 4,307 labeled messages in total.

Updated 4/29/2024

Description

Dataset Overview

Dataset Name

Spam-Email-Classifier-DataSet

Original Dataset

CSDMC2010 SPAM corpus

Dataset Contents

Training Data: Includes 2,929 legitimate emails (ham) and 1,378 spam emails.
Test Data: Unlabeled.

Data Processing Tools

convert.py: Removes HTML tags from .eml files.
move.sh: Moves emails into the "./ham/" and "./spam/" folders according to their labels.
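
The listing does not show the contents of convert.py, so the following is only a minimal sketch of how HTML tags could be stripped from an .eml message using the Python standard library. The names strip_html and eml_to_text are hypothetical and are not taken from the repository; the fallback decoding is also an assumption.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html: str) -> str:
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())


def eml_to_text(raw: bytes) -> str:
    """Extract the body of a raw .eml message as plain text.

    Assumption: the plain-text part is preferred when present; an HTML
    part is used otherwise and passed through strip_html.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    try:
        text = body.get_content()
    except LookupError:
        # Unknown charset in the headers: fall back to a lossy decode.
        text = body.get_payload(decode=True).decode("latin-1")
    if body.get_content_type() == "text/html":
        text = strip_html(text)
    return text
```

Note that this sketch keeps the text inside script and style elements, since HTMLParser reports it as ordinary data; a real converter may want to skip those elements.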

Output

ham.zip
spam.zip
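
As an illustration of how the two archives might be consumed, the sketch below reads every member of ham.zip and spam.zip into (text, label) pairs. It assumes that each archive member is a single processed message; the internal layout of the archives is not described in the listing, and the function names are hypothetical.

```python
import zipfile


def read_archive(path, label):
    """Return a list of (text, label) pairs, one per file in the archive."""
    samples = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # Assumption: messages may contain invalid bytes, so decode leniently.
            text = zf.read(info).decode("utf-8", errors="replace")
            samples.append((text, label))
    return samples


def load_dataset(ham_zip="ham.zip", spam_zip="spam.zip"):
    """Combine both archives into one labeled list."""
    return read_archive(ham_zip, "ham") + read_archive(spam_zip, "spam")
```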

Issues

Some meaningless symbols (e.g., < or >) remain in the files.
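
One possible cleanup pass for these leftover symbols is sketched below. It simply replaces any remaining angle brackets with a space and collapses whitespace; this is an assumption about what "meaningless" covers, and it would also remove brackets that are legitimately part of a message, such as those around email addresses.

```python
import re


def remove_stray_brackets(text: str) -> str:
    """Replace leftover '<' and '>' characters with spaces and tidy whitespace."""
    without_brackets = re.sub(r"[<>]+", " ", text)
    return " ".join(without_brackets.split())
```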

Suggestions for Improvement

Suggestions and proposed improvements are welcome.

Topics

Spam Detection

Data Classification

Source

Organization: github

Created: 9/18/2016
