CSDMC2010 SPAM corpus
This dataset is a collection of email messages split into a training set and a test set. Because it was prepared for a competition, only the training set is labeled; the test set is unlabeled. The training set contains 2,929 legitimate (ham) emails and 1,378 spam emails.
Description
Dataset Overview
Dataset Name
Spam-Email-Classifier-DataSet
Original Dataset
CSDMC2010 SPAM corpus
Dataset Contents
- Training Data: Includes 2,929 legitimate emails (ham) and 1,378 spam emails.
- Test Data: Unlabeled.
Data Processing Tools
- convert.py: Removes HTML tags from .eml files.
- move.sh: Moves emails into the "./ham/" and "./spam/" folders according to their labels.
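The source of convert.py is not shown here, so the following is only a minimal sketch of how HTML tags might be removed from an .eml message using the Python standard library. The names `strip_html` and `eml_to_text` are hypothetical and chosen for illustration; the actual script may work differently.

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags around them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Note: text inside <script> and <style> is also reported as data.
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def eml_to_text(raw_eml):
    """Return the text parts of an .eml message with HTML tags removed."""
    msg = message_from_string(raw_eml)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the header: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        if part.get_content_subtype() == "html":
            text = strip_html(text)
        parts.append(text)
    return "\n".join(parts)
```

Using a real HTML parser rather than a regular expression tends to cope better with attributes that contain `>` characters, although malformed markup in spam can still leave fragments behind.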
Output
- ham.zip
- spam.zip
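As a rough illustration of how the two archives could be produced, the sketch below sorts files by label and then zips each folder. It is written in Python rather than shell, and it copies files instead of moving them so that the originals stay in place. The label format (one `<label> <filename>` pair per line, with `1` meaning ham) is an assumption, and `parse_labels` and `sort_and_zip` are hypothetical names; check both against the actual label file before relying on this.

```python
import shutil
from pathlib import Path


def parse_labels(label_text, ham_label="1"):
    """Map each filename to "ham" or "spam".

    Assumes lines of the form "<label> <filename>"; the meaning of
    ham_label is an assumption, not confirmed by the dataset page.
    """
    mapping = {}
    for line in label_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        label, filename = fields
        mapping[filename] = "ham" if label == ham_label else "spam"
    return mapping


def sort_and_zip(src_dir, out_dir, mapping):
    """Copy labeled files into out_dir/ham and out_dir/spam, then zip each."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    for filename, folder in mapping.items():
        source = src_dir / filename
        if not source.is_file():
            continue  # label refers to a file that is not present
        target_dir = out_dir / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / filename)

    archives = []
    for folder in ("ham", "spam"):
        if (out_dir / folder).is_dir():
            # Creates out_dir/ham.zip or out_dir/spam.zip.
            archives.append(
                shutil.make_archive(
                    str(out_dir / folder), "zip",
                    root_dir=out_dir, base_dir=folder,
                )
            )
    return archives
```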
Issues
Some stray symbols (e.g., < or >) remain in the processed files.
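One possible cleanup pass, assuming the leftover symbols are isolated angle brackets that carry no meaning, is sketched below. The function name `drop_stray_brackets` is hypothetical. Note that this also removes legitimate uses such as "a < b", so it may not suit every downstream task.

```python
import re


def drop_stray_brackets(text):
    """Replace every < or > with a space, then collapse repeated spaces."""
    text = re.sub(r"[<>]", " ", text)
    # Collapse runs of spaces and tabs but keep line breaks intact.
    return re.sub(r"[ \t]+", " ", text).strip()
```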
Suggestions for Improvement
Suggestions and proposals for improvement are welcome.
Source
Organization: github
Created: 9/18/2016