phishing_dataset
Contains 500 phishing sites collected from PhishTank and 500 legitimate sites collected from Alexa. The dataset is split 70%/30% into training and test sets.
Updated 12/16/2023
github
Description
phishing_dataset Overview
Dataset Composition
- Collected 500 phishing sites sourced from PhishTank.
- Collected 500 legitimate sites sourced from Alexa.
- The dataset is split with a 70%/30% train‑test ratio.
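The 70%/30% split can be illustrated with a minimal sketch. The source does not say how the split was performed (shuffled, stratified, or fixed), so the seeded shuffle, the function name `train_test_split`, and the placeholder records below are all assumptions made for illustration.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `samples` and cut it into train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder records: (site_id, label), 1 = phishing, 0 = legitimate.
samples = [(f"phish_{i}", 1) for i in range(500)]
samples += [(f"legit_{i}", 0) for i in range(500)]

train, test = train_test_split(samples)
print(len(train), len(test))  # 700 300
```

A plain shuffle does not guarantee the same class balance in both parts; a stratified split would, but the source does not state which was used.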
Feature Description
URL Features
- Domain similarity: Similarity between the accessed website's domain and the domain of URLs obtained from Alexa or PhishTank, computed using the Ratcliff‑Obershelp algorithm.
- URL length: Number of characters in the URL.
- Protocol type: HTTP (0) or HTTPS (1).
- Number of '.' characters: Count of dot symbols in the URL.
- Number of '/' characters: Count of slash symbols in the URL.
- Number of '//' sequences: Count of double‑slash symbols in the URL.
- Number of '-' characters: Count of hyphen symbols.
- Number of '_' characters: Count of underscore symbols.
- Number of '=' characters: Count of equal signs.
- Number of '(' and ')' characters: Count of parentheses.
- Number of '{' and '}' characters: Count of curly braces.
- Number of '[' and ']' characters: Count of square brackets.
- Number of '<' and '>' characters: Count of angle brackets.
- Number of '~' characters: Count of tilde symbols.
- Number of '*' characters: Count of asterisks.
- Number of '+' characters: Count of plus signs.
- Presence of '@' symbol: Whether the URL contains '@' (1 = yes, 0 = no).
- Presence of IP address: Whether the URL contains an IP address (1 = yes, 0 = no).
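The URL features listed above could be computed roughly as follows. This is a sketch under assumptions, not the dataset authors' code: the feature names are hypothetical, the IP check only inspects the host part of the URL, and the similarity score uses Python's `difflib.SequenceMatcher`, whose algorithm is closely related to (but, per the Python documentation, not identical to) Ratcliff/Obershelp gestalt pattern matching.

```python
import ipaddress
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Hypothetical feature names mapped to the characters they count.
COUNT_GROUPS = {
    "dots": ".", "slashes": "/", "hyphens": "-", "underscores": "_",
    "equals": "=", "parentheses": "()", "curly_braces": "{}",
    "square_brackets": "[]", "angle_brackets": "<>", "tildes": "~",
    "asterisks": "*", "pluses": "+",
}


def domain_similarity(domain, reference_domain):
    """Similarity in [0, 1] between two domain strings."""
    return SequenceMatcher(None, domain, reference_domain).ratio()


def host_is_ip(url):
    """Return 1 if the host part of the URL is an IPv4/IPv6 literal, else 0."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def url_features(url):
    """Build a dict of the lexical URL features described in the list."""
    features = {
        "length": len(url),
        "https": 1 if urlsplit(url).scheme.lower() == "https" else 0,
        "double_slashes": url.count("//"),
        "has_at": 1 if "@" in url else 0,
        "has_ip": host_is_ip(url),
    }
    for name, chars in COUNT_GROUPS.items():
        features[name] = sum(url.count(ch) for ch in chars)
    return features


example = url_features("http://192.168.0.1/login.php?user=a@b")
print(example["dots"], example["has_ip"], example["has_at"])
print(domain_similarity("paypa1.com", "paypal.com"))
```

Paired brackets are summed into one feature here because the list names them together; counting each character separately would be an equally plausible reading.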
HTML Features
- Number of <a> tags: Count of <a> tags used to create hyperlinks or anchor links.
- Number of form-element tags: Count of tags used for various form elements.
- Number of
- Number of <link> tags: Count of <link> tags for linking external resources such as stylesheets and icons.
- Number of <iframe> tags: Count of <iframe> tags for embedding external resources such as other HTML documents or video.
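Tag counts like those above can be gathered with Python's standard `html.parser` module. The sketch below is illustrative only: the tag names `a`, `link`, and `iframe` are assumptions inferred from the feature descriptions, and `TagCounter` and `count_tags` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <iFrame> is counted as "iframe".
        self.counts[tag] += 1


def count_tags(html, tags=("a", "link", "iframe")):
    """Return {tag: number of opening tags} for the requested tag names."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in tags}


page = (
    '<html><head><link rel="stylesheet" href="s.css"></head>'
    '<body><a href="/x">x</a><a href="/y">y</a>'
    '<iFrame src="z.html"></iFrame></body></html>'
)
print(count_tags(page))  # {'a': 2, 'link': 1, 'iframe': 1}
```

Other tags can be counted by passing a different `tags` tuple, for example `count_tags(page, tags=("input",))`.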
HTTP Features
- HTTP response history: HTTP response codes returned by the server, indicating the result of client requests.
- Redirect status: Whether the site redirects to another site (1 = redirect, 0 = no redirect), detected via HTTP redirect response codes.
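One way to derive the redirect flag from a response history is sketched below. The set of status codes treated as redirects and both function names are assumptions; the source does not say which codes were used. `fetch_response_history` relies on the third-party `requests` library, whose `Response.history` attribute lists the intermediate responses that were followed.

```python
# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_status(status_codes):
    """Return 1 if any status code in the chain is a redirect, else 0."""
    return 1 if any(code in REDIRECT_CODES for code in status_codes) else 0


def fetch_response_history(url, timeout=10):
    """Return the status code of every hop, with the final response last."""
    import requests  # third-party; imported lazily so redirect_status works without it

    response = requests.get(url, timeout=timeout, allow_redirects=True)
    return [hop.status_code for hop in response.history] + [response.status_code]


print(redirect_status([301, 200]))  # 1
print(redirect_status([200]))       # 0
```

Keeping the network call separate from the flag computation lets the flag be recomputed from stored response codes without contacting the site again.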
References
- Kapan, S.; Sora Gunal, E. Improved Phishing Attack Detection with Machine Learning: A Comprehensive Evaluation of Classifiers and Features. Appl. Sci. 2023, 13, 13269.
Topics
Phishing Website Detection
Cybersecurity
Source
Organization: github
Created: 1/9/2023