phishing_dataset

The dataset contains 500 phishing sites collected from PhishTank and 500 legitimate sites collected from Alexa. It is split 70% for training and 30% for testing.

Updated 12/16/2023

Description

phishing_dataset Overview

Dataset Composition

  • 500 phishing sites collected from PhishTank.
  • 500 legitimate sites collected from Alexa.
  • The data is split 70%/30% into training and test sets.
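A 70%/30% split over 1,000 samples can be sketched as follows. This is only an illustration: the source does not say how the split was drawn, so the per-class (stratified) shuffling, the seed, the function name `split_70_30`, and the label convention (1 = phishing, 0 = legitimate) are all assumptions.

```python
import random


def split_70_30(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle each class separately and cut it at train_fraction.

    Assumption: a stratified split, so both sets keep the 50/50 class balance.
    Returns two lists of (sample, label) pairs.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    train, test = [], []
    for label, items in by_class.items():
        items = items[:]              # copy so the caller's data is not reordered
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(s, label) for s in items[:cut]]
        test += [(s, label) for s in items[cut:]]

    # Mix the classes so neither set is ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder identifiers standing in for the 500 + 500 collected sites.
samples = [f"phish-{i}" for i in range(500)] + [f"legit-{i}" for i in range(500)]
labels = [1] * 500 + [0] * 500
train_set, test_set = split_70_30(samples, labels)
```

With 500 sites per class, this yields 700 training and 300 test samples.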

Feature Description

URL Features

  1. Domain similarity: Similarity between the visited site's domain and the domains of URLs obtained from Alexa or PhishTank, computed with the Ratcliff‑Obershelp algorithm.
  2. URL length: Number of characters in the URL.
  3. HTTP protocol type: HTTP (0) or HTTPS (1).
  4. Number of '.' characters: Count of dot symbols in the URL.
  5. Number of '/' characters: Count of slash symbols in the URL.
  6. Number of '//' sequences: Count of double‑slash symbols in the URL.
  7. Number of '-' characters: Count of hyphen symbols.
  8. Number of '_' characters: Count of underscore symbols.
  9. Number of '=' characters: Count of equal signs.
  10. Number of '(' and ')' characters: Count of parentheses.
  11. Number of '{' and '}' characters: Count of curly braces.
  12. Number of '[' and ']' characters: Count of square brackets.
  13. Number of '<' and '>' characters: Count of angle brackets.
  14. Number of '~' characters: Count of tilde symbols.
  15. Number of '*' characters: Count of asterisks.
  16. Number of '+' characters: Count of plus signs.
  17. Presence of '@' symbol: Whether the URL contains '@' (1 = yes, 0 = no).
  18. Presence of IP address: Whether the URL contains an IP address (1 = yes, 0 = no).
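The URL features listed above could be extracted along the following lines. This is a minimal sketch, not the dataset authors' code. Python's `difflib.SequenceMatcher` is used because its matching is based on the Ratcliff-Obershelp "gestalt pattern matching" approach. The feature names, the choice to take the highest similarity over the reference domains, and the IPv4-only pattern are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Assumption: "IP address" means a dotted-quad IPv4 literal anywhere in the URL.
# The pattern does not check that each octet is in the range 0-255.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def domain_similarity(domain, reference_domains):
    """Highest difflib ratio between `domain` and any reference domain.

    Assumption: the source does not say how multiple reference domains are
    combined, so the maximum is used here. Returns 0.0 for an empty list.
    """
    return max(
        (SequenceMatcher(None, domain, ref, autojunk=False).ratio()
         for ref in reference_domains),
        default=0.0,
    )


def count_any(url, chars):
    """Total number of occurrences of any character in `chars`."""
    return sum(url.count(c) for c in chars)


def url_features(url, reference_domains=()):
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "domain_similarity": domain_similarity(host, reference_domains),
        "url_length": len(url),
        "is_https": int(parts.scheme == "https"),   # 0 = HTTP, 1 = HTTPS
        "dots": url.count("."),
        "slashes": url.count("/"),
        "double_slashes": url.count("//"),
        "hyphens": url.count("-"),
        "underscores": url.count("_"),
        "equals": url.count("="),
        "parentheses": count_any(url, "()"),
        "curly_braces": count_any(url, "{}"),
        "square_brackets": count_any(url, "[]"),
        "angle_brackets": count_any(url, "<>"),
        "tildes": url.count("~"),
        "asterisks": url.count("*"),
        "plus_signs": url.count("+"),
        "has_at": int("@" in url),
        "has_ip": int(IPV4_PATTERN.search(url) is not None),
    }


features = url_features("http://192.168.0.1/a-b/c_d?x=1", ["example.com"])
```

Note that `"slashes"` also counts the two slashes that make up each `//` sequence, which matches the literal wording of features 5 and 6.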

HTML Features

  1. Number of <a> tags: Count of <a> tags used to create hyperlinks or anchor links.
  2. Number of <input> tags: Count of <input> tags used for various form elements.
  3. Number of
  4. Number of <link> tags: Count of <link> tags for linking external resources such as stylesheets, icons, etc.
  5. Number of <iframe> tags: Count of <iframe> tags for embedding external resources such as other HTML documents, video, etc.
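Tag counts of this kind can be computed with Python's standard `html.parser` module. The sketch below is illustrative only. The default tag names (`a`, `input`, `link`, `iframe`) are inferred from the feature descriptions and are an assumption, as are the class and function names. The tag list is a parameter, so other tags can be counted the same way.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags; HTMLParser lowercases tag names before the callback."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input />, because the
        # default handle_startendtag delegates to handle_starttag.
        self.counts[tag] += 1


def html_tag_counts(html, tags=("a", "input", "link", "iframe")):
    """Return {tag: count} for the requested tag names."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in tags}


page = (
    '<html><head><link rel="stylesheet" href="s.css"></head><body>'
    '<a href="#">x</a><A href="/y">y</A>'
    '<form><input type="text"><input type="password"/></form>'
    '<iFrame src="z"></iFrame></body></html>'
)
counts = html_tag_counts(page)
```

Because tag names are normalized to lowercase, `<A>` and `<iFrame>` are counted together with `<a>` and `<iframe>`.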

HTTP Features

  1. HTTP response history: HTTP response codes returned by the server, indicating the result of client requests.
  2. Redirect status: Whether the site redirects to another site (1 = redirect, 0 = no redirect), detected via HTTP redirect response codes.
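The two HTTP features can be derived from the sequence of status codes seen while fetching a site. The sketch below takes that sequence as input and makes no network request. How the codes are collected is left open: with the `requests` library, for example, the intermediate responses are available in `response.history`. The set of codes treated as redirects and the function name `http_features` are assumptions.

```python
# Assumption: these status codes are the ones treated as redirects.
# 304 (Not Modified) is left out because it does not send the client elsewhere.
REDIRECT_CODES = frozenset({301, 302, 303, 307, 308})


def http_features(status_history):
    """Build the two HTTP features from status codes in the order received."""
    history = list(status_history)
    return {
        "response_history": history,
        # 1 = at least one redirect response was seen, 0 = none.
        "redirected": int(any(code in REDIRECT_CODES for code in history)),
    }


# Example: a site that redirects twice before serving the final page.
example = http_features([301, 302, 200])
```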

References

  • Kapan, S.; Sora Gunal, E. Improved Phishing Attack Detection with Machine Learning: A Comprehensive Evaluation of Classifiers and Features. Appl. Sci. 2023, 13, 13269.

Topics

Phishing Website Detection
Cybersecurity

Source

Organization: github

Created: 1/9/2023
