Dataset asset, Open Source Community, Cybersecurity, Machine Learning

Phishing and Benign URLs Dataset

The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to detect phishing websites. Features were extracted from each URL and its site content (domain, IP address usage, URL length, and so on), resulting in a dataset with 18 columns, including the label.

Source
GitHub
Created
Dec 9, 2021
Updated
Dec 9, 2021

Data Set Overview

Data Set Name

Phishing-Website-Detection

Data Set Purpose

Used to train machine learning models and deep neural networks to predict phishing websites.

Data Set Content

  • Contains 5,000 phishing URLs and 5,000 legitimate URLs.
  • The 18 extracted columns are: Domain, Have_IP, Have_At, URL_Length, URL_Depth, Redirection, https, Tiny_URL, Prefix/Suffix, DNS_Record, Web_Traffic, Domain_Age, Domain_End, iFrame, Mouse_Over, Right_Click, Web_Forwards, and Label.
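
To make the address-bar features concrete, here is a minimal sketch of how a few of them could be derived from a raw URL using only the Python standard library. It is not the project's own extraction script: the function names, the shortener list `SHORTENERS`, the 54-character length threshold, and the 0/1 encoding are all assumptions chosen for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical shortener list; the list used by the original project is not given here.
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly", "is.gd"}


def have_ip(url: str) -> int:
    """Return 1 if the host part of the URL is a literal IP address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def url_depth(url: str) -> int:
    """Count the non-empty path segments of the URL."""
    return sum(1 for part in urlparse(url).path.split("/") if part)


def extract_url_features(url: str, long_url_threshold: int = 54) -> dict:
    """Derive a subset of the listed features (assumed 0/1 encoding, 1 = suspicious)."""
    host = urlparse(url).hostname or ""
    # Everything after the scheme separator; a further "//" suggests a redirect.
    after_scheme = url.split("://", 1)[-1]
    return {
        "Have_IP": have_ip(url),
        "Have_At": int("@" in url),
        "URL_Length": int(len(url) >= long_url_threshold),  # assumed threshold
        "URL_Depth": url_depth(url),
        "Redirection": int("//" in after_scheme),
        "Tiny_URL": int(host in SHORTENERS),
        "Prefix/Suffix": int("-" in host),
    }


if __name__ == "__main__":
    print(extract_url_features("http://192.168.0.1/login/index.html"))
```

Features such as DNS_Record, Web_Traffic, Domain_Age, and Domain_End need external lookups, and iFrame, Mouse_Over, Right_Click, and Web_Forwards need the page content, so they are left out of this sketch.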

Data Processing

  • Preprocessing includes feature extraction and data cleaning.
  • Dataset split into 80% training and 20% testing.
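
The 80/20 split can be sketched as follows, using only the standard library. Stratifying by label (so both classes keep their 50/50 balance in each part), the fixed seed, and the `stratified_split` helper itself are assumptions for illustration; the listing does not say how the original scripts perform the split.

```python
import random


def stratified_split(rows, label_key="Label", test_fraction=0.2, seed=42):
    """Split rows into train/test parts, keeping the class balance in each part.

    rows is a list of dicts; label_key names the target column.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group rows by class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


if __name__ == "__main__":
    # Placeholder rows standing in for the real feature table.
    rows = [{"id": i, "Label": i % 2} for i in range(10_000)]
    train, test = stratified_split(rows)
    print(len(train), len(test))
```

With a balanced table of 10,000 rows, this yields 8,000 training rows and 2,000 test rows, each half phishing and half legitimate.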

Usage

  • Two Python scripts: one for data preparation, another for implementing and comparing ML algorithms.

Evaluation

  • Six ML algorithms are compared: XGBoost, Multilayer Perceptron, Random Forest, Decision Tree, SVM, and Autoencoder.
  • Model performance evaluated via predictions on training and test sets, with confusion matrix and visualizations for accuracy comparison.
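
As a rough illustration of the evaluation step, the sketch below computes a binary confusion matrix and accuracy from true and predicted labels. It assumes the label encoding 1 = phishing and 0 = legitimate, which is not stated in this listing, and the helper names are hypothetical. The predictions in the usage example are made-up placeholders, not results from any of the six models.

```python
def confusion_matrix(y_true, y_pred):
    """Return counts of true/false positives and negatives (1 = phishing, assumed)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions that match the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


if __name__ == "__main__":
    # Placeholder labels for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
    cm = confusion_matrix(y_true, y_pred)
    print(cm, accuracy(cm))
```

Running the same two functions on both the training and test predictions of each model gives the per-model accuracies that the comparison relies on; a large gap between the two is a sign of overfitting.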

Conclusion

  • XGBoost performs best on this dataset, achieving the highest accuracy of the six algorithms.