Dataset asset | Open Source Community | Cybersecurity | Machine Learning

Phishing and Benign URLs Dataset

The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to detect phishing websites. Features of each URL and its site content (domain, IP address usage, URL length, and others) were extracted, resulting in a dataset with 18 columns, counting the Label column.

Source

GitHub

Created

Dec 9, 2021

Updated

Dec 9, 2021

Data Set Overview

Data Set Name

Phishing-Website-Detection

Data Set Purpose

Used to train machine learning models and deep neural networks to predict phishing websites.

Data Set Content

Contains 5,000 phishing URLs and 5,000 legitimate URLs.
Extracted features include: Domain, Have_IP, Have_At, URL_Length, URL_Depth, Redirection, https, Tiny_URL, Prefix/Suffix, DNS_Record, Web_Traffic, Domain_Age, Domain_End, iFrame, Mouse_Over, Right_Click, Web_Forwards and Label.
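To make a few of the URL-based columns concrete, here is a minimal sketch of how some of them could be computed with the Python standard library. The function name `extract_lexical_features` and every encoding below (for example, treating Prefix/Suffix as "hostname contains a hyphen" and Redirection as "a `//` appears after the scheme") are assumptions for illustration; the dataset's own preparation script may define these features differently.

```python
import ipaddress
from urllib.parse import urlparse


def extract_lexical_features(url: str) -> dict:
    """Compute a handful of URL-only features (hypothetical encodings)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Have_IP: 1 if the hostname is a literal IP address, else 0.
    try:
        ipaddress.ip_address(host)
        have_ip = 1
    except ValueError:
        have_ip = 0

    # URL_Depth: number of non-empty path segments.
    depth = len([segment for segment in parsed.path.split("/") if segment])

    # Redirection (assumed meaning): a "//" appears after the scheme separator.
    scheme_end = url.find("://")
    rest = url[scheme_end + 3:] if scheme_end != -1 else url

    return {
        "Domain": host,
        "Have_IP": have_ip,
        "Have_At": int("@" in url),
        "URL_Length": len(url),  # raw length; any threshold would be an assumption
        "URL_Depth": depth,
        "Redirection": int("//" in rest),
        "Prefix/Suffix": int("-" in host),  # assumed meaning: hyphen in hostname
    }


if __name__ == "__main__":
    print(extract_lexical_features("http://192.168.0.1/a/b/c.html"))
```

Columns such as DNS_Record, Web_Traffic, Domain_Age, and the content-based ones (iFrame, Mouse_Over, Right_Click, Web_Forwards) need network lookups or page fetches and are left out of this sketch.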

Data Processing

Preprocessing includes feature extraction and data cleaning.
Dataset split into 80% training and 20% testing.
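The 80/20 split can be sketched with the standard library alone. The version below is stratified, meaning it keeps the phishing/legitimate ratio the same in both parts; that choice, the function name `stratified_split`, and the fixed seed are assumptions for illustration, since the listing does not say how the split was drawn.

```python
import random


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), preserving the label ratio in each part."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


if __name__ == "__main__":
    # Synthetic stand-in for the dataset: (row_id, label) pairs, 5,000 per class.
    rows = [(i, 1) for i in range(5000)] + [(i, 0) for i in range(5000, 10000)]
    train, test = stratified_split(rows, label_of=lambda row: row[1])
    print(len(train), len(test))
```

With 5,000 rows per class and a 20% test fraction, each class contributes 1,000 rows to the test part and 4,000 to the training part.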

Usage

Two Python scripts: one for data preparation, another for implementing and comparing ML algorithms.

Evaluation

Six ML algorithms are compared: XGBoost, Multilayer Perceptron, Random Forest, Decision Tree, SVM, and Autoencoder.
Model performance evaluated via predictions on training and test sets, with confusion matrix and visualizations for accuracy comparison.
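For readers unfamiliar with the evaluation step, the following is a small sketch of a binary confusion matrix and the accuracy derived from it. It assumes labels are encoded as 1 for phishing and 0 for legitimate, which the listing does not state; the function names are hypothetical and not taken from the dataset's scripts.

```python
def confusion_matrix(y_true, y_pred):
    """Count outcomes for binary labels (assumed: 1 = phishing, 0 = legitimate)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            counts["tp"] += 1  # phishing correctly flagged
        elif truth == 0 and predicted == 1:
            counts["fp"] += 1  # legitimate URL wrongly flagged
        elif truth == 0 and predicted == 0:
            counts["tn"] += 1  # legitimate URL correctly passed
        else:
            counts["fn"] += 1  # phishing URL missed
    return counts


def accuracy(counts):
    """Fraction of predictions that match the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    cm = confusion_matrix(y_true, y_pred)
    print(cm, accuracy(cm))
```

Computing the same counts on both the training and test predictions makes it easy to spot a model that scores well on training data but noticeably worse on held-out data.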

Conclusion

XGBoost achieves the highest accuracy of the six algorithms on this dataset.
