Dataset asset | Open Source Community | Cybersecurity | Machine Learning
Phishing and Benign URLs Dataset
The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to detect phishing websites. Features of each URL and of the site content, such as the domain, the presence of an IP address, and the URL length, were extracted, resulting in a dataset with 18 features.
Source
GitHub
Created
Dec 9, 2021
Updated
Dec 9, 2021
Overview
Dataset description and usage context
Data Set Overview
Data Set Name
Phishing-Website-Detection
Data Set Purpose
Used to train machine learning models and deep neural networks to predict phishing websites.
Data Set Content
- Contains 5,000 phishing URLs and 5,000 legitimate URLs.
- The 18 extracted columns are: Domain, Have_IP, Have_At, URL_Length, URL_Depth, Redirection, https, Tiny_URL, Prefix/Suffix, DNS_Record, Web_Traffic, Domain_Age, Domain_End, iFrame, Mouse_Over, Right_Click, Web_Forwards, and Label.
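As a rough illustration of how a few of the address-bar features might be derived from a raw URL, here is a minimal sketch using only the Python standard library. The function names and the exact encodings (raw length and depth, 0/1 flags) are assumptions made for this example; the dataset's own preparation script may define or threshold these columns differently.

```python
import ipaddress
from urllib.parse import urlparse


def have_ip(url: str) -> int:
    """Return 1 if the host part of the URL is a literal IP address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def have_at(url: str) -> int:
    """Return 1 if the URL contains an '@' character, else 0."""
    return 1 if "@" in url else 0


def url_depth(url: str) -> int:
    """Count the non-empty path segments of the URL."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)


def redirection(url: str) -> int:
    """Return 1 if '//' appears again after the scheme separator, else 0."""
    remainder = url.split("://", 1)[-1]
    return 1 if "//" in remainder else 0


def prefix_suffix(url: str) -> int:
    """Return 1 if the host name contains a hyphen, else 0."""
    host = urlparse(url).hostname or ""
    return 1 if "-" in host else 0


def extract_features(url: str) -> dict:
    """Assemble a feature row; keys reuse the column names listed above."""
    return {
        "Have_IP": have_ip(url),
        "Have_At": have_at(url),
        "URL_Length": len(url),  # assumption: raw length, not a thresholded flag
        "URL_Depth": url_depth(url),
        "Redirection": redirection(url),
        "Prefix/Suffix": prefix_suffix(url),
    }
```

Calling `extract_features("http://192.168.0.1/a/b/login.php")` yields one row with `Have_IP` set to 1 and `URL_Depth` set to 3. Features that need network lookups or page content (DNS_Record, Domain_Age, iFrame, and so on) are left out of this sketch.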
Data Processing
- Preprocessing includes feature extraction and data cleaning.
- Dataset split into 80% training and 20% testing.
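The 80/20 split can be sketched without any third-party dependency as follows. The helper name `split_80_20`, the fixed seed, and the choice to stratify by label are assumptions for illustration; the source does not say whether the original split was stratified. In practice, scikit-learn's `train_test_split` with `test_size=0.2` does the same job.

```python
import random


def split_80_20(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle and split rows into train/test sets, keeping class balance.

    Assumption: the split is stratified, so each class contributes the same
    fraction of its samples to the test set.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

With 5,000 samples per class, this produces 8,000 training rows and 2,000 test rows, with 1,000 test rows drawn from each class.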
Usage
- Two Python scripts: one for data preparation, another for implementing and comparing ML algorithms.
Evaluation
- Six ML algorithms: XGBoost, Multilayer Perceptron, Random Forest, Decision Tree, SVM, and Autoencoder.
- Model performance is evaluated from predictions on the training and test sets, using confusion matrices and visualizations to compare accuracy.
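The confusion-matrix and accuracy computation behind this kind of comparison can be sketched in plain Python. The function names are hypothetical, and the convention that label 1 means phishing and 0 means legitimate is an assumption, since the label encoding is not stated here. scikit-learn's `confusion_matrix` and `accuracy_score` provide equivalent functionality.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier.

    Assumption: `positive` (1) marks phishing URLs; any other label is
    treated as legitimate.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions that match the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Toy labels and predictions, invented purely to exercise the functions.
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_counts = confusion_counts(toy_true, toy_pred)
toy_accuracy = accuracy(toy_counts)
```

Running the same two functions on each model's training and test predictions gives the per-model accuracies that can then be tabulated or plotted side by side.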
Conclusion
- XGBoost performs best on this dataset, achieving the highest accuracy of the six algorithms.