CICIDS2017
The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic data for malicious traffic detection. In this project the data have been read, cleaned, and merged, and a random-forest model has been applied for classification.
Updated 8/23/2024
Description
Dataset Overview
This project employs the CICIDS2017 network‑traffic dataset for machine‑learning‑based malicious traffic identification.
Libraries Used
- pandas – data manipulation, CSV I/O
- seaborn – data visualization
- matplotlib.pyplot – data visualization
- sklearn – machine‑learning algorithms
- numpy – numerical operations
Data Pre‑processing
- Import Libraries – load pandas, seaborn, matplotlib, etc.
- Read Data – use pandas.read_csv() to load CSV files for different dates.
- Data Cleaning:
  - Handle missing values (e.g., drop rows or impute with mean/median).
  - Remove irrelevant columns.
  - Encode categorical features (e.g., label encoding).
  - Reduce memory usage (e.g., adjust dtypes).
  - Drop features with a single unique value.
- Dimensionality Reduction (optional) – apply PCA or t‑SNE for visualization.
- Feature Selection – analyze feature importance and select relevant features for model training.
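The reading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names load_days and clean_frame are hypothetical, the target column is assumed to be called "Label", and treating infinite values as missing is an added assumption.

```python
import numpy as np
import pandas as pd


def load_days(sources) -> pd.DataFrame:
    """Read one CSV per capture day and stack them into a single frame."""
    return pd.concat([pd.read_csv(src) for src in sources], ignore_index=True)


def clean_frame(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Apply the listed cleaning steps; `label_col` is an assumed column name."""
    df = df.copy()
    # Column headers may carry stray whitespace; normalise them first.
    df.columns = df.columns.str.strip()
    # Assumption: infinities count as missing; incomplete rows are dropped.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Drop features with a single unique value.
    constant = [c for c in df.columns if c != label_col and df[c].nunique() <= 1]
    df = df.drop(columns=constant)
    # Reduce memory usage by downcasting numeric columns.
    for col in df.columns:
        if col == label_col:
            continue
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
    # Label-encode the target as integer codes (categories sorted by name).
    df[label_col] = df[label_col].astype("category").cat.codes
    return df.reset_index(drop=True)
```

A call such as clean_frame(load_days(paths)) would chain the two steps, where paths is a list of per-day CSV files. Dropping rows is only one of the options listed above; imputing with the mean or median would replace the dropna() call.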
Exploratory Data Analysis (EDA)
- Distribution Analysis – use bar charts and histograms to explore feature distributions across traffic types (benign, DoS, etc.).
- Correlation Analysis – use heatmaps to identify relationships between the features and the target variable.
- Class Imbalance – assess imbalance in traffic types; if present, apply oversampling techniques such as SMOTE.
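The numbers behind these EDA steps can be computed with pandas alone, as in the sketch below. The function names and the "Label" column are illustrative assumptions, and the target is assumed to be integer-encoded already. The plots themselves (bar charts, heatmaps) are left out so the example stays dependency-free.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Fraction of rows per traffic class, largest class first."""
    return labels.value_counts(normalize=True)


def imbalance_ratio(labels: pd.Series) -> float:
    """Size of the largest class divided by the size of the smallest."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())


def target_correlations(df: pd.DataFrame, label_col: str = "Label") -> pd.Series:
    """Absolute Pearson correlation of each numeric feature with the target.

    Assumes `label_col` already holds integer class codes.
    """
    corr = df.select_dtypes(include="number").corr()[label_col].drop(label_col)
    return corr.abs().sort_values(ascending=False)


# Tiny made-up frame, only to show the calls; not CICIDS2017 data.
demo = pd.DataFrame({
    "Flow Duration": [1, 2, 3, 4, 5, 6],
    "Pkts": [5, 5, 6, 5, 6, 5],
    "Label": [0, 0, 0, 0, 1, 1],
})
shares = class_shares(demo["Label"])
ratio = imbalance_ratio(demo["Label"])
ranked = target_correlations(demo)
```

A large ratio suggests that resampling is worth considering. SMOTE itself is provided by the separate imbalanced-learn package (imblearn.over_sampling.SMOTE) and is not used in this sketch.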
Model Training
- Data Split – use train_test_split to divide data into training and testing sets.
- Model Selection – choose classification models (e.g., Random Forest). GPU-accelerated options like cuml may be considered.
- Training – train the model on the training set.
- Evaluation – assess performance using accuracy, a confusion matrix, and a classification report.
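The split, training, and evaluation steps above might look like the following with scikit-learn. To stay self-contained, the sketch uses synthetic data from make_classification in place of the cleaned CICIDS2017 features. The hyperparameters, the stratified split, and the fixed random seeds are illustrative assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned feature matrix and labels.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out 25% for testing; stratify to keep class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train a random forest on the training set only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Impurity-based importances, usable as a starting point for feature selection.
importances = model.feature_importances_
```

Printing report gives per-class precision, recall, and F1, which tend to be more informative than accuracy alone when traffic classes are imbalanced.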
Results & Discussion
- Present Results – report accuracy, the confusion matrix, and the classification report.
- Discuss Findings – analyze strengths, weaknesses, and impact of feature selection.
- Future Work – outline potential improvements and further research directions.
Code Structure
- The code is organized by functionality: data preprocessing, feature engineering, model training, and evaluation, with comments explaining each section.
Topics
Cybersecurity
Machine Learning
Source
Organization: github
Created: 8/11/2024