Dataset asset: Open Source Community · Cybersecurity · Machine Learning
CICIDS2017
The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic captures for malicious traffic detection. In this project the daily captures have been read, cleaned, and merged, and a random‑forest model has been applied for classification.
Source
GitHub
Created
Aug 11, 2024
Updated
Aug 23, 2024
Signals
794 views
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
This project employs the CICIDS2017 network‑traffic dataset for machine‑learning‑based malicious traffic identification.
Libraries Used
- pandas – data manipulation, CSV I/O
- seaborn – data visualization
- matplotlib.pyplot – data visualization
- sklearn – machine‑learning algorithms
- numpy – numerical operations
Data Pre‑processing
- Import Libraries – load pandas, seaborn, matplotlib, etc.
- Read Data – use pandas.read_csv() to load the CSV files for each capture date.
- Data Cleaning:
- Handle missing values (e.g., drop rows or impute with mean/median).
- Remove irrelevant columns.
- Encode categorical features (e.g., label encoding).
- Reduce memory usage (e.g., adjust dtypes).
- Drop features with a single unique value.
- Dimensionality Reduction (optional) – apply PCA or t‑SNE for visualization.
- Feature Selection – analyze feature importance and select relevant features for model training.
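The cleaning steps above might be sketched as follows. This is a minimal illustration on a tiny synthetic frame; the column names are hypothetical stand-ins for CICIDS2017 flow features, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for one day's CICIDS2017 CSV.
df = pd.DataFrame({
    "Flow Duration": [1200.0, np.nan, 87000.0, 450.0],
    "Protocol": [6, 6, 6, 6],  # single unique value -> dropped below
    "Fwd Packets": [10, 3, 250, 1],
    "Label": ["BENIGN", "DoS Hulk", "BENIGN", "PortScan"],
})

# Handle missing values: impute numeric gaps with the column median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Drop features with a single unique value.
single = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=single)

# Reduce memory usage by downcasting numeric dtypes.
for c in df.select_dtypes(include="float").columns:
    df[c] = pd.to_numeric(df[c], downcast="float")
for c in df.select_dtypes(include="integer").columns:
    df[c] = pd.to_numeric(df[c], downcast="integer")

# Encode the categorical label column as integer codes.
df["Label"] = df["Label"].astype("category").cat.codes
```

Imputing with the median (rather than dropping rows) is one of the two options the list mentions; either is defensible depending on how much traffic would be lost.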
Exploratory Data Analysis (EDA)
- Distribution Analysis – use bar charts, histograms to explore feature distributions across traffic types (benign, DoS, etc.).
- Correlation Analysis – use heatmaps to identify relationships between features and the target variable.
- Class Imbalance – assess imbalance in traffic types; if present, apply oversampling techniques such as SMOTE.
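A sketch of the imbalance and correlation checks, using randomly generated flows as a stand-in for the real data (the labels and feature names here are illustrative only). The correlation matrix computed at the end is what would be passed to seaborn.heatmap for the visual version; rebalancing with SMOTE would use the separate imbalanced-learn package:

```python
import numpy as np
import pandas as pd

# Synthetic flows: benign traffic deliberately dominates, mimicking
# the imbalance typical of intrusion-detection captures.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Flow Duration": rng.exponential(1000, 200),
    "Fwd Packets": rng.poisson(5, 200),
    "Label": rng.choice(["BENIGN", "DoS"], 200, p=[0.9, 0.1]),
})

# Class imbalance: normalized counts expose the skew toward BENIGN.
counts = df["Label"].value_counts(normalize=True)
print(counts)

# Correlation analysis on the numeric features; feed `corr` to
# seaborn.heatmap(corr, annot=True) for the heatmap view.
corr = df.select_dtypes(include="number").corr()
print(corr)
```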
Model Training
- Data Split – use train_test_split to divide the data into training and testing sets.
- Model Selection – choose classification models (e.g., Random Forest); GPU‑accelerated options such as cuml may be considered.
- Training – train the model on the training set.
- Evaluation – assess performance using accuracy, confusion matrix, classification report.
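The split–train–evaluate loop above can be sketched end to end. A synthetic imbalanced dataset stands in for the selected CICIDS2017 feature matrix, and the hyperparameters shown are placeholders rather than the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected feature matrix; weights mimic
# the benign-heavy class balance of network traffic.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, weights=[0.8, 0.2],
                           random_state=42)

# Stratified split keeps the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Evaluation: accuracy, confusion matrix, classification report.
print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```

With imbalanced traffic classes, the confusion matrix and per-class report are more informative than raw accuracy, which is why the evaluation step lists all three.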
Results & Discussion
- Present Results – report accuracy, confusion matrix, classification report.
- Discuss Findings – analyze strengths, weaknesses, and impact of feature selection.
- Future Work – outline potential improvements and further research directions.
Code Structure
- The code is organized by functionality: data preprocessing, feature engineering, model training, and evaluation, with comments explaining each section.