Dataset asset: Open Source Community, Cybersecurity, Machine Learning

CICIDS2017

The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic captures for malicious traffic detection. The daily capture files are read, cleaned, and merged, and a random‑forest model is applied for classification.

Source
GitHub
Created
Aug 11, 2024
Updated
Aug 23, 2024
Signals
794 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

This project employs the CICIDS2017 network‑traffic dataset for machine‑learning‑based malicious traffic identification.

Libraries Used

  • pandas – data manipulation, CSV I/O
  • seaborn – data visualization
  • matplotlib.pyplot – data visualization
  • sklearn – machine‑learning algorithms
  • numpy – numerical operations

Data Pre‑processing

  1. Import Libraries – load pandas, seaborn, matplotlib, etc.
  2. Read Data – use pandas.read_csv() to load CSV files for different dates.
  3. Data Cleaning:
    • Handle missing values (e.g., drop rows or impute with mean/median).
    • Remove irrelevant columns.
    • Encode categorical features (e.g., label encoding).
    • Reduce memory usage (e.g., adjust dtypes).
    • Drop features with a single unique value.
  4. Dimensionality Reduction (optional) – apply PCA or t‑SNE for visualization.
  5. Feature Selection – analyze feature importance and select relevant features for model training.
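The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using a toy frame in place of the real CSVs; the `preprocess` helper and its column names are assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names (CICIDS2017 CSV headers often carry leading spaces).
    df = df.rename(columns=str.strip)
    # Flow-rate features can contain infinities; treat them as missing, then drop.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Reduce memory usage by downcasting numeric dtypes.
    for col in df.select_dtypes(include="number").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    # Drop features with a single unique value.
    return df.loc[:, df.nunique() > 1]

# Toy stand-in for pd.read_csv("Monday-WorkingHours.csv") etc.
toy = pd.DataFrame({
    " Flow Duration": [1.0, 2.0, np.inf],
    "Constant": [0, 0, 0],
    "Label": ["BENIGN", "DoS", "BENIGN"],
})
clean = preprocess(toy)
# The inf row is dropped and the single-valued column removed.
```

In practice each day's CSV is loaded with `pandas.read_csv()` and concatenated before this cleaning pass.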

Exploratory Data Analysis (EDA)

  1. Distribution Analysis – use bar charts and histograms to explore feature distributions across traffic types (benign, DoS, etc.).
  2. Correlation Analysis – use heatmaps to identify relationships between features and the target variable.
  3. Class Imbalance – assess imbalance across traffic types; if present, apply oversampling techniques such as SMOTE.
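The imbalance check and a simple rebalancing step might look like this. As a self-contained sketch it uses naive random oversampling via `sklearn.utils.resample`; SMOTE (from the separate imbalanced-learn package) would interpolate synthetic minority samples instead.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for the cleaned CICIDS2017 data.
df = pd.DataFrame({
    "Flow Duration": [1, 2, 3, 4, 5, 6],
    "Label": ["BENIGN"] * 5 + ["DoS"],
})

# Class distribution: reveals the skew toward benign traffic.
counts = df["Label"].value_counts()

# Correlation matrix over numeric features (rendered as a seaborn heatmap in practice).
corr = df.select_dtypes("number").corr()

# Random oversampling of the minority class up to the majority count.
minority = df[df["Label"] == "DoS"]
upsampled = resample(minority, replace=True,
                     n_samples=int(counts.max()), random_state=0)
balanced = pd.concat([df[df["Label"] == "BENIGN"], upsampled])
```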

Model Training

  1. Data Split – use train_test_split to divide data into training and testing sets.
  2. Model Selection – choose classification models (e.g., Random Forest). GPU‑accelerated options such as cuML may be considered.
  3. Training – train the model on the training set.
  4. Evaluation – assess performance using accuracy, confusion matrix, classification report.
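The four training steps can be sketched end to end as below. Synthetic data from `make_classification` stands in for the engineered CICIDS2017 feature matrix; model hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 1. Split into training and testing sets, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# 2-3. Fit a Random Forest (cuML offers a near drop-in GPU equivalent).
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 4. Evaluate on the held-out set.
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)
```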

Results & Discussion

  1. Present Results – report accuracy, confusion matrix, classification report.
  2. Discuss Findings – analyze strengths, weaknesses, and impact of feature selection.
  3. Future Work – outline potential improvements and further research directions.

Code Structure

  • The code is organized by functionality: data preprocessing, feature engineering, model training, and evaluation, with comments explaining each section.