CICIDS2017

The CICIDS2017 dataset contains several days of labeled network traffic and is used for malicious‑traffic detection. In this project the per‑day files are read, cleaned, and merged, and a random‑forest classifier is trained on the result.

Updated 8/23/2024

Description

Dataset Overview

This project employs the CICIDS2017 network‑traffic dataset for machine‑learning‑based malicious traffic identification.

Libraries Used

  • pandas – data manipulation, CSV I/O
  • seaborn – data visualization
  • matplotlib.pyplot – data visualization
  • sklearn – machine‑learning algorithms
  • numpy – numerical operations

Data Pre‑processing

  1. Import Libraries – load pandas, seaborn, matplotlib, etc.
  2. Read Data – use pandas.read_csv() to load the CSV file for each capture day, then merge the files into a single DataFrame.
  3. Data Cleaning:
    • Handle missing values (e.g., drop rows or impute with mean/median).
    • Remove irrelevant columns.
    • Encode categorical features (e.g., label encoding).
    • Reduce memory usage (e.g., adjust dtypes).
    • Drop features with a single unique value.
  4. Dimensionality Reduction (optional) – apply PCA or t‑SNE for visualization.
  5. Feature Selection – analyze feature importance and select relevant features for model training.
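
The reading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two in‑memory CSV strings stand in for per‑day files, and the column names, the sample values, and the helper name clean_frame are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for two per-day CSV files. Column names and values
# are illustrative only; note the stray leading spaces in the header.
HEADER = " Flow Duration, Total Fwd Packets,Flow Bytes/s, Bwd PSH Flags, Label\n"
DAY_ONE = HEADER + (
    "100,2,2000.0,0,BENIGN\n"
    "250,5,inf,0,DoS Hulk\n"
    ",3,1500.5,0,BENIGN\n"
    "400,7,9000.0,0,PortScan\n"
)
DAY_TWO = HEADER + (
    "90,1,800.0,0,BENIGN\n"
    "300,6,NaN,0,DoS Hulk\n"
    "120,4,7000.0,0,DoS Hulk\n"
)


def clean_frame(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Apply the cleaning steps from the list above to a merged DataFrame."""
    df = df.copy()
    # Normalize header names so later lookups do not depend on stray spaces.
    df.columns = df.columns.str.strip()
    # Treat infinite values as missing, then drop incomplete rows.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Drop features that carry no information (a single unique value).
    constant = [c for c in df.columns if c != label_col and df[c].nunique() <= 1]
    df = df.drop(columns=constant)
    # Reduce memory usage by downcasting numeric columns.
    for col in df.columns.drop(label_col):
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
    # Label-encode the categorical target.
    df[label_col] = LabelEncoder().fit_transform(df[label_col])
    return df.reset_index(drop=True)


# Read each day, merge into one DataFrame, then clean.
frames = [pd.read_csv(io.StringIO(text)) for text in (DAY_ONE, DAY_TWO)]
merged = pd.concat(frames, ignore_index=True)
cleaned = clean_frame(merged)
print(cleaned.dtypes)
```

Dropping incomplete rows is only one of the options listed above; imputing with the mean or median would replace the dropna() call.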

Exploratory Data Analysis (EDA)

  1. Distribution Analysis – use bar charts and histograms to explore feature distributions across traffic types (benign, DoS, etc.).
  2. Correlation Analysis – use heatmaps to identify relationships between the features and the target variable.
  3. Class Imbalance – assess the imbalance between traffic types; if present, apply an oversampling technique such as SMOTE, to the training split only so that synthetic samples do not leak into the test set.
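
As a rough sketch of the numbers behind these plots, the snippet below computes per‑class counts, an imbalance ratio, and feature‑to‑target correlations with pandas. The data is synthetic and the column names (flow_duration, packet_count, label) are assumptions made for illustration; the same Series and DataFrame could be passed to seaborn or matplotlib for the bar charts and heatmaps.

```python
import numpy as np
import pandas as pd

# Synthetic, deliberately imbalanced traffic sample (illustrative only).
rng = np.random.default_rng(0)
n = 1000
labels = rng.choice(["BENIGN", "DoS", "PortScan"], size=n, p=[0.8, 0.15, 0.05])
is_attack = (labels != "BENIGN").astype(int)
df = pd.DataFrame({
    # Attack flows are shifted upward so this feature separates the classes.
    "flow_duration": rng.exponential(100.0, size=n) + 500.0 * is_attack,
    # Pure noise, unrelated to the label.
    "packet_count": rng.integers(1, 50, size=n),
    "label": labels,
})

# Class distribution: the input to a bar chart of traffic types.
class_counts = df["label"].value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()

# Correlation of each numeric feature with a binary benign/attack target,
# ordered by absolute strength.
target = (df["label"] != "BENIGN").astype(int)
correlations = (
    df.drop(columns="label")
    .corrwith(target)
    .sort_values(key=abs, ascending=False)
)

print(class_counts)
print(f"imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
print(correlations)
```

A large imbalance ratio is the signal that oversampling or another rebalancing step may be worth trying.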

Model Training

  1. Data Split – use train_test_split to divide data into training and testing sets.
  2. Model Selection – choose classification models (e.g., Random Forest). GPU‑accelerated options like cuml may be considered.
  3. Training – train the model on the training set.
  4. Evaluation – assess performance on the test set using accuracy, a confusion matrix, and a classification report.
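
The four steps above might look like the following with scikit‑learn. This is a sketch under stated assumptions: make_classification generates synthetic data in place of the cleaned CICIDS2017 features, and the split ratio, stratification, and hyperparameters are illustrative choices, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and encoded labels.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],  # imbalanced, like real traffic captures
    random_state=42,
)

# 1. Data split (stratified so every class appears in both sets; an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 2-3. Model selection and training.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Evaluation on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

# Feature importances can feed back into the feature-selection step.
importances = model.feature_importances_

print(f"accuracy: {accuracy:.3f}")
print(cm)
print(report)
```

On imbalanced traffic data, accuracy alone can look good while rare attack classes are missed, which is why the per‑class rows of the classification report matter.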

Results & Discussion

  1. Present Results – report accuracy, the confusion matrix, and the classification report.
  2. Discuss Findings – analyze strengths, weaknesses, and impact of feature selection.
  3. Future Work – outline potential improvements and further research directions.

Code Structure

  • The code is organized by functionality: data preprocessing, feature engineering, model training, and evaluation, with comments explaining each section.

Topics

Cybersecurity
Machine Learning

Source

Organization: github

Created: 8/11/2024
