
MLCQ

The MLCQ dataset is used for code smell detection experiments and contains code snippets along with relevant code metrics.

Source
GitHub
Created
Nov 26, 2024
Updated
Nov 26, 2024
Overview

Dataset description and usage context

MLCQ‑Experiments Dataset Overview

Source

Purpose

  • Supports the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study".
  • Enables comparison of NLP‑based models with baseline approaches for detecting code smells.

Processing Steps

  1. Environment Setup:
    • Create a conda environment and install dependencies.
    • Commands:
    conda create -n mlcqenv python=3.10
    conda activate mlcqenv
    conda install --file requirements.txt
    
  2. Data Extraction:
    • Set a GitHub token for API access.
    • Export token: export GITHUB_TOKEN=<your_github_token>
    • Run the extractor: python DataExtractor.py
  3. Baseline Model:
    • Use the J48 decision tree as the baseline.
    • Compute code metrics with the Designite tool.
    • Steps:
      1. python baseline/MetricsExtractor.py (prepare .java files)
      2. python baseline/DesigniteRun.py (run Designite)
      3. python baseline/DatasetCreator.py (final dataset)
      4. python train.py (train & test)
  4. Model Training:
    • Train BiLSTM with Attention:
    python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001 --hidden_dim 512 --num_layers 2
    
    • Train CodeBERT:
    python bert.py
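
The extraction step above needs the GitHub token because MLCQ references code by repository location rather than shipping the sources. As a minimal sketch of what such an extractor has to do, the snippet below builds an authenticated GitHub contents-API request; `build_snippet_request` is a hypothetical helper for illustration, not the actual interface of DataExtractor.py.

```python
import os
import urllib.request

def build_snippet_request(repo, path, ref="main"):
    """Build an authenticated GitHub contents-API request for one file.

    Hypothetical helper: illustrates the token handling a data extractor
    needs; the real DataExtractor.py may structure this differently.
    """
    token = os.environ.get("GITHUB_TOKEN", "")
    url = f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
    return urllib.request.Request(
        url,
        headers={
            # The token exported in step 2 authenticates the request.
            "Authorization": f"Bearer {token}",
            # Ask the API for the raw file body instead of JSON metadata.
            "Accept": "application/vnd.github.raw+json",
        },
    )

# Build (but do not send) a request for an example file.
req = build_snippet_request("apache/commons-lang", "pom.xml")
```

The request is only constructed here; calling `urllib.request.urlopen(req)` would perform the actual download, subject to the API's rate limits.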
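
The baseline's J48 classifier (Weka's C4.5 implementation) splits numeric code metrics at thresholds chosen by information gain. The repo's train.py handles the real training; the pure-Python sketch below only illustrates that split criterion on a toy metric, with made-up data and a hypothetical `best_split` helper.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Choose the threshold on one numeric metric that maximizes
    information gain — the criterion C4.5/J48 uses for numeric
    attributes. Returns (gain, threshold)."""
    base = entropy(labels)
    best = (0.0, None)
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put samples on both sides
        gain = (base
                - (len(left) / len(labels)) * entropy(left)
                - (len(right) / len(labels)) * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best

# Toy example: cyclomatic complexity vs. a smell label (1 = smelly).
cc     = [2, 3, 4, 18, 25, 31]
smelly = [0, 0, 0, 1, 1, 1]
gain, threshold = best_split(cc, smelly)
# → splitting at cc <= 4 separates the classes perfectly (gain = 1 bit)
```

A full J48 tree applies this search recursively over all metrics; the stump above is the single-split core of that procedure.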
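
In the BiLSTM-with-Attention model, the attention layer pools the per-token hidden states into one snippet representation before classification. The dependency-free sketch below shows dot-product attention pooling; the function name, the query vector, and the toy shapes are illustrative assumptions, not the repo's actual implementation (which trains via bilstm_attn_train.py).

```python
import math

def attention_pool(hidden_states, query):
    """Dot-product attention pooling over a sequence of hidden states.

    hidden_states: list of per-token vectors (for a BiLSTM these would be
    the concatenated forward/backward states); query: a learned vector.
    Returns the weighted sum of states and the attention weights.
    """
    # Score each token state against the query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Softmax the scores (max-subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the states gives the pooled representation.
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Three toy 2-dimensional token states; the query attends to dimension 0.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, query=[2.0, 0.0])
```

In the actual model both the query and the LSTM weights are learned jointly; the pooled vector then feeds a classifier head that predicts the smell label.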
    

Dependencies

Authors

  • Djamel Mesbah
  • Nour El Madhoun
  • Hani Chalouati
  • Khaldoun Al Agha