MLCQ

The MLCQ dataset is used for code smell detection experiments and contains code snippets along with relevant code metrics.

Updated 11/26/2024
github

Description

MLCQ‑Experiments Dataset Overview

Source

Purpose

  • Supports the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study".
  • Enables comparison of NLP‑based models with baseline approaches for detecting code smells.

Processing Steps

  1. Environment Setup:
    • Create a conda environment and install dependencies.
    • Commands:
    conda create -n mlcqenv python=3.10
    conda activate mlcqenv
    conda install --file requirements.txt
    
  2. Data Extraction:
    • Set a GitHub token for API access.
    • Export token: export GITHUB_TOKEN=<your_github_token>
    • Run the extractor: python DataExtractor.py
  3. Baseline Model:
    • Use the J48 decision tree (Weka's implementation of C4.5) as the baseline.
    • Compute code metrics with the Designite tool.
    • Steps:
      1. python baseline/MetricsExtractor.py (prepare the .java files)
      2. python baseline/DesigniteRun.py (run Designite)
      3. python baseline/DatasetCreator.py (build the final dataset)
      4. python train.py (train and test the baseline)
  4. Model Training:
    • Train BiLSTM with Attention:
    python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001 --hidden_dim 512 --num_layers 2
    
    • Train CodeBERT:
    python bert.py
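
The BiLSTM command above passes five hyperparameters on the command line. As a minimal sketch of how such flags could be parsed, the snippet below uses argparse from the standard library. The flag names and values come from the command above; the function name build_parser, the argument types, and the defaults are assumptions, and the real bilstm_attn_train.py may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag names are taken from the command above."""
    parser = argparse.ArgumentParser(
        description="Train a BiLSTM with attention (illustrative sketch)"
    )
    # Types and defaults are assumptions that mirror the values in the example command.
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--hidden_dim", type=int, default=512)
    parser.add_argument("--num_layers", type=int, default=2)
    return parser


# Parse the same flags as the documented command, given here as an explicit list.
example_argv = (
    "--batch_size 16 --epochs 20 --learning_rate 0.0001 "
    "--hidden_dim 512 --num_layers 2"
).split()
args = build_parser().parse_args(example_argv)
print(vars(args))
```

Keeping the parser in its own function makes it possible to check the flag handling without starting a training run.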
    
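
The extraction and baseline steps above are run one script at a time. The following sketch shows one way to chain them from a single driver, assuming the scripts are run from the repository root. The script paths are copied from the steps above; BASELINE_STEPS, build_commands, run_pipeline, and the dry_run flag are hypothetical names introduced for illustration and are not part of the repository.

```python
import os
import subprocess
import sys

# Script paths copied from the baseline steps above, in the documented order.
BASELINE_STEPS = [
    "baseline/MetricsExtractor.py",
    "baseline/DesigniteRun.py",
    "baseline/DatasetCreator.py",
    "train.py",
]


def build_commands(python=sys.executable):
    """Return the extraction command followed by the baseline commands."""
    return [[python, "DataExtractor.py"]] + [[python, step] for step in BASELINE_STEPS]


def run_pipeline(dry_run=True, env=None):
    """Print each command; execute it only when dry_run is False."""
    env = os.environ if env is None else env
    # The extractor needs a GitHub token, so fail early if it is missing.
    if not env.get("GITHUB_TOKEN"):
        raise RuntimeError("GITHUB_TOKEN is not set; export it before extraction")
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True, env=dict(env))
```

Calling run_pipeline() with the default dry_run=True only lists the commands, which is a cheap way to confirm the order before starting the real run with run_pipeline(dry_run=False).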

Dependencies

Authors

  • Djamel Mesbah
  • Nour El Madhoun
  • Hani Chalouati
  • Khaldoun Al Agha

Topics

Code Smell Detection
Code Quality Analysis

Source

Organization: github

Created: 11/26/2024
