Tags: Open Source Community, Code Smell Detection, Code Quality Analysis
MLCQ
The MLCQ dataset is used for code smell detection experiments and contains code snippets along with relevant code metrics.
Source
github
Created
Nov 26, 2024
Updated
Nov 26, 2024
Overview
Dataset description and usage context
MLCQ‑Experiments Dataset Overview
Source
- MLCQ Dataset: MLCQ dataset
Purpose
- Supports the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study".
- Enables comparison of NLP‑based models with baseline approaches for detecting code smells.
Processing Steps
- Environment Setup:
- Create a conda environment and install dependencies.
- Commands:
conda create -n mlcqenv python=3.10
conda activate mlcqenv
conda install --file requirements.txt
- Data Extraction:
- Set a GitHub token for API access.
- Export token:
export GITHUB_TOKEN=<your_github_token>
- Run the extractor:
python DataExtractor.py
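The extractor authenticates its GitHub API calls with the exported token. As a rough illustration only (the real logic lives in DataExtractor.py; the repository, file path, and helper name below are hypothetical), a request to the GitHub contents API can be built like this:

```python
import os
import urllib.request

def build_request(owner, repo, path, token=None):
    """Build a (optionally authenticated) GitHub contents-API request.
    Hypothetical helper -- not taken from DataExtractor.py."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    # Ask for the raw file body instead of the JSON wrapper.
    headers = {"Accept": "application/vnd.github.raw"}
    if token:
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)

# Example: owner/repo/path are placeholders for whatever the extractor targets.
req = build_request("apache", "commons-lang", "pom.xml",
                    token=os.environ.get("GITHUB_TOKEN"))
```

Sending the request (e.g. with urllib.request.urlopen) then returns the file contents, subject to the token's rate limits.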
- Baseline Model:
- Use the J48 decision tree as the baseline.
- Compute code metrics with the Designite tool.
- Steps:
python baseline/MetricsExtractor.py   (prepare .java files)
python baseline/DesigniteRun.py       (run Designite)
python baseline/DatasetCreator.py     (final dataset)
python train.py                       (train & test)
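Designite computes the actual object-oriented metrics; the toy function below is only a stand-in to show the shape of a metrics-extraction step that maps a .java snippet to a metric vector. The function name, the regex heuristic, and the two metrics are illustrative, not Designite's method:

```python
import re

def toy_metrics(java_source):
    """Toy stand-in for a metrics extractor: non-blank LOC and a
    rough method count (NOT how Designite computes its metrics)."""
    lines = [l for l in java_source.splitlines() if l.strip()]
    # Very rough heuristic: lines that look like a method declaration.
    method_pat = re.compile(r"\b(public|private|protected)\b[^;{]*\([^)]*\)\s*\{")
    methods = sum(1 for l in lines if method_pat.search(l))
    return {"loc": len(lines), "methods": methods}

snippet = """
public class Foo {
    private int x;
    public int getX() { return x; }
    public void setX(int v) { x = v; }
}
"""
metrics = toy_metrics(snippet)
```

A real pipeline would feed metric vectors like this into the J48 baseline classifier.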
- Model Training:
- Train BiLSTM with Attention:
python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001 --hidden_dim 512 --num_layers 2
- Train CodeBERT:
python bert.py
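The core idea of the attention layer on top of the BiLSTM — softmax weights over per-timestep outputs, then a weighted sum — can be sketched in plain Python. This illustrates only the mechanism, under the assumption of a standard softmax attention; it is not the code in bilstm_attn_train.py:

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax-weighted pooling over per-timestep hidden states,
    as done by attention layers over BiLSTM outputs (illustrative)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, dimension by dimension.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 timesteps, hidden dim 2
pooled, weights = attention_pool(states, [0.1, 0.2, 0.7])
```

In the real model the scores come from a learned projection of the hidden states, and the pooled vector feeds the classification head.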
Dependencies
- Designite Tool: Designite tool
- CodeBERT Pre‑trained Model: CodeBERT on Hugging Face
Authors
- Djamel Mesbah
- Nour El Madhoun
- Hani Chalouati
- Khaldoun Al Agha