CodeNet
The CodeNet dataset was created by the Graduate School of Informatics at Nagoya University and is primarily used to evaluate multilingual code clone detectors. It contains code from two online judge systems (AIZU OJ and AtCoder) in Java, Python, C, and C++. The dataset selects 12 sub‑datasets to reflect edit‑distance similarity ranges targeted by clone detectors, aiming to assess detector correctness. CodeNet is applied in software engineering for code clone detection, addressing limitations of existing detectors in language extensibility and detection performance.
Description
MCCD_Benchmarking Dataset Overview
Dataset Download
- URL: https://www.dropbox.com/scl/fi/lezo5ul15qg6oudud82fx/RecallBenchmarking.zip?rlkey=pm5grsfq314s2cuw3bbdcm84l&dl=0
- Contents after extraction: two folders, benchmark-data and simi-source, to be placed under MCCD_Benchmarking/recall/.
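A quick way to confirm that the extracted folders ended up in the right place is a small check script. The following is only a sketch: the helper name missing_folders and the relative path to the MCCD_Benchmarking checkout are assumptions made for illustration, not part of the benchmark itself.

```python
from pathlib import Path

# Folder names as listed above; the checkout location used below is an assumption.
EXPECTED_FOLDERS = ("benchmark-data", "simi-source")


def missing_folders(recall_dir):
    """Return the expected folder names that are not present in recall_dir."""
    base = Path(recall_dir)
    return [name for name in EXPECTED_FOLDERS if not (base / name).is_dir()]


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever MCCD_Benchmarking was cloned.
    missing = missing_folders(Path("MCCD_Benchmarking") / "recall")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Dataset folders are in place.")
```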
Dataset Usage
Execute Clone Detection
- Subset identifiers: "p02263", "p00048", "p00001", "p00000", "p02269", "p02256", "p02257", "p02265", "p00002", "p00003", "p00008", "p00050", "p02271", "p00005"
- Result file naming: Language_problemId.csv
- File format: Each line represents a clone pair, formatted as "segment1 file path, segment1 start line, segment1 end line, segment2 file path, segment2 start line, segment2 end line"
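If a detector reports clone pairs in some other form, its output has to be converted into this layout before importing. The sketch below shows one possible converter. The function name write_clone_pairs, the example paths, and the exact spelling of the language part of the file name are assumptions, as is writing fields with no space after each comma.

```python
import csv
from pathlib import Path


def write_clone_pairs(result_dir, language, problem_id, pairs):
    """Write clone pairs to <result_dir>/<language>_<problem_id>.csv.

    Each pair is a tuple (path1, start1, end1, path2, start2, end2),
    matching the six fields of the result file format.
    """
    out_path = Path(result_dir) / f"{language}_{problem_id}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        # Assumption: plain comma separator, one clone pair per line.
        writer = csv.writer(handle, lineterminator="\n")
        for path1, start1, end1, path2, start2, end2 in pairs:
            writer.writerow([path1, start1, end1, path2, start2, end2])
    return out_path


if __name__ == "__main__":
    # Hypothetical detector output for subset p00001.
    example = [("p00001/Java/a.java", 1, 20, "p00001/Java/b.java", 3, 22)]
    print(write_clone_pairs("results", "Java", "p00001", example))
```

Writing one file per language and subset keeps the result folder in the shape that the import step expects.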
Add Target Tool
- Command: python3 AddDetector.py ToolName
- Parameter: ToolName is the identifier of the target clone detector
Import Detection Results
- Command: python3 ImportClones.py ToolName ResultFolder [Language+]
- Parameters:
  - ToolName: tool identifier
  - ResultFolder: path to the folder containing results for each subset
  - Language: optionally list target languages to import
Evaluation
- Command: python3 Evaluation.py [ToolName]
- Parameter: ToolName is the tool identifier; if omitted, all registered tools are evaluated
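The three steps so far (register the tool, import its results, evaluate it) can be chained in a small driver script. This is a minimal sketch, assuming the scripts are run from the directory that contains them. The helper names build_commands and run_pipeline, the tool name, and the language spellings are hypothetical.

```python
import subprocess


def build_commands(tool_name, result_folder, languages=()):
    """Return the argument lists for the add, import, and evaluate steps."""
    return [
        ["python3", "AddDetector.py", tool_name],
        ["python3", "ImportClones.py", tool_name, result_folder, *languages],
        ["python3", "Evaluation.py", tool_name],
    ]


def run_pipeline(tool_name, result_folder, languages=(), cwd="."):
    """Run each step in order and stop at the first failing command."""
    for command in build_commands(tool_name, result_folder, languages):
        # cwd is assumed to be the directory holding the benchmark scripts.
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_pipeline to execute.
    for command in build_commands("MyTool", "results", ["Java", "Python"]):
        print(" ".join(command))
```

Passing check=True makes a failed step raise an exception, so evaluation never runs on a partially imported result set.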
Output All Results
- Command: python3 GroupedData.py
Remove Tool or Clone
- Command: python3 CloneClones.py ToolName
- Parameter: ToolName is the tool identifier
- Command: python3 RemoveDetector.py ToolName
- Parameter: ToolName is the tool identifier
Dataset Characteristics
- Recall: Provided
- Precision: To be provided
Source
Organization: arXiv
Created: 9/10/2024