Dataset asset | Open Source Community | Gait Analysis | Code Clone Detection
CodeNet
The CodeNet dataset was created by the Graduate School of Informatics at Nagoya University and is primarily used to evaluate multilingual code clone detectors. It contains code from two online judge systems (AIZU OJ and AtCoder) in Java, Python, C, and C++. Twelve sub‑datasets were selected to cover the edit‑distance similarity ranges that clone detectors target, so that detector correctness can be assessed. In software engineering, the dataset is used for code clone detection research, where existing detectors are limited in language extensibility and detection performance.
Source
arXiv
Created
Sep 10, 2024
Updated
Sep 10, 2024
Overview
Dataset description and usage context
MCCD_Benchmarking Dataset Overview
Dataset Download
- URL: https://www.dropbox.com/scl/fi/lezo5ul15qg6oudud82fx/RecallBenchmarking.zip?rlkey=pm5grsfq314s2cuw3bbdcm84l&dl=0
- Contents after extraction: two folders, benchmark-data and simi-source, to be placed under MCCD_Benchmarking/recall/.
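The placement step can be sketched in Python as follows. This is a minimal sketch, not part of the benchmark itself: the function name `place_benchmark_data` is hypothetical, and it assumes the archive has already been downloaded and that the two folders sit at the top level of the zip.

```python
import zipfile
from pathlib import Path
from typing import List


def place_benchmark_data(archive: Path, repo_root: Path) -> List[str]:
    """Extract the downloaded archive into <repo_root>/recall/.

    Returns the sorted names of the folders found under recall/ afterwards,
    so the caller can check that both expected folders are present.
    Assumption: the zip stores the folders at its top level.
    """
    target = repo_root / "recall"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return sorted(p.name for p in target.iterdir() if p.is_dir())
```

Calling `place_benchmark_data(Path("RecallBenchmarking.zip"), Path("MCCD_Benchmarking"))` and comparing the result with the two folder names listed above gives a quick sanity check before running any detector.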
Dataset Usage
Execute Clone Detection
- Subset identifiers: "p02263", "p00048", "p00001", "p00000", "p02269", "p02256", "p02257", "p02265", "p00002", "p00003", "p00008", "p00050", "p02271", "p00005"
- Result file naming: Language_problemId.csv
- File format: Each line represents a clone pair, formatted as "segment1 file path, segment1 start line, segment1 end line, segment2 file path, segment2 start line, segment2 end line"
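A detector's output for one subset could be written in this format with a short helper like the one below. This is a sketch under assumptions: the names `ClonePair`, `result_file_name`, and `write_clone_pairs` are hypothetical, the file is assumed to be plain comma-separated text with no header row and no spaces after commas, and the exact language strings expected by the benchmark (for example "Java") are not confirmed by the description above.

```python
import csv
from pathlib import Path
from typing import Iterable, NamedTuple


class ClonePair(NamedTuple):
    """One clone pair: two code segments, each given as path + line range."""
    path1: str
    start1: int
    end1: int
    path2: str
    start2: int
    end2: int


def result_file_name(language: str, problem_id: str) -> str:
    # Follows the Language_problemId.csv naming pattern.
    return f"{language}_{problem_id}.csv"


def write_clone_pairs(result_folder: Path, language: str, problem_id: str,
                      pairs: Iterable[ClonePair]) -> Path:
    """Write one clone pair per line and return the path of the result file."""
    result_folder.mkdir(parents=True, exist_ok=True)
    out = result_folder / result_file_name(language, problem_id)
    # newline="" lets the csv module control line endings itself.
    with out.open("w", newline="") as fh:
        csv.writer(fh).writerows(pairs)
    return out
```

Using the `csv` module rather than manual string joining means that file paths containing commas are quoted automatically.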
Add Target Tool
- Command: python3 AddDetector.py ToolName
- Parameter: ToolName is the identifier of the target clone detector
Import Detection Results
- Command: python3 ImportClones.py ToolName ResultFolder [Language+]
- Parameters:
  - ToolName: tool identifier
  - ResultFolder: path to the folder containing results for each subset
  - Language: optionally list target languages to import
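Before importing, it can be useful to check which result files a folder is missing. The sketch below is an illustration only: `missing_result_files` is a hypothetical helper, the subset identifiers are copied from the list in this overview, and the assumption that every language and subset combination should have a file is not something the import script is known to require.

```python
from pathlib import Path
from typing import List, Sequence

# Subset identifiers as listed in this overview.
SUBSETS = [
    "p02263", "p00048", "p00001", "p00000", "p02269", "p02256", "p02257",
    "p02265", "p00002", "p00003", "p00008", "p00050", "p02271", "p00005",
]


def missing_result_files(result_folder: Path, languages: Sequence[str]) -> List[str]:
    """Return the expected Language_problemId.csv names not found in result_folder."""
    expected = [f"{lang}_{pid}.csv" for lang in languages for pid in SUBSETS]
    return [name for name in expected if not (result_folder / name).is_file()]
```

An empty return value means that every expected file is present for the languages given.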
Evaluation
- Command: python3 Evaluation.py [ToolName]
- Parameter: ToolName is the tool identifier; if omitted, all registered tools are evaluated
Output All Results
- Command: python3 GroupedData.py
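The register, import, evaluate, and output steps can be chained from a small driver script. The sketch below only assembles the commands documented above. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and it assumes that all four scripts live in one directory and are run from there.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_pipeline(tool: str, result_folder: str,
                   languages: Sequence[str] = ()) -> List[List[str]]:
    """Return the benchmark commands in execution order as argument lists."""
    return [
        ["python3", "AddDetector.py", tool],
        # Languages are optional; with none given, no extra arguments are added.
        ["python3", "ImportClones.py", tool, result_folder, *languages],
        ["python3", "Evaluation.py", tool],
        ["python3", "GroupedData.py"],
    ]


def run_pipeline(commands: Sequence[Sequence[str]], scripts_dir: Path) -> None:
    """Run each command in scripts_dir, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(list(cmd), cwd=scripts_dir, check=True)
```

For example, `run_pipeline(build_pipeline("MyDetector", "results", ["Java"]), Path("MCCD_Benchmarking/recall"))` would run the four steps in order. The tool name, result folder, and script location in that call are placeholders to adapt to the local checkout.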
Remove Tool or Clone
- Command: python3 CloneClones.py ToolName
- Parameter: ToolName is the tool identifier
- Command: python3 RemoveDetector.py ToolName
- Parameter: ToolName is the tool identifier
Dataset Characteristics
- Recall: Provided
- Precision: To be provided