CodeNet

The CodeNet dataset was created by the Graduate School of Informatics at Nagoya University and is primarily used to evaluate multilingual code clone detectors. It contains code from two online judge systems (AIZU OJ and AtCoder) in Java, Python, C, and C++. To assess detector correctness, 12 sub-datasets are selected so that they cover the edit-distance similarity ranges that clone detectors target. In software engineering, CodeNet is applied to code clone detection, where it helps address the limitations of existing detectors in language extensibility and detection performance.

Source: arXiv
Created: Sep 10, 2024
Updated: Sep 10, 2024
Overview

Dataset description and usage context

MCCD_Benchmarking Dataset Overview

Dataset Download

Dataset Usage

Execute Clone Detection

  • Subset identifiers: "p02263", "p00048", "p00001", "p00000", "p02269", "p02256", "p02257", "p02265", "p00002", "p00003", "p00008", "p00050", "p02271", "p00005"
  • Result file naming: Language_problemId.csv
  • File format: each line represents one clone pair, formatted as "segment1 file path, segment1 start line, segment1 end line, segment2 file path, segment2 start line, segment2 end line"
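
The six-field line format above can be illustrated with a minimal Python sketch. The ClonePair type, the format_pair and parse_pair helpers, and the sample file paths are hypothetical names introduced only for this illustration. The sketch assumes that file paths contain no commas and that line numbers are plain integers; neither detail is confirmed by the source.

```python
from typing import NamedTuple


class ClonePair(NamedTuple):
    """One clone pair: two code segments, each with a path and a line range."""
    path1: str
    start1: int
    end1: int
    path2: str
    start2: int
    end2: int


def format_pair(pair: ClonePair) -> str:
    """Render a clone pair as one comma-separated result line."""
    # Assumption: paths contain no commas, so no quoting is applied.
    return ",".join(str(field) for field in pair)


def parse_pair(line: str) -> ClonePair:
    """Parse one result line back into a ClonePair."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    path1, start1, end1, path2, start2, end2 = fields
    return ClonePair(path1, int(start1), int(end1), path2, int(start2), int(end2))


# Hypothetical paths, used only to show the round trip.
example = ClonePair("Java/p00001/s1.java", 1, 20, "Java/p00001/s2.java", 3, 22)
print(format_pair(example))
print(parse_pair(format_pair(example)) == example)
```

Writing each detected pair through a helper like format_pair, one pair per line, would yield the content of one result file for a single language and subset.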

Add Target Tool

  • Command: python3 AddDetector.py ToolName
  • Parameter: ToolName is the identifier of the target clone detector

Import Detection Results

  • Command: python3 ImportClones.py ToolName ResultFolder [Language+]
  • Parameters:
    • ToolName: tool identifier
    • ResultFolder: path to the folder containing results for each subset
    • Language: optional; one or more target languages to import
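
Before importing, it may help to check that ResultFolder holds one result file per language and subset. The sketch below derives the expected names from the Language_problemId.csv convention and the subset identifiers listed earlier. The helper names are hypothetical, and the exact spelling of the language part of the file name (for example "Java") is an assumption, so it is left as a parameter.

```python
from pathlib import Path
from typing import Iterable, List

# Subset identifiers as listed in the "Execute Clone Detection" section.
SUBSETS = [
    "p02263", "p00048", "p00001", "p00000", "p02269", "p02256", "p02257",
    "p02265", "p00002", "p00003", "p00008", "p00050", "p02271", "p00005",
]


def expected_result_files(languages: Iterable[str],
                          subsets: Iterable[str] = SUBSETS) -> List[str]:
    """List the file names implied by the Language_problemId.csv convention."""
    subsets = list(subsets)
    return [f"{lang}_{pid}.csv" for lang in languages for pid in subsets]


def missing_result_files(result_folder: str, languages: Iterable[str]) -> List[str]:
    """Return the expected result files that are absent from result_folder."""
    folder = Path(result_folder)
    return [name for name in expected_result_files(languages)
            if not (folder / name).is_file()]


# Assumed language label; the real benchmark may spell it differently.
print(expected_result_files(["Java"])[:3])
```

An empty list from missing_result_files suggests the folder is complete for the chosen languages; it does not validate the contents of the files.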

Evaluation

  • Command: python3 Evaluation.py [ToolName]
  • Parameter: ToolName is the tool identifier; if omitted, all registered tools are evaluated

Output All Results

  • Command: python3 GroupedData.py
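
The commands documented so far (add a tool, import its results, evaluate, output all results) can be chained from Python. This is a minimal sketch that uses only the command lines given above. The function names, the tool name "MyDetector", and the assumption that the scripts are run from the directory containing them are illustrative and not confirmed by the source.

```python
import subprocess
from typing import List, Sequence


def workflow_commands(tool_name: str, result_folder: str,
                      languages: Sequence[str] = ()) -> List[List[str]]:
    """Build the argument lists for the documented steps, in order."""
    return [
        ["python3", "AddDetector.py", tool_name],
        # Languages are optional; none are appended when the sequence is empty.
        ["python3", "ImportClones.py", tool_name, result_folder, *languages],
        ["python3", "Evaluation.py", tool_name],
        ["python3", "GroupedData.py"],
    ]


def run_workflow(commands: List[List[str]], scripts_dir: str) -> None:
    """Run each command in scripts_dir, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, cwd=scripts_dir, check=True)


# Only prints the commands; call run_workflow(...) to execute them.
for cmd in workflow_commands("MyDetector", "results", ["Java"]):
    print(" ".join(cmd))
```

Passing argument lists rather than a shell string avoids quoting problems when the result folder path contains spaces.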

Remove Tool or Clone

  • Command: python3 CloneClones.py ToolName
  • Parameter: ToolName is the tool identifier
  • Command: python3 RemoveDetector.py ToolName
  • Parameter: ToolName is the tool identifier

Dataset Characteristics

  • Recall: Provided
  • Precision: To be provided