neulab/conala

The CoNaLa dataset is a benchmark for code generation, consisting of pairs of natural-language intents and code snippets. The data were crawled from Stack Overflow, automatically filtered, and then manually annotated, yielding 2,379 training examples and 500 test examples. An automatically mined set of nearly 600,000 examples is also provided. The natural language is English and the code is Python. The two versions, manually annotated and automatically mined, have different fields and splits.

Updated 10/20/2022
hugging_face

Description

Dataset Overview

Name: CoNaLa
Type: Code‑Natural Language Pair Dataset
Purpose: Evaluate code generation tasks
Source: Crawled from Stack Overflow and processed via automatic filtering and manual annotation
Size:

  • Manually annotated set: 2,379 training examples, 500 test examples
  • Automatically mined set: 593,891 examples

Dataset Structure

  • Manually Annotated Set:

    • Fields: [question_id, intent, rewritten_intent, snippet]
    • Training: 2,379 examples
    • Test: 500 examples
  • Automatically Mined Set:

    • Fields: [question_id, parent_answer_post_id, prob, snippet, intent, id]
    • Training: 593,891 examples

Data Instances

  • Manually Annotated Example:
{
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
}
  • Automatically Mined Example:
{
    "question_id": 34705205,
    "parent_answer_post_id": 34705233,
    "prob": 0.8690001442846342,
    "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    "intent": "Sort a nested list by two elements",
    "id": "34705205_34705233_0"
}
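
The two snippets above can be checked directly. The following is a minimal sketch that evaluates each one on a small input; the sample values for `x` and `l` are illustrative assumptions and are not part of the dataset.

```python
# The inputs below are hypothetical; only the two expressions come from the records above.

# Manually annotated example: concatenate the integers in x into a single integer.
x = [1, 2, 3]
concatenated = sum(d * 10 ** i for i, d in enumerate(x[::-1]))
print(concatenated)  # 123

# Automatically mined example: sort by the second element (as an integer, descending),
# breaking ties by the first element (ascending).
l = [["b", "3"], ["a", "2"], ["a", "3"]]
ordered = sorted(l, key=lambda x: (-int(x[1]), x[0]))
print(ordered)  # [['a', '3'], ['b', '3'], ['a', '2']]
```

Note that the first snippet only acts as concatenation when every element is a single digit: for `[1, 23]` it returns 33 rather than 123.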

Data Fields

  • Manually Annotated Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| intent | string | Natural‑language intent (the original question title) |
| rewritten_intent | string | Crowdsourced revised intent reflecting the full meaning of the code |
| snippet | string | Code snippet implementing the intent |

  • Automatically Mined Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| parent_answer_post_id | int64 | Answer post ID from which the candidate code snippet was extracted |
| intent | string | Natural‑language intent (the original question title) |
| snippet | string | Code snippet implementing the intent |
| id | string | Unique ID for the intent/code pair |
| prob | float64 | Probability assigned by the mining model |
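
The field lists can be mirrored as typed records, which makes it easier to filter the mined set by `prob`. This is a minimal sketch: the class names, the `filter_by_prob` helper, the 0.5 threshold, and the low-probability record are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, TypedDict


class AnnotatedPair(TypedDict):
    """One record of the manually annotated set (hypothetical class name)."""
    question_id: int
    intent: str
    rewritten_intent: str
    snippet: str


class MinedPair(TypedDict):
    """One record of the automatically mined set (hypothetical class name)."""
    question_id: int
    parent_answer_post_id: int
    prob: float
    snippet: str
    intent: str
    id: str


def filter_by_prob(pairs: Iterable[MinedPair], threshold: float) -> List[MinedPair]:
    """Keep mined pairs whose mining-model probability is at least `threshold`."""
    return [pair for pair in pairs if pair["prob"] >= threshold]


# The mined record shown under Data Instances.
mined_example: MinedPair = {
    "question_id": 34705205,
    "parent_answer_post_id": 34705233,
    "prob": 0.8690001442846342,
    "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    "intent": "Sort a nested list by two elements",
    "id": "34705205_34705233_0",
}

# A made-up low-probability variant, used only to show the filter dropping a record.
low_confidence: MinedPair = {**mined_example, "prob": 0.1, "id": "hypothetical_low_prob"}

kept = filter_by_prob([mined_example, low_confidence], threshold=0.5)
print([pair["id"] for pair in kept])  # ['34705205_34705233_0']
```

A suitable threshold depends on how much noise a downstream task tolerates; the value used here is not a recommendation from the dataset authors.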

Data Splits

  • Manually Annotated Set: training and test splits
  • Automatically Mined Set: training split only
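
Since the two versions expose different splits, a loader can reject an unavailable split before attempting a download. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository names the two versions `curated` and `mined`; those configuration names do not appear on this page and should be treated as assumptions, and whether the call succeeds may depend on the installed library version.

```python
# Splits per version, as listed above. The keys "curated" and "mined" are
# assumed configuration names for the manually annotated and mined sets.
AVAILABLE_SPLITS = {
    "curated": ("train", "test"),
    "mined": ("train",),
}


def load_conala(config: str = "curated", split: str = "train"):
    """Load one split of neulab/conala after checking that the split exists."""
    if config not in AVAILABLE_SPLITS:
        raise ValueError(f"unknown version {config!r}; expected one of {sorted(AVAILABLE_SPLITS)}")
    if split not in AVAILABLE_SPLITS[config]:
        raise ValueError(f"version {config!r} has no {split!r} split")
    # Imported lazily so the split check works without the library installed.
    from datasets import load_dataset
    return load_dataset("neulab/conala", config, split=split)
```

For example, `load_conala("mined", "test")` raises `ValueError` without touching the network, because the mined set has a training split only.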

Topics

Programming Language Processing
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
