neulab/conala

CoNaLa is a benchmark for code generation made up of paired natural-language intents (in English) and Python code snippets. The data were crawled from Stack Overflow, automatically filtered, and then manually annotated, yielding 2,379 training examples and 500 test examples. A separate, automatically mined set of 593,891 examples is also provided. The two versions, manually annotated and automatically mined, have different fields and splits.

Source: hugging_face
Created: Nov 28, 2025
Updated: Oct 20, 2022

Dataset Overview

- Name: CoNaLa
- Type: Code‑natural language pair dataset
- Purpose: Evaluate code generation tasks
- Source: Crawled from Stack Overflow, then processed via automatic filtering and manual annotation
- Size, manually annotated set: 2,379 training examples, 500 test examples
- Size, automatically mined set: 593,891 examples

Dataset Structure

Manually Annotated Set:
- Fields: [question_id, intent, rewritten_intent, snippet]
- Training: 2,379 examples
- Test: 500 examples
Automatically Mined Set:
- Fields: [question_id, parent_answer_post_id, prob, snippet, intent, id]
- Training: 593,891 examples
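
As a minimal sketch of the structure above, the two field lists can be written down as sets and used to tell which version a record belongs to. The names `ANNOTATED_FIELDS`, `MINED_FIELDS`, and `which_set` are hypothetical helpers introduced here for illustration. They are not part of the dataset or of any library.

```python
# Field names are copied from the lists above.
# The constant and function names are illustrative assumptions.
ANNOTATED_FIELDS = {"question_id", "intent", "rewritten_intent", "snippet"}
MINED_FIELDS = {
    "question_id", "parent_answer_post_id", "prob", "snippet", "intent", "id",
}


def which_set(record: dict) -> str:
    """Return "annotated", "mined", or "unknown" based on the record's keys."""
    keys = set(record)
    if keys == ANNOTATED_FIELDS:
        return "annotated"
    if keys == MINED_FIELDS:
        return "mined"
    return "unknown"
```

A check like this can catch accidental mixing of the two versions, since they share only `question_id`, `intent`, and `snippet`.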

Data Instances

Manually Annotated Example:

{
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
}
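
To see what the `snippet` in this record computes, it can be run on a small input. The sketch below wraps it in a function; the name `digits_to_int`, the sample lists, and the string-join alternative are assumptions added for illustration and do not come from the dataset.

```python
def digits_to_int(x):
    # Snippet from the example record: the last element gets weight 10**0,
    # the one before it 10**1, and so on.
    return sum(d * 10 ** i for i, d in enumerate(x[::-1]))


def join_as_strings(x):
    # Alternative for comparison (not from the dataset): concatenate the
    # decimal representations, then parse the result.
    return int("".join(str(d) for d in x))


print(digits_to_int([1, 2, 3]))    # 123
print(join_as_strings([1, 2, 3]))  # 123

# The two agree only when every element is a single digit.
print(digits_to_int([1, 23]))      # 33
print(join_as_strings([1, 23]))    # 123
```

The last two lines show that the snippet treats each list element as one decimal digit, which is worth keeping in mind when reading the `rewritten_intent` literally.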

Automatically Mined Example:

{
    "question_id": 34705205,
    "parent_answer_post_id": 34705233,
    "prob": 0.8690001442846342,
    "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    "intent": "Sort a nested list by two elements",
    "id": "34705205_34705233_0"
}
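
The mined `snippet` can likewise be tried on a small input to make the two-key sort concrete. The sample list `l` below is an assumption chosen for illustration. The key sorts by the numeric value of the second element in descending order and breaks ties by the first element in ascending order.

```python
# Illustrative input (not from the dataset): [name, count-as-string] pairs.
l = [["b", "2"], ["a", "2"], ["c", "10"]]

# Snippet from the example record.
result = sorted(l, key=lambda x: (-int(x[1]), x[0]))

print(result)  # [['c', '10'], ['a', '2'], ['b', '2']]
```

Because the second element is converted with `int`, `"10"` sorts ahead of `"2"`, which a plain string comparison would not do.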

Data Fields

Manually Annotated Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| intent | string | Natural‑language intent (the original question title) |
| rewritten_intent | string | Crowdsourced revised intent reflecting the full meaning of the code |
| snippet | string | Code snippet implementing the intent |

Automatically Mined Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| parent_answer_post_id | int64 | Answer post ID from which the candidate code snippet was extracted |
| intent | string | Natural‑language intent (the original question title) |
| snippet | string | Code snippet implementing the intent |
| id | string | Unique ID for the intent/code pair |
| prob | float64 | Probability assigned by the mining model |
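
Since every mined pair carries a `prob` score, one plausible use is to keep only pairs above a cutoff. The following is a minimal sketch under stated assumptions: the `MinedExample` class, the `keep_confident` helper, and the 0.5 threshold are all hypothetical, and the second record is a synthetic copy of the documented example with a lowered `prob`, not real data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MinedExample:
    # Field names and types follow the mined-set field descriptions above.
    question_id: int
    parent_answer_post_id: int
    prob: float
    snippet: str
    intent: str
    id: str


def keep_confident(examples, threshold=0.5):
    """Keep examples whose mining probability is at least `threshold`.

    The default of 0.5 is an arbitrary illustration, not a recommended value.
    """
    return [ex for ex in examples if ex.prob >= threshold]


# The mined example instance shown in this document.
real = MinedExample(
    question_id=34705205,
    parent_answer_post_id=34705233,
    prob=0.8690001442846342,
    snippet="sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    intent="Sort a nested list by two elements",
    id="34705205_34705233_0",
)

# SYNTHETIC record for illustration only: same content, lowered probability.
synthetic = replace(real, prob=0.1, id="synthetic_low_prob")

kept = keep_confident([real, synthetic])
print([ex.id for ex in kept])  # ['34705205_34705233_0']
```

In the documented example, `id` begins with `question_id` and `parent_answer_post_id` joined by underscores. Whether that pattern holds for every record is an assumption that should be checked against the data before relying on it.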

Data Splits

Manually Annotated Set: training and test splits
Automatically Mined Set: training split only
