Dataset asset, Open Source Community, Natural Language Processing, Programming Language Processing

neulab/conala

The CoNaLa dataset is a benchmark for code generation, consisting of paired natural-language intents and Python code snippets. The pairs were crawled from Stack Overflow, automatically filtered, and then manually annotated, yielding 2,379 training examples and 500 test examples. An automatically mined set of nearly 600,000 examples is also provided. The natural language is English and the code is Python. The two versions, manually annotated and automatically mined, have different fields and splits.

Source: hugging_face
Created: Nov 28, 2025
Updated: Oct 20, 2022

Dataset Overview

Name: CoNaLa
Type: Code‑natural language pair dataset
Purpose: Evaluate code generation tasks
Source: Crawled from Stack Overflow, then automatically filtered and manually annotated
Size:

  • Manually annotated set: 2,379 training examples, 500 test examples
  • Automatically mined set: 593,891 examples

Dataset Structure

  • Manually Annotated Set:

    • Fields: [question_id, intent, rewritten_intent, snippet]
    • Training: 2,379 examples
    • Test: 500 examples
  • Automatically Mined Set:

    • Fields: [question_id, parent_answer_post_id, prob, snippet, intent, id]
    • Training: 593,891 examples
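
As a rough illustration, the two field lists above can be written down as Python `TypedDict`s. The class names `CuratedRecord` and `MinedRecord` and the helper `missing_fields` are hypothetical names chosen for this sketch; they are not part of the dataset or of any library. The Python types mirror the `int64`, `string`, and `float64` types given under Data Fields.

```python
from typing import TypedDict


class CuratedRecord(TypedDict):
    """One row of the manually annotated set (hypothetical class name)."""
    question_id: int
    intent: str
    rewritten_intent: str
    snippet: str


class MinedRecord(TypedDict):
    """One row of the automatically mined set (hypothetical class name)."""
    question_id: int
    parent_answer_post_id: int
    prob: float
    snippet: str
    intent: str
    id: str


def missing_fields(record: dict, schema: type) -> set[str]:
    """Return the schema field names that are absent from `record`."""
    return set(schema.__annotations__) - set(record)
```

A check such as `missing_fields(row, MinedRecord)` returns an empty set when a row carries every expected field, which can help catch rows loaded from the wrong version of the dataset.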

Data Instances

  • Manually Annotated Example:
{
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
}
  • Automatically Mined Example:
{
    "question_id": 34705205,
    "parent_answer_post_id": 34705233,
    "prob": 0.8690001442846342,
    "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    "intent": "Sort a nested list by two elements",
    "id": "34705205_34705233_0"
}
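
To make the two instances concrete, the sketch below evaluates each `snippet` on a small input. The input values are made up for illustration and do not come from the dataset.

```python
# Manually annotated example: concatenate the integers in list x.
x = [1, 2, 3]  # illustrative input, not from the dataset
concatenated = sum(d * 10 ** i for i, d in enumerate(x[::-1]))
print(concatenated)  # 123

# Automatically mined example: sort a nested list by two elements.
# Orders by the second element as an integer (descending),
# then by the first element (ascending) to break ties.
l = [["b", "2"], ["a", "2"], ["c", "10"]]  # illustrative input
ordered = sorted(l, key=lambda x: (-int(x[1]), x[0]))
print(ordered)  # [['c', '10'], ['a', '2'], ['b', '2']]
```

Note that the first snippet matches its rewritten intent only when every element of `x` is a single digit; for example, `[1, 23]` evaluates to 33 rather than 123.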

Data Fields

  • Manually Annotated Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| intent | string | Natural‑language intent (the original question title) |
| rewritten_intent | string | Crowdsourced revised intent reflecting the full meaning of the code |
| snippet | string | Code snippet implementing the intent |

  • Automatically Mined Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| parent_answer_post_id | int64 | Answer post ID from which the candidate code snippet was extracted |
| intent | string | Natural‑language intent (the original question title) |
| snippet | string | Code snippet implementing the intent |
| id | string | Unique ID for the intent/code pair |
| prob | float64 | Probability assigned by the mining model |
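
Because each mined pair carries a `prob` score, one plausible way to use the mined set is to keep only the higher-scoring pairs. The following is a minimal sketch under that assumption. The helper name `top_mined`, the default threshold of 0.5, and the second record are hypothetical; the dataset description does not prescribe any cutoff.

```python
def top_mined(records, min_prob=0.5, limit=None):
    """Keep records with prob >= min_prob, highest probability first."""
    kept = [r for r in records if r["prob"] >= min_prob]
    kept.sort(key=lambda r: r["prob"], reverse=True)
    return kept if limit is None else kept[:limit]


records = [
    # The record shown under Data Instances (other fields omitted for brevity).
    {"id": "34705205_34705233_0", "prob": 0.8690001442846342},
    # Made-up low-confidence record, for illustration only.
    {"id": "hypothetical_0", "prob": 0.12},
]

print([r["id"] for r in top_mined(records)])  # ['34705205_34705233_0']
```

The `limit` argument allows selecting a fixed number of the highest-scoring pairs instead of, or in addition to, applying a threshold.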

Data Splits

  • Manually Annotated Set: training and test splits
  • Automatically Mined Set: training split only