neulab/conala
The CoNaLa dataset is a benchmark for code generation, consisting of pairs of natural-language intents and code snippets. The data were crawled from Stack Overflow, automatically filtered, and then manually annotated, yielding 2,379 training examples and 500 test examples. An automatically mined set of 593,891 examples is also provided. The natural language is English and the code is Python. The dataset comes in two versions, a manually annotated one and an automatically mined one, each with its own fields and splits.
Dataset Overview
- Name: CoNaLa
- Type: Code-Natural Language Pair Dataset
- Purpose: Evaluate code generation tasks
- Source: Crawled from Stack Overflow and processed via automatic filtering and manual annotation
- Size:
  - Manually annotated set: 2,379 training examples, 500 test examples
  - Automatically mined set: 593,891 examples
Dataset Structure
- Manually Annotated Set:
  - Fields: [question_id, intent, rewritten_intent, snippet]
  - Training: 2,379 examples
  - Test: 500 examples
- Automatically Mined Set:
  - Fields: [question_id, parent_answer_post_id, prob, snippet, intent, id]
  - Training: 593,891 examples
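Because the two versions differ only in their field sets, a record's version can be told apart from its keys alone. The following is a minimal sketch of that check; the constant names, the function name `which_version`, and the returned labels are hypothetical and not part of the dataset or any library.

```python
from typing import Any, Mapping

# Field names copied from the lists above; the constant names are illustrative.
CURATED_FIELDS = frozenset({"question_id", "intent", "rewritten_intent", "snippet"})
MINED_FIELDS = frozenset(
    {"question_id", "parent_answer_post_id", "prob", "snippet", "intent", "id"}
)


def which_version(record: Mapping[str, Any]) -> str:
    """Return a label for the dataset version whose field set matches the record."""
    keys = frozenset(record)
    if keys == CURATED_FIELDS:
        return "manually_annotated"
    if keys == MINED_FIELDS:
        return "automatically_mined"
    raise ValueError(f"unrecognized field set: {sorted(keys)}")
```

A check like this can be useful when records from both versions end up in the same file or stream.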
Data Instances
- Manually Annotated Example:

    {
      "question_id": 41067960,
      "intent": "How to convert a list of multiple integers into a single integer?",
      "rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
      "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
    }

- Automatically Mined Example:

    {
      "question_id": 34705205,
      "parent_answer_post_id": 34705233,
      "prob": 0.8690001442846342,
      "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
      "intent": "Sort a nested list by two elements",
      "id": "34705205_34705233_0"
    }
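To make the two snippets above concrete, the sketch below wraps each one in a function and applies it to a small input. The function names and the sample inputs are made up for illustration; only the expressions inside the functions come from the dataset examples.

```python
def concat_digits(x: list[int]) -> int:
    # Snippet from the manually annotated example, unchanged.
    # Each element is weighted by a power of ten according to its position
    # from the right, so the result matches digit concatenation only when
    # every element is a single digit.
    return sum(d * 10 ** i for i, d in enumerate(x[::-1]))


def sort_nested(l: list[list[str]]) -> list[list[str]]:
    # Snippet from the automatically mined example, unchanged.
    # Sorts by the second element as a number, descending, and breaks ties
    # by the first element, ascending.
    return sorted(l, key=lambda x: (-int(x[1]), x[0]))


print(concat_digits([1, 2, 3]))
# prints 123

print(sort_nested([["b", "2"], ["a", "2"], ["c", "10"]]))
# prints [['c', '10'], ['a', '2'], ['b', '2']]
```

Note that the mined intent, "Sort a nested list by two elements", says less than the code does (sort direction, numeric conversion). The `rewritten_intent` field in the manually annotated set exists to close that kind of gap.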
Data Fields
- Manually Annotated Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| intent | string | Natural-language intent (the original question title) |
| rewritten_intent | string | Crowdsourced revised intent reflecting the full meaning of the code |
| snippet | string | Code snippet implementing the intent |
- Automatically Mined Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| parent_answer_post_id | int64 | Answer post ID from which the candidate code snippet was extracted |
| intent | string | Natural-language intent (the original question title) |
| snippet | string | Code snippet implementing the intent |
| id | string | Unique ID for the intent/code pair |
| prob | float64 | Probability assigned by the mining model |
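Since every mined pair carries a `prob` score from the mining model, one plausible use is to keep only the pairs above a chosen cutoff. The sketch below shows this under stated assumptions: the `MinedPair` class, the `filter_by_prob` function, the 0.5 cutoff, and the second sample record are all hypothetical; the first record is the mined example shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinedPair:
    """One record of the automatically mined set, typed per the field table."""
    question_id: int
    parent_answer_post_id: int
    prob: float
    snippet: str
    intent: str
    id: str


def filter_by_prob(pairs: list[MinedPair], min_prob: float) -> list[MinedPair]:
    """Keep pairs whose mining-model probability is at least min_prob."""
    return [p for p in pairs if p.prob >= min_prob]


PAIRS = [
    # Mined example from this document.
    MinedPair(
        question_id=34705205,
        parent_answer_post_id=34705233,
        prob=0.8690001442846342,
        snippet="sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        intent="Sort a nested list by two elements",
        id="34705205_34705233_0",
    ),
    # Hypothetical low-probability record, invented for this illustration.
    MinedPair(
        question_id=1,
        parent_answer_post_id=2,
        prob=0.12,
        snippet="print(x)",
        intent="Hypothetical low-confidence pair",
        id="1_2_0",
    ),
]

kept = filter_by_prob(PAIRS, min_prob=0.5)  # 0.5 is an arbitrary cutoff
print([p.id for p in kept])
# prints ['34705205_34705233_0']
```

The right cutoff depends on how much noise a downstream model tolerates; the source does not recommend a value.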
Data Splits
- Manually Annotated Set: training and test splits
- Automatically Mined Set: training split only