neulab/conala
The CoNaLa dataset is a benchmark for code generation, made up of pairs of natural-language intents and code snippets. The data were crawled from Stack Overflow, automatically filtered, and then manually annotated, yielding 2,379 training examples and 500 test examples. A separate automatically mined set of 593,891 examples (nearly 600,000) is also provided. The intents are written in English and the code is Python. The two versions, manually annotated and automatically mined, have different fields and splits.
Description
Dataset Overview
- Name: CoNaLa
- Type: Code-natural language pair dataset
- Purpose: Evaluate code generation tasks
- Source: Crawled from Stack Overflow, then processed via automatic filtering and manual annotation
- Size:
  - Manually annotated set: 2,379 training examples, 500 test examples
  - Automatically mined set: 593,891 examples
Dataset Structure
- Manually Annotated Set:
  - Fields: [question_id, intent, rewritten_intent, snippet]
  - Training: 2,379 examples
  - Test: 500 examples
- Automatically Mined Set:
  - Fields: [question_id, parent_answer_post_id, prob, snippet, intent, id]
  - Training: 593,891 examples
Data Instances
- Manually Annotated Example:
{
"question_id": 41067960,
"intent": "How to convert a list of multiple integers into a single integer?",
"rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
"snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
}
- Automatically Mined Example:
{
"question_id": 34705205,
"parent_answer_post_id": 34705233,
"prob": 0.8690001442846342,
"snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
"intent": "Sort a nested list by two elements",
"id": "34705205_34705233_0"
}
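To make the relationship between an intent and its snippet concrete, the two example records can be run on small inputs. This is only an illustrative sketch: the helper `run_snippet` and the sample inputs for `x` and `l` are assumptions made here and are not part of the dataset. The snippets are evaluated with `eval` only because these two are known, single-expression examples.

```python
# The two example records shown above, written as Python dicts.
curated = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list x of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}
mined = {
    "question_id": 34705205,
    "parent_answer_post_id": 34705233,
    "prob": 0.8690001442846342,
    "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    "intent": "Sort a nested list by two elements",
    "id": "34705205_34705233_0",
}


def run_snippet(snippet, variables):
    """Evaluate a single-expression snippet with the given variable bindings.

    Hypothetical helper for illustration only. Do not evaluate arbitrary
    dataset snippets this way, since eval() executes whatever it is given.
    """
    return eval(snippet, dict(variables))


# Sample inputs below are invented for this sketch.
as_integer = run_snippet(curated["snippet"], {"x": [1, 2, 3]})
ranked = run_snippet(
    mined["snippet"],
    {"l": [["b", "2"], ["a", "2"], ["c", "10"]]},
)

print(as_integer)  # 123
print(ranked)      # [['c', '10'], ['a', '2'], ['b', '2']]
```

The mined snippet sorts by the second element as a descending integer and breaks ties on the first element, which is more specific than its intent ("Sort a nested list by two elements"). The manually annotated set's `rewritten_intent` field is meant to close this kind of gap between the question title and what the code does.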
Data Fields
- Manually Annotated Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| intent | string | Natural-language intent (the original question title) |
| rewritten_intent | string | Crowdsourced revised intent reflecting the full meaning of the code |
| snippet | string | Code snippet implementing the intent |

- Automatically Mined Set:

| Field | Type | Description |
|------|------|-------------|
| question_id | int64 | Stack Overflow question ID |
| parent_answer_post_id | int64 | Answer post ID from which the candidate code snippet was extracted |
| intent | string | Natural-language intent (the original question title) |
| snippet | string | Code snippet implementing the intent |
| id | string | Unique ID for the intent/code pair |
| prob | float64 | Probability assigned by the mining model |
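The field tables above can be mirrored as typed records, and the `prob` field lends itself to filtering the mined set for higher-confidence pairs. The sketch below assumes records are plain Python dicts. The names `CuratedExample`, `MinedExample`, `missing_fields`, and `filter_by_prob` are hypothetical, and the 0.5 threshold is an arbitrary choice for illustration, not a value recommended by the dataset.

```python
from typing import TypedDict


class CuratedExample(TypedDict):
    """One record of the manually annotated set."""
    question_id: int
    intent: str
    rewritten_intent: str
    snippet: str


class MinedExample(TypedDict):
    """One record of the automatically mined set."""
    question_id: int
    parent_answer_post_id: int
    prob: float
    snippet: str
    intent: str
    id: str


def missing_fields(record, schema):
    """Return the sorted names of schema fields absent from a record."""
    return sorted(set(schema.__annotations__) - set(record))


def filter_by_prob(examples, min_prob):
    """Keep mined examples whose `prob` is at least `min_prob`."""
    return [ex for ex in examples if ex["prob"] >= min_prob]


mined_examples = [
    # The mined example shown in the Data Instances section.
    {
        "question_id": 34705205,
        "parent_answer_post_id": 34705233,
        "prob": 0.8690001442846342,
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        "intent": "Sort a nested list by two elements",
        "id": "34705205_34705233_0",
    },
    # Hypothetical low-probability record, invented purely for illustration.
    {
        "question_id": 1,
        "parent_answer_post_id": 2,
        "prob": 0.12,
        "snippet": "pass",
        "intent": "placeholder intent",
        "id": "1_2_0",
    },
]

# The threshold is an assumption; pick one that suits the downstream task.
kept = filter_by_prob(mined_examples, min_prob=0.5)
print([ex["id"] for ex in kept])
print(missing_fields({"question_id": 1}, CuratedExample))
```

A higher threshold trades dataset size for pairs the mining model considers more likely to be valid, so the right value depends on how much noise the downstream model can tolerate.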
Data Splits
- Manually Annotated Set: training and test splits
- Automatically Mined Set: training split only
Source
Organization: hugging_face