
google-research-datasets/mbpp

The Mostly Basic Python Problems (MBPP) dataset consists of about 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers and intended for evaluating code generation models. Each problem includes a natural-language task description, a reference code solution, and three automated test cases. The dataset is provided in two configurations, full and sanitized, each comprising train, test, validation, and prompt splits. It was created at Google to assess code generation capabilities and was collected and annotated through an internal crowdsourcing effort.

Overview

Dataset description and usage context

Dataset Overview

Basic Information

  • Dataset Name: Mostly Basic Python Problems (mbpp)
  • Language: English
  • License: CC-BY-4.0
  • Multilinguality: Monolingual
  • Size Category: n<1K
  • Source Dataset: Raw Data
  • Task Category: Text-to-Text Generation
  • Tags: Code Generation

Dataset Structure

Configurations

  • full:

    • Features:
      • task_id: int32
      • text: string
      • code: string
      • test_list: sequence of string
      • test_setup_code: string
      • challenge_test_list: sequence of string
    • Splits:
      • train: 374 samples, 176,879 bytes
      • test: 500 samples, 244,104 bytes
      • validation: 90 samples, 42,405 bytes
      • prompt: 10 samples, 4,550 bytes
    • Download Size: 236,069 bytes
    • Dataset Size: 467,938 bytes
  • sanitized:

    • Features:
      • source_file: string
      • task_id: int32
      • prompt: string
      • code: string
      • test_imports: sequence of string
      • test_list: sequence of string
    • Splits:
      • train: 120 samples, 63,453 bytes
      • test: 257 samples, 132,720 bytes
      • validation: 43 samples, 20,050 bytes
      • prompt: 7 samples, 3,407 bytes
    • Download Size: 115,422 bytes
    • Dataset Size: 219,630 bytes
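
Both configurations can be pulled directly through the Hugging Face datasets library. A minimal loading sketch, assuming the datasets package is installed (field names follow the feature lists above):

    from datasets import load_dataset

    # Load the "sanitized" configuration; pass "full" for the larger set.
    mbpp = load_dataset("google-research-datasets/mbpp", "sanitized")

    # The four splits described above.
    print({split: len(mbpp[split]) for split in ("train", "test", "validation", "prompt")})

    sample = mbpp["test"][0]
    print(sample["prompt"])     # natural-language task description
    print(sample["code"])       # reference solution
    print(sample["test_list"])  # assert statements used for evaluation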

Data Examples

  • full:

    {
        "task_id": 1,
        "text": "Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][]",
        "code": "R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]",
        "test_list": [
            "assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8",
            "assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12",
            "assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16"
        ],
        "test_setup_code": "",
        "challenge_test_list": []
    }
    
  • sanitized:

    {
        "source_file": "Benchmark Questions Verification V2.ipynb",
        "task_id": 2,
        "prompt": "Write a function to find the shared elements from the given two lists.",
        "code": "def similar_elements(test_tup1, test_tup2):\n  res = tuple(set(test_tup1) & set(test_tup2))\n  return (res) ",
        "test_imports": [],
        "test_list": [
            "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
            "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
            "assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
        ]
    }
    

Data Fields

  • source_file: source notebook for the sanitized example (e.g., "Benchmark Questions Verification V2.ipynb")
  • text/prompt: programming task description
  • code: solution to the programming task
  • test_setup_code/test_imports: code imports required to run the tests
  • test_list: test suite for validating the solution
  • challenge_test_list: additional, more challenging tests for deeper validation

Data Splits

  • Both the full and sanitized configurations contain four splits: train, test, validation, and prompt (the prompt split is intended for few-shot prompting, not for training; see the sketch below).
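
One common pattern is to prepend examples from the prompt split as few-shot demonstrations before the task to be solved. The template below is illustrative rather than the exact format used in the paper, and build_few_shot_prompt is a hypothetical helper:

    from datasets import load_dataset

    mbpp = load_dataset("google-research-datasets/mbpp", "full")

    def build_few_shot_prompt(task_text, test_list, n_shots=3):
        """Assemble a few-shot prompt from the dedicated 'prompt' split."""
        blocks = []
        for ex in mbpp["prompt"].select(range(n_shots)):
            blocks.append(
                f"Task: {ex['text']}\n"
                "Your code should pass these tests:\n"
                + "\n".join(ex["test_list"])
                + f"\n[BEGIN]\n{ex['code']}\n[DONE]"
            )
        # The final block poses the new task and leaves the solution open.
        blocks.append(
            f"Task: {task_text}\n"
            "Your code should pass these tests:\n"
            + "\n".join(test_list)
            + "\n[BEGIN]\n"
        )
        return "\n\n".join(blocks)

    example = mbpp["test"][0]
    print(build_few_shot_prompt(example["text"], example["test_list"]))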

Dataset Creation

  • Purpose: To evaluate code‑generation capabilities, a collection of simple programming tasks and their solutions was assembled.
  • Source: The dataset was built from scratch by internal crowdsourcing efforts at Google.
  • Annotation: The full version was created first; the sanitized subset then received a second round of annotation with refined task descriptions.

Usage Considerations

  • Execute generated Python code only in a secure sandbox, as model output may be unsafe; see the sketch after this list.
  • Social Impact: The dataset enables more reliable assessment of code‑generation models, helping to mitigate risks when deploying such models.
  • Known Limitations: Some task descriptions are ambiguous or underspecified; the sanitized configuration alleviates this through a second round of annotation improvements.
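
As a minimal starting point, a candidate solution and its test_list can be executed in a child process with a timeout. This is only a sketch (the run_candidate helper is hypothetical), and a plain subprocess is no substitute for a real sandbox such as a container or VM:

    import subprocess
    import sys
    import tempfile

    def run_candidate(code, test_list, timeout=5.0):
        """Run a candidate solution plus its asserts in a child process.

        Passes only if every assert succeeds before the timeout. Untrusted
        model output should still go through proper isolation (container,
        VM, or similar) rather than relying on this alone.
        """
        program = code + "\n\n" + "\n".join(test_list) + "\n"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            return result.returncode == 0  # non-zero means a failed assert or error
        except subprocess.TimeoutExpired:
            return False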

Additional Information

  • Curator: Google Research
  • License: CC-BY-4.0
  • Citation:
    @article{austin2021program,
      title={Program Synthesis with Large Language Models},
      author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
      journal={arXiv preprint arXiv:2108.07732},
      year={2021}
    }
    
  • Contributors: @lvwerra