
google-research-datasets/mbpp

The Mostly Basic Python Problems (MBPP) dataset consists of about 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers and intended for evaluating code generation models. Each problem includes a natural-language task description, a reference code solution, and three automated test cases. The dataset is provided in two configurations, full and sanitized, each comprising train, test, validation, and prompt splits. It was created at Google to assess code generation capabilities and was collected and annotated through an internal crowdsourcing effort.

Overview

Dataset description and usage context

Dataset Overview

Basic Information

  • Dataset Name: Mostly Basic Python Problems (mbpp)
  • Language: English
  • License: CC-BY-4.0
  • Multilinguality: Monolingual
  • Size Category: n<1K
  • Source Dataset: Raw Data
  • Task Category: Text-to-Text Generation
  • Tags: Code Generation

Dataset Structure

Configurations

  • full:

    • Features:
      • task_id: int32
      • text: string
      • code: string
      • test_list: sequence of string
      • test_setup_code: string
      • challenge_test_list: sequence of string
    • Splits:
      • train: 374 samples, 176,879 bytes
      • test: 500 samples, 244,104 bytes
      • validation: 90 samples, 42,405 bytes
      • prompt: 10 samples, 4,550 bytes
    • Download Size: 236,069 bytes
    • Dataset Size: 467,938 bytes
  • sanitized:

    • Features:
      • source_file: string
      • task_id: int32
      • prompt: string
      • code: string
      • test_imports: sequence of string
      • test_list: sequence of string
    • Splits:
      • train: 120 samples, 63,453 bytes
      • test: 257 samples, 132,720 bytes
      • validation: 43 samples, 20,050 bytes
      • prompt: 7 samples, 3,407 bytes
    • Download Size: 115,422 bytes
    • Dataset Size: 219,630 bytes
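
Both configurations can be pulled directly through the Hugging Face datasets library. A minimal loading sketch, assuming the datasets package is installed (field names follow the feature lists above):

    from datasets import load_dataset

    # Load the "sanitized" configuration; pass "full" for the larger set.
    mbpp = load_dataset("google-research-datasets/mbpp", "sanitized")

    # The four splits described above.
    print({split: len(mbpp[split]) for split in ("train", "test", "validation", "prompt")})

    sample = mbpp["test"][0]
    print(sample["prompt"])     # natural-language task description
    print(sample["code"])       # reference solution
    print(sample["test_list"])  # assert statements used for evaluation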

Data Examples

  • full:

    {
        "task_id": 1,
        "text": "Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][]",
        "code": "R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]",
        "test_list": [
            "assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8",
            "assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12",
            "assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16"
        ],
        "test_setup_code": "",
        "challenge_test_list": []
    }
    
  • sanitized:

    {
        "source_file": "Benchmark Questions Verification V2.ipynb",
        "task_id": 2,
        "prompt": "Write a function to find the shared elements from the given two lists.",
        "code": "def similar_elements(test_tup1, test_tup2):\n  res = tuple(set(test_tup1) & set(test_tup2))\n  return (res) ",
        "test_imports": [],
        "test_list": [
            "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
            "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
            "assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))"
        ]
    }
    

Data Fields

  • source_file: source notebook for the sanitized example (e.g., "Benchmark Questions Verification V2.ipynb")
  • text/prompt: programming task description
  • code: solution to the programming task
  • test_setup_code/test_imports: code imports required to run the tests
  • test_list: test suite for validating the solution
  • challenge_test_list: additional, more challenging tests for deeper validation

Data Splits

  • Both the full and sanitized configurations contain four splits: train, test, validation, and prompt (the prompt split is intended for few-shot prompting, not for training; see the sketch below).
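
One common pattern is to prepend examples from the prompt split as few-shot demonstrations before the task to be solved. The template below is illustrative rather than the exact format used in the paper, and build_few_shot_prompt is a hypothetical helper:

    from datasets import load_dataset

    mbpp = load_dataset("google-research-datasets/mbpp", "full")

    def build_few_shot_prompt(task_text, test_list, n_shots=3):
        """Assemble a few-shot prompt from the dedicated 'prompt' split."""
        blocks = []
        for ex in mbpp["prompt"].select(range(n_shots)):
            blocks.append(
                f"Task: {ex['text']}\n"
                "Your code should pass these tests:\n"
                + "\n".join(ex["test_list"])
                + f"\n[BEGIN]\n{ex['code']}\n[DONE]"
            )
        # The final block poses the new task and leaves the solution open.
        blocks.append(
            f"Task: {task_text}\n"
            "Your code should pass these tests:\n"
            + "\n".join(test_list)
            + "\n[BEGIN]\n"
        )
        return "\n\n".join(blocks)

    example = mbpp["test"][0]
    print(build_few_shot_prompt(example["text"], example["test_list"]))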

Dataset Creation

  • Purpose: To evaluate code‑generation capabilities, a collection of simple programming tasks and their solutions was assembled.
  • Source: The dataset was built from scratch by internal crowdsourcing efforts at Google.
  • Annotation: The full version was created first; the sanitized subset then received a second round of annotation with refined task descriptions.

Usage Considerations

  • Execute generated Python code only in a secure sandbox, as model output may be unsafe; see the sketch after this list.
  • Social Impact: The dataset enables more reliable assessment of code‑generation models, helping to mitigate risks when deploying such models.
  • Known Limitations: Some task descriptions are ambiguous or underspecified; the sanitized configuration alleviates this through a second round of annotation improvements.
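
As a minimal starting point, a candidate solution and its test_list can be executed in a child process with a timeout. This is only a sketch (the run_candidate helper is hypothetical), and a plain subprocess is no substitute for a real sandbox such as a container or VM:

    import subprocess
    import sys
    import tempfile

    def run_candidate(code, test_list, timeout=5.0):
        """Run a candidate solution plus its asserts in a child process.

        Passes only if every assert succeeds before the timeout. Untrusted
        model output should still go through proper isolation (container,
        VM, or similar) rather than relying on this alone.
        """
        program = code + "\n\n" + "\n".join(test_list) + "\n"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            return result.returncode == 0  # non-zero means a failed assert or error
        except subprocess.TimeoutExpired:
            return False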

Additional Information

  • Curator: Google Research
  • License: CC-BY-4.0
  • Citation:
    @article{austin2021program,
      title={Program Synthesis with Large Language Models},
      author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
      journal={arXiv preprint arXiv:2108.07732},
      year={2021}
    }
    
  • Contributors: @lvwerra