The dataset has four features: instance_id (integer), prompt (string), canonical_solution (string), and test (string). It is divided into four splits — train, test, validation, and prompt — each with its own file paths and sample count. The total download size is 228,122 bytes, and the total dataset size is 500,198 bytes.
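A minimal sketch of the record schema described above: the four field names follow the dataset card, while the class name and sample values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical record type mirroring the four features in the card;
# field names match the card, everything else is illustrative.
@dataclass
class CodeTask:
    instance_id: int          # integer identifier
    prompt: str               # problem statement / starter code
    canonical_solution: str   # reference implementation
    test: str                 # test code, stored as a string

# The four splits named in the card.
SPLITS = ("train", "test", "validation", "prompt")

sample = CodeTask(
    instance_id=0,
    prompt="def add(a, b):\n    ",
    canonical_solution="return a + b",
    test="assert add(1, 2) == 3",
)
```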
The dataset comprises programming‑related questions and starter code. Each entry includes a difficulty level, input‑output examples, public input‑output examples, a title, a source, a date, and a unique ID. The dataset consists of a single test split of 35 examples, with a total size of approximately 330,915,898 bytes and a download size of 222,291,880 bytes. The configuration name is `default`, and the data files are located at `data/test-*`.
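Since each entry carries a difficulty level, a common first step is to bucket the test split by difficulty. A toy sketch is below; the exact field keys (`id`, `difficulty`) and values are assumptions, not taken from the dataset card.

```python
from collections import Counter

# Toy stand-in for a few entries of the 35-example test split; real
# entries would also carry input-output examples, title, source, and date.
test_split = [
    {"id": "p1", "difficulty": "easy"},
    {"id": "p2", "difficulty": "hard"},
    {"id": "p3", "difficulty": "easy"},
]

# Count how many problems fall into each difficulty bucket.
by_difficulty = Counter(ex["difficulty"] for ex in test_split)
print(by_difficulty)  # Counter({'easy': 2, 'hard': 1})
```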
HumanEval-X is a benchmark dataset for evaluating the multilingual capabilities of code‑generation models. It comprises 820 high‑quality human‑written samples covering Python, C++, Java, JavaScript, and Go, each accompanied by test cases. The dataset can be used for code generation, translation, and related tasks.
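HumanEval‑style benchmarks are typically scored with the pass@k metric. Below is the standard unbiased estimator from the original HumanEval paper (a sketch of the metric itself, not HumanEval‑X's official evaluation harness).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: number of samples generated per problem
    c: number of those samples that pass the test cases
    Returns the probability that at least one of k samples drawn
    without replacement is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 = 1 - 7/10 = 0.3
print(pass_at_k(10, 3, 1))
```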
The Web2Code dataset was created by MBZUAI to improve multimodal large language models' (MLLMs) capabilities in web understanding and HTML code generation. It comprises 11.797 million web instruction‑response pairs, including webpage images, HTML code, and structured questions and answers. The dataset was constructed using GPT‑3.5 and GPT‑4 for data cleaning and new data generation. Web2Code is primarily used for web content generation and task automation, addressing the shortcomings of existing MLLMs in handling web screenshots and generating HTML code.
SAFIM (Syntax-Aware Fill-in-the-Middle) is a benchmark for evaluating large language models (LLMs) on code fill-in-the-middle (FIM) tasks. SAFIM comprises three sub-tasks: algorithmic block completion, control-flow expression completion, and API function call completion. The dataset is sourced from code submitted between April 2022 and January 2023 to minimize data contamination affecting evaluation results.
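Fill‑in‑the‑middle prompting works by splitting the code around the blank and reordering the pieces with sentinel markers. The sketch below uses made‑up sentinel strings in a prefix‑suffix‑middle (PSM) layout; real FIM‑trained models define their own special tokens, and SAFIM's exact prompt format is not specified here.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) prompt.

    The <PRE>/<SUF>/<MID> sentinels are placeholders; each
    FIM-capable model defines its own special tokens.
    """
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Example: an expression-completion task with a single blank.
code = "def area(r):\n    return <blank> * r * r\n"
prefix, suffix = code.split("<blank>")
prompt = build_fim_prompt(prefix, suffix)
# The model is asked to generate the middle span (here: "3.14159" or "pi").
```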
The Mostly Basic Python Problems (MBPP) dataset contains about 1,000 crowd-sourced Python programming problems intended for evaluating code-generation models. Each problem includes a task description, a code solution, and three automated test cases. The dataset is provided in two versions, full and sanitized, each comprising train, test, validation, and prompt partitions. It was created to assess code-generation capability and was collected and annotated at Google through internal crowdsourcing.
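MBPP-style evaluation executes a candidate solution and then each of the problem's three assert statements. A minimal sketch of that check is below, using a toy problem; the helper name is hypothetical, and real harnesses sandbox `exec` rather than running untrusted model output directly.

```python
def passes_tests(candidate_code: str, test_cases: list[str]) -> bool:
    """Exec the candidate solution, then each assert string.

    Returns True only if the code runs and every test case passes.
    WARNING: exec on untrusted code is unsafe; real evaluation
    harnesses run this inside a sandbox with timeouts.
    """
    env: dict = {}
    try:
        exec(candidate_code, env)       # define the candidate function
        for case in test_cases:
            exec(case, env)             # run one assert statement
    except Exception:
        return False
    return True

# Toy MBPP-style problem: one solution plus three automated test cases.
solution = "def square(n):\n    return n * n\n"
tests = [
    "assert square(2) == 4",
    "assert square(0) == 0",
    "assert square(-3) == 9",
]
print(passes_tests(solution, tests))  # True
```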