The dataset has four features: instance_id (integer), prompt (string), canonical_solution (string), and test (string). It is divided into four splits — train, test, validation, and prompt — each with its own file paths and sample count. The total download size is 228,122 bytes, and the total dataset size is 500,198 bytes.
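A minimal sketch of the record schema described above: the four field names follow the dataset card, while the class name and sample values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical record type mirroring the four features in the card;
# field names match the card, everything else is illustrative.
@dataclass
class CodeTask:
    instance_id: int          # integer identifier
    prompt: str               # problem statement / starter code
    canonical_solution: str   # reference implementation
    test: str                 # test code, stored as a string

# The four splits named in the card.
SPLITS = ("train", "test", "validation", "prompt")

sample = CodeTask(
    instance_id=0,
    prompt="def add(a, b):\n    ",
    canonical_solution="return a + b",
    test="assert add(1, 2) == 3",
)
```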
The dataset comprises programming‑related questions and starter code. Each entry includes a difficulty level, input‑output examples, public input‑output examples, a title, a source, a date, and a unique ID. The dataset consists of a single test split of 35 examples, with a total size of approximately 330,915,898 bytes and a download size of 222,291,880 bytes. The configuration name is `default`, and the data files are located at `data/test-*`.
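Since each entry carries a difficulty level, a common first step is to bucket the test split by difficulty. A toy sketch is below; the exact field keys (`id`, `difficulty`) and values are assumptions, not taken from the dataset card.

```python
from collections import Counter

# Toy stand-in for a few entries of the 35-example test split; real
# entries would also carry input-output examples, title, source, and date.
test_split = [
    {"id": "p1", "difficulty": "easy"},
    {"id": "p2", "difficulty": "hard"},
    {"id": "p3", "difficulty": "easy"},
]

# Count how many problems fall into each difficulty bucket.
by_difficulty = Counter(ex["difficulty"] for ex in test_split)
print(by_difficulty)  # Counter({'easy': 2, 'hard': 1})
```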
HumanEval-X is a benchmark dataset for evaluating the multilingual capabilities of code‑generation models. It comprises 820 high‑quality human‑written samples covering Python, C++, Java, JavaScript, and Go, each accompanied by test cases. The dataset can be used for code generation, translation, and related tasks.
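HumanEval‑style benchmarks are typically scored with the pass@k metric. Below is the standard unbiased estimator from the original HumanEval paper (a sketch of the metric itself, not HumanEval‑X's official evaluation harness).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: number of samples generated per problem
    c: number of those samples that pass the test cases
    Returns the probability that at least one of k samples drawn
    without replacement is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 = 1 - 7/10 = 0.3
print(pass_at_k(10, 3, 1))
```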
The Web2Code dataset was created by MBZUAI to improve multimodal large language models' (MLLMs) capabilities in web understanding and HTML code generation. It comprises 11.797 million web instruction‑response pairs, including webpage images, HTML code, and structured questions and answers. The dataset was constructed using GPT‑3.5 and GPT‑4 for data cleaning and new data generation. Web2Code is primarily used for web content generation and task automation, addressing the shortcomings of existing MLLMs in handling web screenshots and generating HTML code.
SAFIM (Syntax-Aware Fill-in-the-Middle) is a benchmark for evaluating large language models (LLMs) on code fill-in-the-middle (FIM) tasks. SAFIM comprises three sub-tasks: algorithmic block completion, control-flow expression completion, and API function call completion. The dataset is sourced from code submitted between April 2022 and January 2023 to minimize data contamination affecting evaluation results.
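Fill‑in‑the‑middle prompting works by splitting the code around the blank and reordering the pieces with sentinel markers. The sketch below uses made‑up sentinel strings in a prefix‑suffix‑middle (PSM) layout; real FIM‑trained models define their own special tokens, and SAFIM's exact prompt format is not specified here.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) prompt.

    The <PRE>/<SUF>/<MID> sentinels are placeholders; each
    FIM-capable model defines its own special tokens.
    """
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Example: an expression-completion task with a single blank.
code = "def area(r):\n    return <blank> * r * r\n"
prefix, suffix = code.split("<blank>")
prompt = build_fim_prompt(prefix, suffix)
# The model is asked to generate the middle span (here: "3.14159" or "pi").
```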
The Mostly Basic Python Problems (MBPP) dataset contains about 1,000 crowd-sourced Python programming problems intended for evaluating code-generation models. Each problem includes a task description, a code solution, and three automated test cases. The dataset is provided in two versions, full and sanitized, each comprising train, test, validation, and prompt partitions. It was created to assess code-generation capability and was collected and annotated at Google through internal crowdsourcing.
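MBPP-style evaluation executes a candidate solution and then each of the problem's three assert statements. A minimal sketch of that check is below, using a toy problem; the helper name is hypothetical, and real harnesses sandbox `exec` rather than running untrusted model output directly.

```python
def passes_tests(candidate_code: str, test_cases: list[str]) -> bool:
    """Exec the candidate solution, then each assert string.

    Returns True only if the code runs and every test case passes.
    WARNING: exec on untrusted code is unsafe; real evaluation
    harnesses run this inside a sandbox with timeouts.
    """
    env: dict = {}
    try:
        exec(candidate_code, env)       # define the candidate function
        for case in test_cases:
            exec(case, env)             # run one assert statement
    except Exception:
        return False
    return True

# Toy MBPP-style problem: one solution plus three automated test cases.
solution = "def square(n):\n    return n * n\n"
tests = [
    "assert square(2) == 4",
    "assert square(0) == 0",
    "assert square(-3) == 9",
]
print(passes_tests(solution, tests))  # True
```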