NTU-NLP-sg/xCodeEval
xCodeEval is currently the largest executable multilingual multitask benchmark dataset, containing 25 million document‑level code examples covering approximately 7,500 unique problems across 17 programming languages. The dataset comprises seven tasks involving code understanding, generation, translation, and retrieval, and uses execution‑based evaluation. It also introduces a code execution engine, ExecEval, supporting all languages, and proposes a data splitting and selection scheme based on geometric mean and graph‑theoretic principles to balance the distribution of multiple attributes.
Description
Dataset Overview
Basic Information
- name: xCodeEval
- languages: code, English
- language creators: found, expert-generated
- license: cc-by-nc-4.0
- multilinguality: multilingual
- size: 1M<n<10M, 10M<n<100M
- source datasets: original
Tags
- programming languages
- code
- program synthesis
- automatic code repair
- code retrieval
- code translation
- code classification
Task Categories
- translation
- token classification
- text‑to‑text generation
- text retrieval
- text generation
- text classification
- feature extraction
- question answering
Dataset Description
- xCodeEval is a large‑scale multilingual multitask benchmark, containing ~25M document‑level code examples, covering ~7.5K unique problems and 17 programming languages.
- The dataset includes seven tasks covering code understanding, generation, translation, and retrieval, evaluated via execution.
- ExecEval, a multilingual code execution engine, was developed alongside the dataset and supports all of its languages.
- A data splitting and selection scheme based on the geometric mean and graph‑theoretic principles is proposed to balance the distribution of multiple attributes.
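Execution-based evaluation means a candidate program is judged by running it against unit tests rather than by comparing its text to a reference solution. The block below is a minimal Python sketch of that idea only. It is not ExecEval and does not reflect its interface: the function names `run_against_tests` and `passes_all`, the test-case shape (an `"input"` string plus a list of accepted `"output"` strings), and the verdict labels are all assumptions made for illustration. It also handles only Python programs and provides no sandboxing, so it should not be run on untrusted code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_against_tests(source_code, tests, timeout_s=5.0):
    """Run a Python program on each test's stdin and compare its stdout.

    `tests` is a list of dicts with the hypothetical keys "input" (str)
    and "output" (list of accepted output strings).
    Returns one verdict string per test case.
    """
    verdicts = []
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "solution.py"
        script.write_text(source_code, encoding="utf-8")
        for case in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, str(script)],
                    input=case["input"],
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                verdicts.append("TIME_LIMIT_EXCEEDED")
                continue
            accepted = [expected.strip() for expected in case["output"]]
            if proc.returncode != 0:
                verdicts.append("RUNTIME_ERROR")
            elif proc.stdout.strip() in accepted:
                verdicts.append("PASSED")
            else:
                verdicts.append("WRONG_ANSWER")
    return verdicts


def passes_all(verdicts):
    """A program counts as correct only if every test case passed."""
    return bool(verdicts) and all(v == "PASSED" for v in verdicts)


if __name__ == "__main__":
    program = "a, b = map(int, input().split())\nprint(a + b)\n"
    sample_tests = [{"input": "1 2\n", "output": ["3"]}]
    print(run_against_tests(program, sample_tests))
```

Comparing whitespace-stripped output is a simplification; a real judge may need per-problem checkers, memory limits, and a separate compilation step for compiled languages.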
Data Download
- Can be loaded via the Hugging Face load_dataset() API.
- Data can also be downloaded via Git LFS from Hugging Face.
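As a rough illustration of the first option, the sketch below wraps `datasets.load_dataset` for the `NTU-NLP-sg/xCodeEval` repository. The helper names `load_xcodeeval_task` and `peek` are hypothetical, and the config name used in the demo (`"program_synthesis"`) is an assumption that should be checked against the dataset card. Depending on the installed `datasets` version, extra arguments may also be required if the repository relies on a loading script.

```python
from itertools import islice

REPO_ID = "NTU-NLP-sg/xCodeEval"  # repository name as given on this page


def load_xcodeeval_task(task_config, split=None, streaming=True):
    """Load one task configuration from the Hugging Face Hub.

    `task_config` is the config name for a task; the valid names are
    defined by the dataset repository, not by this sketch.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    # streaming=True avoids downloading the full data before iterating.
    return load_dataset(REPO_ID, task_config, split=split, streaming=streaming)


def peek(examples, n=3):
    """Return the first n items of any iterable, e.g. a streamed split."""
    return list(islice(examples, n))


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    # "program_synthesis" is an assumed config name; verify before use.
    train = load_xcodeeval_task("program_synthesis", split="train")
    for example in peek(train, 2):
        print(sorted(example.keys()))
```

Streaming is used here because the dataset is large; for repeated experiments a full local download, for example via Git LFS, may be more practical.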
Task Details
- Tag Classification
- Code Compilation
- Program Synthesis
- Code Translation
- Automated Program Repair
- Code‑to‑Code Retrieval
- Natural Language‑to‑Code Retrieval
Shared Data
- problem_descriptions.jsonl
- unittest_db.json
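The two shared files are presumably meant to be combined with the task data: one holds problem statements, the other holds unit tests. The sketch below shows one way such a join could look, using small in-memory samples instead of the real files. The join key `src_uid`, the field names `description`, `input`, and `output`, and the assumption that `unittest_db.json` is a mapping from problem ID to a list of test cases are all illustrative guesses and should be verified against the actual files.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def attach_unit_tests(problems, unittest_db, key="src_uid"):
    """Pair each problem record with its unit tests via a shared key.

    `key` is an assumed field name; problems without tests get an empty list.
    """
    joined = []
    for problem in problems:
        tests = unittest_db.get(problem.get(key), [])
        joined.append({**problem, "unit_tests": tests})
    return joined


# Tiny stand-ins for problem_descriptions.jsonl and unittest_db.json.
SAMPLE_PROBLEMS_JSONL = (
    '{"src_uid": "p1", "description": "Print the sum of two integers."}\n'
    '{"src_uid": "p2", "description": "Print the input reversed."}\n'
)
SAMPLE_UNITTEST_DB_JSON = '{"p1": [{"input": "1 2", "output": ["3"]}]}'

if __name__ == "__main__":
    problems = parse_jsonl(SAMPLE_PROBLEMS_JSONL)
    unittest_db = json.loads(SAMPLE_UNITTEST_DB_JSON)
    for record in attach_unit_tests(problems, unittest_db):
        print(record["src_uid"], len(record["unit_tests"]))
```

For the real files, the same functions could be fed with `Path("problem_descriptions.jsonl").read_text()` and `json.load(open("unittest_db.json"))` once the field names are confirmed.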
Source
Organization: hugging_face
Created: Unknown