NTU-NLP-sg/xCodeEval

xCodeEval is currently the largest executable multilingual multitask benchmark dataset, containing 25 million document‑level code examples covering approximately 7,500 unique problems across 17 programming languages. The dataset comprises seven tasks involving code understanding, generation, translation, and retrieval, and uses execution‑based evaluation. It also introduces a code execution engine, ExecEval, supporting all languages, and proposes a data splitting and selection scheme based on geometric mean and graph‑theoretic principles to balance the distribution of multiple attributes.

Updated 6/6/2024
hugging_face

Description

Dataset Overview

Basic Information

  • name: xCodeEval
  • languages: code, English
  • language creation method: discovery, expert generation
  • license: cc-by-nc-4.0
  • multilinguality: multilingual
  • size: 1M<n<10M, 10M<n<100M
  • source: raw data

Tags

  • programming languages
  • code
  • program synthesis
  • automatic code repair
  • code retrieval
  • code translation
  • code classification

Task Categories

  • translation
  • token classification
  • text‑to‑text generation
  • text retrieval
  • text generation
  • text classification
  • feature extraction
  • question answering

Dataset Description

  • xCodeEval is a large‑scale multilingual multitask benchmark, containing ~25M document‑level code examples, covering ~7.5K unique problems and 17 programming languages.
  • The dataset includes seven tasks covering code understanding, generation, translation, and retrieval, evaluated via execution.
  • The authors developed ExecEval, a multilingual code execution engine that supports all languages in the dataset.
  • They also propose a data splitting and selection scheme, based on the geometric mean and graph‑theoretic principles, that balances the distribution of multiple attributes.
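
To make "evaluated via execution" concrete, the following is a minimal sketch of the general idea: run a candidate program on a test input and compare its standard output with the expected output. It is not ExecEval and does not use its API. It handles only Python programs, runs them without any sandboxing, and the names passes_test and passes_all are hypothetical.

```python
import subprocess
import sys


def passes_test(source_code, test_input, expected_output, timeout=5.0):
    """Run a Python program on one input and compare its stdout to the expected text."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source_code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A program that does not finish in time counts as a failure.
        return False
    if result.returncode != 0:
        # Runtime errors and non-zero exits count as failures.
        return False
    # Surrounding whitespace is ignored; this is an assumed comparison rule.
    return result.stdout.strip() == expected_output.strip()


def passes_all(source_code, test_cases):
    """Return True only if every (input, expected_output) pair passes."""
    return all(passes_test(source_code, i, o) for i, o in test_cases)
```

A candidate counts as correct only when passes_all returns True for all of its test cases. Untrusted code should be run in an isolated environment rather than directly on the host as done here.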

Data Download

  • The dataset can be loaded with the Hugging Face load_dataset() API.
  • The data can also be downloaded from Hugging Face with Git LFS.
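
As a rough sketch of the first option, the helper below streams a few examples with the Hugging Face datasets library. The repository ID is taken from this page. The configuration name in the usage comment ("program_synthesis") and the split name are assumptions that should be checked against the dataset card, and the helper name stream_examples is hypothetical. Depending on the installed version of the library, extra arguments may be needed.

```python
from itertools import islice

DATASET_ID = "NTU-NLP-sg/xCodeEval"


def stream_examples(config_name, split="train", limit=3):
    """Return the first `limit` examples of one task configuration."""
    # Imported lazily so this module can be loaded without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it all first.
    dataset = load_dataset(DATASET_ID, config_name, split=split, streaming=True)
    return list(islice(dataset, limit))


# Example usage (assumed configuration name; requires network access):
# for example in stream_examples("program_synthesis"):
#     print(sorted(example.keys()))
```

Streaming is used here because the full dataset is large; dropping streaming=True would download and cache the selected configuration instead.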

Task Details

  1. Tag Classification
  2. Code Compilation
  3. Program Synthesis
  4. Code Translation
  5. Automated Program Repair
  6. Code‑to‑Code Retrieval
  7. Natural Language‑to‑Code Retrieval

Shared Data

  • problem_descriptions.jsonl
  • unittest_db.json
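
The two shared files can be read with the Python standard library once they have been downloaded. The sketch below assumes only what the file extensions imply: problem_descriptions.jsonl holds one JSON record per line, and unittest_db.json is a single JSON document. The record schema is not described on this page, so no field names are assumed, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_json(path):
    """Load a whole JSON document into memory."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


# Example usage, assuming both files are in the working directory:
# problems = list(iter_jsonl("problem_descriptions.jsonl"))
# unit_tests = load_json("unittest_db.json")
```

Reading the JSONL file lazily keeps memory use low, whereas the JSON file has to be parsed in one piece.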

Topics

Code Analysis
Multilingual Processing

Source

Organization: hugging_face

Created: Unknown
