
PediaBench

PediaBench is a Chinese dataset specifically designed to evaluate large language models (LLMs) on pediatric question‑answering tasks. Created by research teams at Guizhou University and East China Normal University, it contains 4,565 objective questions and 1,632 subjective questions covering 12 pediatric diseases. Sources include the Chinese National Medical Licensing Examination, university final exams, and pediatric diagnostic and treatment standards. The dataset was built by collecting questions from multiple reliable sources and applying comprehensive scoring criteria to assess LLMs in instruction following, knowledge understanding, and clinical case analysis. PediaBench addresses the lack of pediatric coverage in existing medical QA datasets, providing a thorough benchmark for LLMs in the pediatric domain.

Updated 12/9/2024
arXiv

Description

PediaBench: A Comprehensive Chinese Pediatric Dataset for Evaluating Large Language Models

1. Introduction

PediaBench is the first comprehensive Chinese pediatric dataset for assessing the performance of large language models (LLMs) in the medical domain, specifically pediatric question answering (QA). It comprises 4,565 objective questions and 1,632 subjective questions covering 12 typical pediatric disease groups and five distinct question types.

2. Dataset

2.1 Question Types

PediaBench includes the following five typical medical question types to evaluate an LLM acting as a pediatric AI assistant:

  • True or False (ToF): Determine whether a statement is factual.
  • Multiple Choice (MC): Choose one or more correct options from a list.
  • Pairing (PA): Match a sentence with the missing word from a candidate list.
  • Essay/Short Answer (ES): Provide a detailed explanation of a specific concept.
  • Case Analysis (CA): Diagnose and propose treatment based on a case description.

2.2 Dataset Statistics

PediaBench contains 5,749 questions distributed as follows:

  • True or False: 258
  • Multiple Choice: 3,576
  • Pairing: 283
  • Essay/Short Answer: 1,565
  • Case Analysis: 67

Except for case‑analysis questions, the remaining 5,682 questions are classified into 12 disease groups according to the WHO International Classification of Diseases (ICD‑11) standard.
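As a quick sanity check, the per-type counts above can be summed to confirm the stated totals (this is a minimal arithmetic sketch, not part of the dataset's tooling):

```python
# Per-type question counts as reported in the dataset statistics.
counts = {
    "True or False": 258,
    "Multiple Choice": 3576,
    "Pairing": 283,
    "Essay/Short Answer": 1565,
    "Case Analysis": 67,
}

total = sum(counts.values())                       # all questions
non_case_analysis = total - counts["Case Analysis"]  # questions classified by ICD-11 group

print(total)              # → 5749
print(non_case_analysis)  # → 5682
```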

2.3 Evaluation Metrics

To accurately assess each LLM's performance on pediatric QA, PediaBench employs difficulty‑aware scoring combined with automatic grading:

  • True/False and Multiple Choice: graded by accuracy, with each question's score weighted by its difficulty.
  • Pairing: Full correctness yields 3 points, partial correctness 1 point, otherwise 0.
  • Essay and Case Analysis: Open‑ended questions graded automatically by LLMs; case‑analysis questions carry twice the weight of essay questions.
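The scoring rules above can be sketched as follows. Note that PediaBench's exact difficulty weights and grader scale are not specified in this description, so the numeric parameters (the default weight, the 0–1 grader scale) are illustrative assumptions:

```python
def score_pairing(pred: set, gold: set) -> int:
    """Pairing (PA): 3 points for a full match, 1 for a partial match, else 0."""
    if pred == gold:
        return 3
    if pred & gold:
        return 1
    return 0

def score_objective(correct: bool, difficulty_weight: float = 1.0) -> float:
    """ToF/MC: accuracy weighted by a per-question difficulty weight
    (the actual weights used by PediaBench are not given here)."""
    return difficulty_weight if correct else 0.0

def score_subjective(grader_score: float, is_case_analysis: bool) -> float:
    """ES/CA: open-ended score from an automatic LLM grader (assumed 0-1 scale);
    case-analysis questions carry twice the weight of essay questions."""
    weight = 2.0 if is_case_analysis else 1.0
    return weight * grader_score
```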

3. Experiments

3.1 Main Results

We evaluated 20 general‑purpose and medical‑domain LLMs, including open‑source and commercial models. Most models performed well on certain disease groups but struggled uniformly on subjective questions across all groups.

3.2 Results by Disease Group

Scoring ratios were computed for each disease group. Models achieved the highest scores on the HCDA and DImS groups.

4. Usage Guide

  • The dataset resides in the /data directory. After obtaining model responses, compile answers for the five question types into a .xlsx file following the samples.xlsx format.
  • Run the evaluation script to obtain scores per question type, per disease group, and an overall weighted total.
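The aggregation performed by the evaluation script can be sketched as below, assuming a hypothetical per-question record format. The field names (`qtype`, `group`, `score`, `max_score`) are illustrative and do not reflect the actual samples.xlsx schema:

```python
from collections import defaultdict

# Hypothetical graded-question records; field names are illustrative only.
records = [
    {"qtype": "MC", "group": "G1", "score": 1.0, "max_score": 1.0},
    {"qtype": "PA", "group": "G1", "score": 1.0, "max_score": 3.0},
    {"qtype": "ES", "group": "G2", "score": 0.6, "max_score": 1.0},
]

def scoring_ratios(records, key):
    """Scoring ratio per category: earned points / available points."""
    earned, available = defaultdict(float), defaultdict(float)
    for r in records:
        earned[r[key]] += r["score"]
        available[r[key]] += r["max_score"]
    return {k: earned[k] / available[k] for k in earned}

by_type = scoring_ratios(records, "qtype")   # scores per question type
by_group = scoring_ratios(records, "group")  # scores per disease group
```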

5. Limitations

5.1 Scope

Although PediaBench contains a large number of pediatric questions, it cannot cover every pediatric disease and treatment method found in real‑world practice. Future work will expand to additional medical domains and consider stricter scoring strategies.

5.2 Ethical Data Collection

All source material for PediaBench is publicly available and free to use. Appropriate anonymization has been performed, and no patient‑identifying private information is included.

Citation

If this dataset benefits your research, please cite:

@misc{zhang2024pediabenchcomprehensivechinesepediatric,
      title={PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models},
      author={Qian Zhang and Panfeng Chen and Jiali Li and Linkun Feng and Shuyu Liu and Mei Chen and Hui Li and Yanhao Wang},
      year={2024},
      eprint={2412.06287},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.06287}
}



