BiomixQA

BiomixQA is a biomedical question‑answering dataset with two question types: multiple‑choice and true/false. It has been used to evaluate knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across several large language models (LLMs). Because it varies both the question format and the biomedical concepts covered, it is well suited to assessing KG‑RAG performance, and it also supports research and development in biomedical NLP, knowledge‑graph reasoning, and QA systems. The questions draw on several biomedical knowledge graphs and databases, including SPOKE, DisGeNET, MONDO, SemMedDB, Monarch Initiative, and ROBOKOP.

Updated 9/4/2024
huggingface

Description

BiomixQA Dataset

Overview

BiomixQA is a carefully curated biomedical QA dataset consisting of two distinct components:

  1. Multiple‑Choice Questions (MCQ)
  2. True/False Questions

The dataset has been employed to benchmark knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its heterogeneous question formats and broad coverage of biomedical concepts make it particularly suitable for evaluating KG‑RAG performance.

BiomixQA is also intended to support research and development in biomedical natural language processing, knowledge‑graph reasoning, and question‑answering systems.

Dataset Details

Components

1. Multiple‑Choice Questions (MCQ)

  • File: mcq_biomix.csv
  • Size: 306 items
  • Format: Each item presents five options with a single correct answer.

2. True/False Questions

  • File: true_false_biomix.csv
  • Size: 311 items
  • Format: Binary (True/False) questions.
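As a quick sanity check after downloading, the two component files can be read and their row counts compared with the sizes listed above. The following is a minimal sketch using only the Python standard library. It assumes the files are ordinary comma-separated CSVs with a header row and one question per record; the helper names `load_rows` and `check_size` are illustrative and not part of the dataset.

```python
import csv
from pathlib import Path

# Item counts as stated in the dataset card.
EXPECTED_SIZES = {
    "mcq_biomix.csv": 306,
    "true_false_biomix.csv": 311,
}


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row.

    Assumption: the file has a header row; column names are not
    documented in the card, so none are hard-coded here.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def check_size(csv_path):
    """Return (actual_rows, expected_rows) for one component file.

    expected_rows is None when the file name is not a known component.
    """
    path = Path(csv_path)
    rows = load_rows(path)
    return len(rows), EXPECTED_SIZES.get(path.name)
```

Calling `check_size("mcq_biomix.csv")` on a local copy returns the observed and expected counts as a pair, so a mismatch (for example from a truncated download) is easy to spot.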

Potential Applications

  1. Evaluation of biomedical QA systems
  2. Testing of NLP models in the biomedical domain
  3. Benchmarking various retrieval‑augmented generation (RAG) frameworks
  4. Supporting research on biomedical ontologies and knowledge graphs
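For the evaluation and benchmarking uses above, one natural metric is plain accuracy: the fraction of questions for which a system's answer matches the reference answer. The sketch below is one way to compute it, not a metric prescribed by the dataset card. It assumes answers are compared as strings after trimming whitespace and lower-casing; `normalize` and `accuracy` are hypothetical helper names.

```python
def normalize(answer):
    """Canonicalize an answer for comparison (assumed convention:
    ignore surrounding whitespace and letter case)."""
    return str(answer).strip().lower()


def accuracy(gold_answers, predicted_answers):
    """Fraction of predictions that match the gold answers exactly
    after normalization."""
    if len(gold_answers) != len(predicted_answers):
        raise ValueError("gold and predicted answers differ in length")
    if not gold_answers:
        raise ValueError("no items to score")
    correct = sum(
        normalize(gold) == normalize(pred)
        for gold, pred in zip(gold_answers, predicted_answers)
    )
    return correct / len(gold_answers)
```

The same function applies to both components: for true/false items the answers would be the two truth values, and for multiple‑choice items the selected option. Running it once with and once without knowledge‑graph context gives a simple side‑by‑side comparison for a given LLM.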

Source Data

  1. SPOKE – a large‑scale biomedical knowledge graph (~40 M concepts, ~140 M relations) (Morris et al., 2023).
  2. DisGeNET – curated gene‑disease associations from databases, GWAS, animal models, and literature (Piñero et al., 2016).
  3. MONDO – disease ontology in OBO format (Vasilevsky et al., 2022).
  4. SemMedDB – semantic predications extracted from PubMed citations (Kilicoglu et al., 2012).
  5. Monarch Initiative – disease‑gene association platform (Mungall et al., 2017).
  6. ROBOKOP – knowledge‑graph‑based biomedical data integration and analysis system (Bizon et al., 2019).

Citation

If you use this dataset in your research, please cite the following paper:

@article{soman2023biomedical,
  title={Biomedical knowledge graph‑enhanced prompt generation for large language models},
  author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta‑Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk‑Jackson, Angela and others},
  journal={arXiv preprint arXiv:2311.17330},
  year={2023}
}

Topics

Biomedical
Question Answering Systems

Source

Organization: huggingface

Created: 9/4/2024
