
BiomixQA

The BiomixQA dataset is a biomedical question‑answering collection featuring two question types: multiple‑choice and true/false. It is used to evaluate knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its diversity of question formats and biomedical concepts makes it especially suitable for assessing KG‑RAG performance, and it supports research and development in biomedical NLP, knowledge‑graph reasoning, and QA systems. Sources include multiple biomedical knowledge graphs and databases, among them SPOKE, DisGeNET, MONDO, SemMedDB, the Monarch Initiative, and ROBOKOP.

Source
huggingface
Created
Sep 4, 2024
Updated
Sep 4, 2024

BiomixQA Dataset

Overview

BiomixQA is a carefully curated biomedical QA dataset consisting of two distinct components:

  1. Multiple‑Choice Questions (MCQ)
  2. True/False Questions

The dataset has been employed to benchmark knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its heterogeneous question formats and broad coverage of biomedical concepts make it particularly suitable for evaluating KG‑RAG performance.

BiomixQA thereby supports research and development in biomedical natural language processing, knowledge‑graph reasoning, and question‑answering systems.

Dataset Details

Components

1. Multiple‑Choice Questions (MCQ)

  • File: mcq_biomix.csv
  • Size: 306 items
  • Format: Each item presents five options with a single correct answer.

2. True/False Questions

  • File: true_false_biomix.csv
  • Size: 311 items
  • Format: Binary (True/False) questions.
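The two components above are distributed as plain CSV files, so they can be inspected with standard tabular tooling. The sketch below assumes hypothetical column names (`question`, `option_a`…`option_e`, `correct_answer`) for illustration only; check the actual files for the real schema before relying on it:

```python
import io
import pandas as pd

# In-memory sample mimicking an *assumed* schema for mcq_biomix.csv.
# The real column names may differ -- inspect the CSV first.
mcq_csv = io.StringIO(
    "question,option_a,option_b,option_c,option_d,option_e,correct_answer\n"
    "Which gene is associated with disease X?,GENE1,GENE2,GENE3,GENE4,GENE5,GENE2\n"
)
mcq_df = pd.read_csv(mcq_csv)

# A sanity check one might run on the real file: every gold answer
# should appear among that row's five options.
option_cols = ["option_a", "option_b", "option_c", "option_d", "option_e"]
gold_in_options = mcq_df.apply(
    lambda row: row["correct_answer"] in row[option_cols].values, axis=1
)
print(gold_in_options.all())  # True
```

For the real dataset, replace the in-memory buffer with `pd.read_csv("mcq_biomix.csv")`; the same pattern applies to `true_false_biomix.csv`.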

Potential Applications

  1. Evaluation of biomedical QA systems
  2. Testing of NLP models in the biomedical domain
  3. Benchmarking various retrieval‑augmented generation (RAG) frameworks
  4. Supporting research on biomedical ontologies and knowledge graphs

Source Data

  1. SPOKE – a large‑scale biomedical knowledge graph (~40 M concepts, ~140 M relations) (Morris et al., 2023).
  2. DisGeNET – curated gene‑disease associations from databases, GWAS, animal models, and literature (Piñero et al., 2016).
  3. MONDO – disease ontology in OBO format (Vasilevsky et al., 2022).
  4. SemMedDB – semantic predications extracted from PubMed citations (Kilicoglu et al., 2012).
  5. Monarch Initiative – disease‑gene association platform (Mungall et al., 2017).
  6. ROBOKOP – knowledge‑graph‑based biomedical data integration and analysis system (Bizon et al., 2019).

Citation

If you use this dataset in your research, please cite the following paper:

@article{soman2023biomedical,
  title={Biomedical knowledge graph-optimized prompt generation for large language models},
  author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta-Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk-Jackson, Angela and others},
  journal={arXiv preprint arXiv:2311.17330},
  year={2023}
}