
BiomixQA

The BiomixQA dataset is a biomedical question‑answering collection featuring two question types: multiple‑choice and true/false. It is used to evaluate knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its diversity of question formats and biomedical concepts makes it especially suitable for assessing KG‑RAG performance, and it supports research and development in biomedical NLP, knowledge‑graph reasoning, and QA systems. Sources include multiple biomedical knowledge graphs and databases, among them SPOKE, DisGeNET, MONDO, SemMedDB, the Monarch Initiative, and ROBOKOP.

Source
huggingface
Created
Sep 4, 2024
Updated
Sep 4, 2024

BiomixQA Dataset

Overview

BiomixQA is a carefully curated biomedical QA dataset consisting of two distinct components:

  1. Multiple‑Choice Questions (MCQ)
  2. True/False Questions

The dataset has been employed to benchmark knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its heterogeneous question formats and broad coverage of biomedical concepts make it particularly suitable for evaluating KG‑RAG performance.

BiomixQA thereby supports research and development in biomedical natural language processing, knowledge‑graph reasoning, and question‑answering systems.

Dataset Details

Components

1. Multiple‑Choice Questions (MCQ)

  • File: mcq_biomix.csv
  • Size: 306 items
  • Format: Each item presents five options with a single correct answer.

2. True/False Questions

  • File: true_false_biomix.csv
  • Size: 311 items
  • Format: Binary (True/False) questions.
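The two components above are distributed as plain CSV files, so they can be inspected with standard tabular tooling. The sketch below assumes hypothetical column names (`question`, `option_a`…`option_e`, `correct_answer`) for illustration only; check the actual files for the real schema before relying on it:

```python
import io
import pandas as pd

# In-memory sample mimicking an *assumed* schema for mcq_biomix.csv.
# The real column names may differ -- inspect the CSV first.
mcq_csv = io.StringIO(
    "question,option_a,option_b,option_c,option_d,option_e,correct_answer\n"
    "Which gene is associated with disease X?,GENE1,GENE2,GENE3,GENE4,GENE5,GENE2\n"
)
mcq_df = pd.read_csv(mcq_csv)

# A sanity check one might run on the real file: every gold answer
# should appear among that row's five options.
option_cols = ["option_a", "option_b", "option_c", "option_d", "option_e"]
gold_in_options = mcq_df.apply(
    lambda row: row["correct_answer"] in row[option_cols].values, axis=1
)
print(gold_in_options.all())  # True
```

For the real dataset, replace the in-memory buffer with `pd.read_csv("mcq_biomix.csv")`; the same pattern applies to `true_false_biomix.csv`.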

Potential Applications

  1. Evaluation of biomedical QA systems
  2. Testing of NLP models in the biomedical domain
  3. Benchmarking various retrieval‑augmented generation (RAG) frameworks
  4. Supporting research on biomedical ontologies and knowledge graphs

Source Data

  1. SPOKE – a large‑scale biomedical knowledge graph (~40 M concepts, ~140 M relations) (Morris et al., 2023).
  2. DisGeNET – curated gene‑disease associations from databases, GWAS, animal models, and literature (Piñero et al., 2016).
  3. MONDO – disease ontology in OBO format (Vasilevsky et al., 2022).
  4. SemMedDB – semantic predications extracted from PubMed citations (Kilicoglu et al., 2012).
  5. Monarch Initiative – disease‑gene association platform (Mungall et al., 2017).
  6. ROBOKOP – knowledge‑graph‑based biomedical data integration and analysis system (Bizon et al., 2019).

Citation

If you use this dataset in your research, please cite the following paper:

@article{soman2023biomedical,
  title={Biomedical knowledge graph-optimized prompt generation for large language models},
  author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta-Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk-Jackson, Angela and others},
  journal={arXiv preprint arXiv:2311.17330},
  year={2023}
}