High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

BiomixQA

BiomixQA is a biomedical question‑answering dataset with two question types: multiple‑choice and true/false. It is used to evaluate knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). Its variety of question formats and biomedical concepts makes it well suited to assessing KG‑RAG performance, and it also supports research and development in biomedical NLP, knowledge graph reasoning, and QA systems. Sources include multiple biomedical knowledge graphs and databases such as SPOKE, DisGeNET, MONDO, SemMedDB, Monarch Initiative, and ROBOKOP.

huggingface


DAHL

Biomedical

Model Evaluation

DAHL is a benchmark for evaluating hallucination in long‑form biomedical text generation, curated by Seoul National University. It comprises 8,573 questions across 29 categories, sourced from biomedical research papers in PubMed Central. The questions were generated automatically and then filtered manually for quality and answerability. DAHL measures hallucination in large language models by decomposing each model response into atomic units and checking the factual accuracy of each unit, which gives a finer‑grained evaluation than traditional multiple‑choice tasks. It is intended mainly for biomedical and clinical research on factual conflicts in generated text.

arXiv

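The atomic‑unit scoring idea described for DAHL can be illustrated with a minimal Python sketch. It assumes each response has already been split into atomic units and that each unit has been labeled factual or not by some external judge. The function names and the averaging scheme are illustrative assumptions, not DAHL's published implementation.

```python
from typing import List, Optional


def response_accuracy(unit_labels: List[bool]) -> Optional[float]:
    """Fraction of atomic units judged factual; None if there are no units."""
    if not unit_labels:
        return None
    return sum(unit_labels) / len(unit_labels)


def benchmark_score(all_labels: List[List[bool]]) -> Optional[float]:
    """Mean per-response accuracy, skipping responses with no units (assumed scheme)."""
    scores = [s for s in map(response_accuracy, all_labels) if s is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Hypothetical labels for three responses (True = unit judged factual).
labels = [[True, True, False, True], [True, False], []]
print(benchmark_score(labels))
```

Scoring per response before averaging keeps a long response with many units from dominating the overall figure; pooling all units instead would be an equally plausible choice.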

rag-datasets/rag-mini-bioasq

Biomedical

Question Answering Systems

This dataset is primarily used for question answering and sentence similarity tasks in the biomedical domain. It includes two configurations, text‑corpus and question‑answer‑passages, each stored under its own data file path. It is derived from the training set of BioASQ Task 11b; the subsets were generated with the `generate.py` script.

huggingface


PQAref

Biomedical

Question Answering Systems

The PQAref dataset is a reference question‑answering dataset for the biomedical domain, designed for fine‑tuning large language models. It comprises three components: an instruction (question), abstracts (relevant abstracts retrieved from PubMed, including PubMed ID, abstract title, and content), and an answer (expected answer with references in PubMed ID format). The dataset was created semi‑automatically, leveraging questions from the PubMedQA dataset.

huggingface

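The three‑part PQAref record described above can be sketched as a small Python data structure. The class names, field names, and the bracketed citation format `[PMID]` are assumptions made for illustration; the actual column names and reference syntax in the published dataset may differ.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Abstract:
    pubmed_id: str  # PubMed ID of the retrieved abstract
    title: str
    content: str


@dataclass
class PQARefRecord:
    instruction: str           # the question
    abstracts: List[Abstract]  # relevant abstracts retrieved from PubMed
    answer: str                # expected answer citing PubMed IDs


def cited_ids(answer: str) -> List[str]:
    """Extract cited IDs, assuming a hypothetical '[12345678]' citation style."""
    return re.findall(r"\[(\d+)\]", answer)


def unsupported_citations(record: PQARefRecord) -> List[str]:
    """Return cited IDs that do not match any abstract supplied with the record."""
    known = {a.pubmed_id for a in record.abstracts}
    return [pid for pid in cited_ids(record.answer) if pid not in known]


# Placeholder record; IDs and text are made up for the example.
record = PQARefRecord(
    instruction="Example question?",
    abstracts=[Abstract("11111111", "Example title", "Example abstract text.")],
    answer="Example answer [11111111], with one stray reference [22222222].",
)
print(unsupported_citations(record))
```

A check like `unsupported_citations` is one way to confirm that every reference in an answer points back to an abstract in the same record before using the data for fine‑tuning.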

MedPix-2.0

Biomedical

AI Applications

MedPix 2.0 is a multimodal biomedical dataset for AI applications. It pairs detailed clinical case information with medical images, including CT and MRI scans.

github
