PQAref

The PQAref dataset is a reference question‑answering dataset for the biomedical domain, designed for fine‑tuning large language models. It comprises three components: an instruction (question), abstracts (relevant abstracts retrieved from PubMed, including PubMed ID, abstract title, and content), and an answer (expected answer with references in PubMed ID format). The dataset was created semi‑automatically, leveraging questions from the PubMedQA dataset.

Source

huggingface

Created

Jul 2, 2024

Updated

Jul 2, 2024

Dataset Overview

Dataset Name

PubMed Referenced Question Answering Dataset

Dataset Description

PQAref is a dataset for fine‑tuning large language models on reference‑based question answering in the biomedical domain.

Dataset Content

The dataset includes three parts:

Instruction: The question to be answered.
Abstracts: Ten relevant PubMed abstracts, each containing PubMed ID, abstract title, and abstract content.
Answer: The expected answer, containing references formatted as PubMed IDs.
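The three parts above can be pictured as a simple record. The sketch below is illustrative only: the class and field names (`Record`, `Abstract`, `pmid`, `title`, `content`) are hypothetical and are not taken from the dataset itself, the sample values are placeholders rather than real data, and the helper assumes that references appear in the answer as plain digit strings, which the description does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass
class Abstract:
    pmid: str      # PubMed ID (hypothetical field name)
    title: str     # abstract title
    content: str   # abstract text


@dataclass
class Record:
    instruction: str            # the question to be answered
    abstracts: list[Abstract]   # retrieved PubMed abstracts
    answer: str                 # expected answer with PubMed ID references


def cited_pmids(record: Record) -> list[str]:
    """Return the PMIDs of the provided abstracts that occur in the answer.

    Assumption: references are written as plain runs of digits in the answer.
    """
    digit_runs = set(re.findall(r"\d+", record.answer))
    return [a.pmid for a in record.abstracts if a.pmid in digit_runs]


# Placeholder values, not real dataset content.
example = Record(
    instruction="Placeholder question?",
    abstracts=[
        Abstract("11111111", "Placeholder title A", "Placeholder text A"),
        Abstract("22222222", "Placeholder title B", "Placeholder text B"),
    ],
    answer="Placeholder answer text [11111111].",
)

print(cited_pmids(example))
```

A check like this can help confirm that every reference in an answer points to one of the abstracts supplied with the question.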

Dataset Creation Method

The dataset was created semi‑automatically, reusing questions from the PubMedQA dataset.

Dataset Features

Input: string type

Dataset Splits

Training set: 7,260 samples, approximately 136,602,852 bytes.
Validation set: 907 samples, approximately 17,065,949 bytes.
Test set: 908 samples, approximately 17,084,764 bytes.
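As a quick arithmetic check on the split sizes, the sample counts listed above can be summed and turned into proportions. This is a minimal sketch using only those three counts; the dictionary keys are illustrative labels, not names confirmed by the dataset.

```python
# Sample counts as listed in the dataset splits.
splits = {"train": 7260, "validation": 907, "test": 908}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total samples: {total}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

The counts add up to 9,075 samples, which corresponds to roughly an 80/10/10 split across training, validation, and test.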

Dataset Size

Download size: 82,888,007 bytes
Total size: approximately 170,753,565 bytes

Task Categories

Text Generation
Question Answering
Summarization

Language

English

Scale

10M < n < 100M (as listed in the source metadata; note that the three splits total 9,075 samples)
