PQAref
The PQAref dataset is a reference question‑answering dataset for the biomedical domain, designed for fine‑tuning large language models. It comprises three components: an instruction (question), abstracts (relevant abstracts retrieved from PubMed, including PubMed ID, abstract title, and content), and an answer (expected answer with references in PubMed ID format). The dataset was created semi‑automatically, leveraging questions from the PubMedQA dataset.
Description
Dataset Overview
Dataset Name
PubMed Referenced Question Answering Dataset
Dataset Description
PQAref is a dataset for fine‑tuning large language models on reference‑based question answering in the biomedical domain.
Dataset Content
The dataset includes three parts:
- Instruction: The question to be answered.
- Abstracts: Ten relevant PubMed abstracts, each containing PubMed ID, abstract title, and abstract content.
- Answer: The expected answer, containing references formatted as PubMed IDs.
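The three-part layout above can be sketched as a small data structure. This is an illustrative sketch only: the class and field names (`Record`, `Abstract`, `pmid`, and so on) are hypothetical and not taken from the dataset's actual schema, and the example record is invented. It also assumes that references appear in the answer text as bare numeric PubMed IDs, which the description does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass
class Abstract:
    pmid: str      # PubMed ID (hypothetical field name)
    title: str     # abstract title
    content: str   # abstract text


@dataclass
class Record:
    instruction: str            # the question to be answered
    abstracts: list[Abstract]   # retrieved PubMed abstracts
    answer: str                 # expected answer citing PubMed IDs


def cited_pmids(record: Record) -> list[str]:
    """Return the PMIDs of the supplied abstracts that occur in the answer.

    Assumption: a citation is the bare numeric ID appearing as a whole token.
    """
    return [
        a.pmid
        for a in record.abstracts
        if re.search(rf"\b{re.escape(a.pmid)}\b", record.answer)
    ]


# Invented example record, for illustration only.
example = Record(
    instruction="Does drug X reduce symptom Y?",
    abstracts=[
        Abstract("11111111", "Trial A", "..."),
        Abstract("22222222", "Trial B", "..."),
    ],
    answer="Drug X reduced symptom Y in one trial (11111111).",
)
print(cited_pmids(example))
```

A check like this may be useful when evaluating a fine-tuned model, since it shows whether a generated answer cites only the abstracts it was given.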
Dataset Creation Method
The dataset was created semi‑automatically, reusing questions from the PubMedQA dataset.
Dataset Features
- Input: string type
Dataset Splits
- Training set: 7,260 samples, approximately 136,602,852 bytes.
- Validation set: 907 samples, approximately 17,065,949 bytes.
- Test set: 908 samples, approximately 17,084,764 bytes.
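As a quick check on the split sizes listed above, the sample counts can be summed and expressed as shares of the whole. The snippet below is a small arithmetic sketch using only the counts stated in this section; the dictionary keys are illustrative labels, not confirmed split names.

```python
# Sample counts as listed under "Dataset Splits".
splits = {"train": 7260, "validation": 907, "test": 908}

total = sum(splits.values())
# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 9075
print(shares)  # {'train': 80.0, 'validation': 10.0, 'test': 10.0}
```

The counts therefore correspond to an 80/10/10 split over 9,075 samples in total.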
Dataset Size
- Download size: 82,888,007 bytes
- Total size: approximately 170,753,565 bytes
Task Categories
- Text Generation
- Question Answering
- Summarization
Language
- English
Tags
- Biology
- Biomedical
Scale
- 10M < n < 100M
Source
Organization: huggingface
Created: 7/2/2024