DAHL
DAHL is a benchmark for evaluating hallucination in long-form biomedical text generation, curated by Seoul National University. It comprises 8,573 questions across 29 categories, sourced from biomedical research papers on PubMed Central (PMC). Questions were generated automatically and then manually filtered for quality and answerability. DAHL measures hallucination in large language models by decomposing each response into atomic units and checking the factual accuracy of each unit, which gives a finer-grained evaluation than traditional multiple-choice tasks. It is intended mainly for biomedical and clinical research, where factual conflicts in generated text need to be addressed.
Description
DAHL Dataset Overview
Dataset Construction
- Source: Generated from research papers crawled from PMC.
- Generation: Questions created with gpt-4-1106-preview and manually filtered for high quality.
Evaluation Procedure
- Automated Evaluation Pipeline: Consists of two stages:
  - Segment responses into atomic units.
  - Verify factuality of each atomic unit.
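The two stages above can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the DAHL implementation: the sentence-level splitter stands in for atomic-unit segmentation, `toy_checker` stands in for the factuality verifier, and scoring a response as the fraction of units judged factual is an assumed aggregation that the page does not spell out.

```python
import re
from typing import Callable, List


def split_into_atomic_units(response: str) -> List[str]:
    """Stage 1 stand-in (assumption): treat each sentence as one atomic unit."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def factual_fraction(response: str, is_factual: Callable[[str], bool]) -> float:
    """Stage 2 stand-in: check every unit and return the share judged factual.

    The aggregation (factual units / total units) is an assumption made
    for this sketch, not a formula taken from the DAHL page.
    """
    units = split_into_atomic_units(response)
    if not units:
        return 0.0
    return sum(1 for unit in units if is_factual(unit)) / len(units)


def toy_checker(unit: str) -> bool:
    """Hypothetical verifier: a lookup against a tiny hand-written fact set."""
    known_facts = {"Insulin lowers blood glucose."}
    return unit in known_facts


if __name__ == "__main__":
    response = "Insulin lowers blood glucose. Insulin is produced in the liver."
    print(factual_fraction(response, toy_checker))
```

In a real run, the splitter and the checker would be replaced by the benchmark's own automated components; the sketch only shows how per-unit judgments could roll up into a single number per response.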
Installation & Usage
- Installation:
    git clone https://github.com/seemdog/DAHL.git
    cd DAHL
- Response Generation:
  - HuggingFace Model:
      python generate_response_hf.py --model meta-llama/Meta-Llama-3-8B-Instruct --temperature 0.6 --max_new_tokens 256
  - OpenAI Model:
      python generate_response_gpt.py --model gpt-4o --api_key YOUR_API_KEY --temperature 0.6
- Evaluation:
    cd evaluate
    sh run.sh model_to_evaluate openAI_API_key perplexityAI_API_key model_to_use_perplexityAI
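The evaluation step takes four positional arguments in a fixed order, two of which are API keys. A small wrapper can keep the keys out of shell history by reading them from environment variables. This is a sketch under stated assumptions: the variable names `OPENAI_API_KEY` and `PERPLEXITY_API_KEY` and the helper names are hypothetical, and only the argument order comes from the command shown above.

```python
import os
import subprocess
from typing import List


def build_eval_command(
    model_to_evaluate: str,
    openai_key: str,
    perplexity_key: str,
    perplexity_model: str,
) -> List[str]:
    """Assemble the run.sh call in the positional order given in the usage notes."""
    return ["sh", "run.sh", model_to_evaluate, openai_key, perplexity_key, perplexity_model]


def run_evaluation(
    model_to_evaluate: str,
    perplexity_model: str,
    evaluate_dir: str = "evaluate",
) -> None:
    """Run the evaluation from inside the evaluate directory.

    The environment variable names are assumptions made for this sketch;
    a missing variable raises KeyError before anything is executed.
    """
    cmd = build_eval_command(
        model_to_evaluate,
        os.environ["OPENAI_API_KEY"],
        os.environ["PERPLEXITY_API_KEY"],
        perplexity_model,
    )
    subprocess.run(cmd, cwd=evaluate_dir, check=True)
```

Calling `run_evaluation(...)` from the repository root with the two model names would then be equivalent to the `cd evaluate` and `sh run.sh ...` steps.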
Result Storage
- Final DAHL Score: Saved in a .txt file.
Citation
- Citation: To be determined (TBD).
Source
Organization: arXiv
Created: 11/14/2024