DAHL

DAHL is a benchmark for evaluating hallucination in long-form biomedical text generation, curated by Seoul National University. It comprises 8,573 questions across 29 categories, sourced from biomedical research papers on PubMed Central (PMC). Questions were generated automatically and then filtered manually for quality and answerability. DAHL measures hallucination in large language models by decomposing each model response into atomic units and assessing the factual accuracy of each unit, which gives a finer-grained evaluation than traditional multiple-choice tasks. Its primary applications are in biomedical and clinical research, where it is used to address factual conflicts in generated text.

Updated 11/14/2024

arXiv

Description

DAHL Dataset Overview

Dataset Construction

Source: Research papers crawled from PubMed Central (PMC).
Generation: Questions were created with gpt-4-1106-preview and then manually filtered for quality.

Evaluation Procedure

Automated Evaluation Pipeline: Consists of two stages:
1. Segment responses into atomic units.
2. Verify factuality of each atomic unit.
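
The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, the regex sentence splitter is only a stand-in for whatever segmentation the real pipeline performs, the factuality check is supplied by the caller, and averaging per-response fractions is an assumed way of aggregating the results into a single score.

```python
import re
from typing import Callable, List


def split_into_atomic_units(response: str) -> List[str]:
    """Stage 1 (hypothetical stand-in): naive sentence splitting.

    The real pipeline may segment responses differently, for example
    with a language model; this regex is only for illustration.
    """
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def response_score(response: str, is_factual: Callable[[str], bool]) -> float:
    """Stage 2: fraction of atomic units in one response judged factual."""
    units = split_into_atomic_units(response)
    if not units:
        return 0.0
    return sum(1 for u in units if is_factual(u)) / len(units)


def aggregate_score(responses: List[str], is_factual: Callable[[str], bool]) -> float:
    """Mean of per-response fractions (assumed aggregation, not confirmed here)."""
    if not responses:
        return 0.0
    return sum(response_score(r, is_factual) for r in responses) / len(responses)
```

In practice, `is_factual` would wrap a call to an external verifier. Passing it in as a plain callable keeps the sketch independent of any particular API.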

Installation & Usage

Installation:

git clone https://github.com/seemdog/DAHL.git
cd DAHL

Response Generation:

HuggingFace Model:

python generate_response_hf.py --model meta-llama/Meta-Llama-3-8B-Instruct --temperature 0.6 --max_new_tokens 256

OpenAI Model:

python generate_response_gpt.py --model gpt-4o --api_key YOUR_API_KEY --temperature 0.6

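Evaluation (the four arguments to run.sh are positional placeholders: the model to evaluate, an OpenAI API key, a Perplexity AI API key, and the Perplexity AI model to use):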

cd evaluate
sh run.sh model_to_evaluate openAI_API_key perplexityAI_API_key model_to_use_perplexityAI

Result Storage

Final DAHL Score: Saved in a .txt file.

Citation

Citation: To be determined (TBD).

Topics

Biomedical

Model Evaluation

Source

Organization: arXiv

Created: 11/14/2024
