DocuQA

This dataset is designed for testing document‑based question‑answering applications or APIs that accept PDF documents as input. It contains 20 distinct documents, each accompanied by 5 different question types, for a total of 100 evaluation questions. The documents span journal articles, news reports, financial statements, and tutorials, so the dataset can assess a QA system's ability to understand context, recognize keywords, and extract specific information.

Source: GitHub
Created: Feb 14, 2024
Updated: Feb 15, 2024

Dataset Overview

Dataset Name

Document‑Based Question Answering Dataset

Purpose

To test PDF‑document‑based question‑answering applications or interfaces.

Content

Number of Documents: 20
Question Types per Document: 5 (total 100 questions)
Document Types:
- Journal articles (5): contain calculations, formulas, and numerical data
- News articles (5): contain specific headlines and dates
- Reports / Financial reports / News (5): contain specific numbers and monetary data
- Tutorials (5): provide step‑by‑step instructions, including numerical values and units

Questions & Answers

Question Design: Five question types per document, covering diverse aspects to comprehensively evaluate QA capability
Answer Format: Answer key based on ground‑truth answers

Accuracy Computation

Method: Accuracy is the number of questions marked "TRUE" (judged correct against the answer key) divided by the total number of questions. It gauges the system's ability to extract accurate information from varied document types.
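
The accuracy calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's scoring script: the record layout (`doc_type`, `correct`), the function names, and the sample judgments are all hypothetical, and it assumes each question has already been marked TRUE or FALSE against the answer key. The per‑type breakdown is an optional extra that follows from the four document types listed under Content.

```python
from collections import defaultdict


def accuracy(results):
    """Return the fraction of judgments marked TRUE (correct)."""
    if not results:
        raise ValueError("no results to score")
    return sum(1 for r in results if r["correct"]) / len(results)


def accuracy_by_type(results):
    """Group judgments by document type and score each group separately."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["doc_type"]].append(r)
    return {doc_type: accuracy(rs) for doc_type, rs in grouped.items()}


def make_judgments(doc_type, n_true, n_false):
    """Build hypothetical TRUE/FALSE judgments for one document type."""
    return (
        [{"doc_type": doc_type, "correct": True} for _ in range(n_true)]
        + [{"doc_type": doc_type, "correct": False} for _ in range(n_false)]
    )


# Hypothetical judgments: 5 documents x 5 questions = 25 per document type.
# These counts are invented for illustration and are not real results.
results = (
    make_judgments("journal", 20, 5)
    + make_judgments("news", 25, 0)
    + make_judgments("report", 15, 10)
    + make_judgments("tutorial", 22, 3)
)

overall = accuracy(results)
per_type = accuracy_by_type(results)

print(f"Overall accuracy: {overall:.2%}")
for doc_type, score in per_type.items():
    print(f"{doc_type}: {score:.2%}")
```

Reporting accuracy per document type alongside the overall figure can show whether a system struggles with one kind of content, such as numerical data in financial reports, even when the aggregate score looks acceptable.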

Use Cases

Evaluate performance of QA systems handling heterogeneous document and question types

Citation

Authors: Fitria, Kaira Milani
Year: 2024
Dataset Name: DocuQA
Repository: figshare
DOI: https://doi.org/10.6084/m9.figshare.25223990
