Dataset asset | Open Source Community | Natural Language Processing | Document QA
DocuQA
This dataset is designed for testing document‑based question‑answering applications or APIs that accept PDF documents as input. It contains 20 distinct documents, each accompanied by 5 question types, for a total of 100 evaluation questions. The documents span journal articles, news reports, financial statements, and tutorials, and are meant to assess a QA system's ability to understand context, recognize keywords, and extract specific information.
Source
github
Created
Feb 14, 2024
Updated
Feb 15, 2024
Dataset Overview
Dataset Name
Document‑Based Question Answering Dataset
Purpose
To test PDF‑document‑based question‑answering applications or interfaces.
Content
- Number of Documents: 20
- Question Types per Document: 5 (total 100 questions)
- Document Types:
  - Journal articles (5): contain calculations, formulas, and numerical data
  - News articles (5): contain specific headlines and dates
  - Reports / Financial reports / News (5): contain specific numbers and monetary data
  - Tutorials (5): provide step‑by‑step instructions, including numerical values and units
Questions & Answers
- Question Design: Five question types per document, covering diverse aspects to comprehensively evaluate QA capability
- Answer Format: An answer key of ground‑truth answers
Accuracy Computation
- Method: Compute accuracy as the number of questions graded "TRUE" divided by the total number of questions. The result gauges how accurately the system extracts information from varied document types.
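As a minimal sketch of this computation, assume each question has already been graded against the answer key and recorded as a boolean (True for a correct answer). The function name `accuracy` and the input layout are illustrative assumptions, not part of the dataset, and the counts in the example are hypothetical.

```python
from typing import Sequence


def accuracy(graded: Sequence[bool]) -> float:
    """Return the fraction of graded answers marked True."""
    if not graded:
        raise ValueError("no graded answers supplied")
    # True counts as 1 and False as 0, so sum() gives the number correct.
    return sum(graded) / len(graded)


# Hypothetical grading of the 100 questions: 83 marked True, 17 marked False.
example = [True] * 83 + [False] * 17
print(f"accuracy = {accuracy(example):.2%}")
```

How an answer is judged "TRUE" (exact match, numeric tolerance, human review) is left to the evaluator; the sketch only covers the final ratio.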
Use Cases
- Evaluate performance of QA systems handling heterogeneous document and question types
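Because the documents and questions are heterogeneous, it can help to report accuracy per document type or per question type alongside the overall figure. The sketch below assumes each graded result is a `(group, is_correct)` pair; the function name `accuracy_by_group` and the group labels are hypothetical and do not reflect any file layout in the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each group label to the fraction of its answers marked True."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in results:
        total[group] += 1
        correct[group] += int(is_correct)
    # Every group in `total` has at least one entry, so no division by zero.
    return {group: correct[group] / total[group] for group in total}


# Hypothetical graded results, labeled by document type.
sample = [
    ("journal", True),
    ("journal", False),
    ("tutorial", True),
    ("tutorial", True),
]
for group, score in accuracy_by_group(sample).items():
    print(f"{group}: {score:.0%}")
```

The same function works for question types by passing the question type as the group label instead.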
Citation
- Authors: Fitria, Kaira Milani
- Year: 2024
- Dataset Name: DocuQA
- Repository: figshare
- DOI: https://doi.org/10.6084/m9.figshare.25223990