wenhu/hybrid_qa
HybridQA is a large-scale question-answering dataset that requires reasoning over heterogeneous information. Each question is linked to a Wikipedia table and to multiple free-form passages that are aligned with entities in the table. The questions are designed to need both tabular and textual information; without either source, a question cannot be answered. The dataset includes a training set (62,682 instances), a validation set (3,466), and a test set (3,463). The dataset is in English and is released under a CC-BY-4.0 license.
Description
Dataset Overview
Dataset Description
Dataset Summary
HybridQA is a large-scale QA dataset that requires reasoning over heterogeneous information. Each question is associated with a Wikipedia table and several free-text passages linked to entities in the table. Answering a question requires aggregating information from both the table and the text; without either one, the question cannot be answered.
Supported Tasks and Leaderboards
[More information to be added]
Language
The dataset is in English.
Dataset Structure
Data Instances
A typical data instance includes the following fields:
- question_id (string)
- question (string)
- table_id (string)
- answer_text (string)
- question_postag (string)
- table (dictionary):
  - url (string)
  - title (string)
  - header (list of strings)
  - data (list of dictionaries):
    - value (string)
    - urls (list of dictionaries):
      - url (string)
      - summary (string)
  - section_title (string)
  - section_text (string)
  - uid (string)
  - intro (string)
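The field layout listed above can be sketched as Python types, which makes the nesting easier to see: each table cell carries a value plus zero or more linked passages. This is an illustrative sketch only. The class names, the `linked_summaries` helper, and every value in `EXAMPLE` are made up for this example and are not taken from the dataset, and it assumes that `section_title`, `section_text`, `uid`, and `intro` sit inside `table`.

```python
from typing import Dict, List, TypedDict


class LinkedPassage(TypedDict):
    url: str
    summary: str


class Cell(TypedDict):
    value: str
    urls: List[LinkedPassage]


class Table(TypedDict):
    url: str
    title: str
    header: List[str]
    data: List[Cell]
    section_title: str
    section_text: str
    uid: str
    intro: str


class Instance(TypedDict):
    question_id: str
    question: str
    table_id: str
    answer_text: str
    question_postag: str
    table: Table


def linked_summaries(instance: Instance) -> Dict[str, List[str]]:
    """Map each cell value to the summaries of the passages linked from it.

    Cells without linked passages are skipped.
    """
    result: Dict[str, List[str]] = {}
    for cell in instance["table"]["data"]:
        summaries = [passage["summary"] for passage in cell["urls"]]
        if summaries:
            result.setdefault(cell["value"], []).extend(summaries)
    return result


# Hypothetical record: all values below are invented for illustration.
EXAMPLE: Instance = {
    "question_id": "example-0001",
    "question": "In which city was the 2001 winner born?",
    "table_id": "Example_Race_Winners_0",
    "answer_text": "Springfield",
    "question_postag": "IN WDT NN VBD DT CD NN VBN .",
    "table": {
        "url": "https://example.org/wiki/Example_Race",
        "title": "Example Race",
        "header": ["Year", "Winner"],
        "data": [
            {"value": "2001", "urls": []},
            {
                "value": "Alice Example",
                "urls": [
                    {
                        "url": "/wiki/Alice_Example",
                        "summary": "Alice Example is a fictional sprinter born in Springfield.",
                    }
                ],
            },
        ],
        "section_title": "Winners",
        "section_text": "",
        "uid": "Example_Race_Winners_0",
        "intro": "The Example Race is a fictional annual event.",
    },
}

if __name__ == "__main__":
    print(linked_summaries(EXAMPLE))
```

The made-up question shows why both sources are needed: the table identifies the 2001 winner, but the birthplace appears only in the linked passage.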
Data Splits
- Training set: 62,682 instances
- Validation set: 3,466 instances
- Test set: 3,463 instances
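As a rough sketch, the split sizes above can be recorded as constants and used to sanity-check a downloaded copy. The split names `train`, `validation`, and `test` are assumed here (they are the usual Hugging Face names, not confirmed by this card), and the helper names are hypothetical. The optional loader assumes that the `datasets` library is installed and that the repository id in the title resolves on the Hub; depending on the library version, a configuration name or other arguments may also be required.

```python
from typing import Dict, List

# Split sizes as stated in this card; the key names are an assumption.
EXPECTED_SPLIT_SIZES: Dict[str, int] = {
    "train": 62_682,
    "validation": 3_466,
    "test": 3_463,
}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def size_mismatches(
    actual: Dict[str, int],
    expected: Dict[str, int] = EXPECTED_SPLIT_SIZES,
) -> List[str]:
    """Describe every split whose size is missing or differs from `expected`."""
    problems: List[str] = []
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


def load_split_sizes(repo_id: str = "wenhu/hybrid_qa") -> Dict[str, int]:
    """Optional: load the dataset from the Hub and count rows per split.

    Needs network access and the `datasets` package, so it is not called here.
    """
    from datasets import load_dataset  # imported lazily on purpose

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


if __name__ == "__main__":
    print(sum(EXPECTED_SPLIT_SIZES.values()))
    for name, share in split_fractions(EXPECTED_SPLIT_SIZES).items():
        print(f"{name}: {share:.1%}")
```

With these figures the three splits total 69,611 instances, of which the training set is roughly 90%.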
Dataset Creation
Rationale
[More information to be added]
Source Data
[More information to be added]
Annotation
[More information to be added]
Usage Considerations
Societal Impact
[More information to be added]
Bias Discussion
[More information to be added]
Known Limitations
[More information to be added]
Additional Information
Curators
[More information to be added]
License
The dataset is licensed under the Creative Commons Attribution 4.0 International License.
Citation
@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}
Contributions
Thanks to @patil-suraj for adding this dataset.
Source
Organization: hugging_face
Created: Unknown