wenhu/hybrid_qa
HybridQA is a large‑scale question‑answering dataset that requires reasoning over heterogeneous information. Each question is linked to a Wikipedia table and to multiple free‑form passages that are aligned with entities in the table. The questions are designed to need both tabular and textual information; without either source, a question is unanswerable. The dataset includes a training set (62,682 instances), a validation set (3,466), and a test set (3,463). The dataset is in English and is released under a CC‑BY‑4.0 license.
Dataset Overview
Dataset Description
Dataset Summary
HybridQA is a large‑scale QA dataset that requires reasoning over heterogeneous information. Each question is associated with a Wikipedia table and several free‑text passages linked to entities in the table. The questions are designed to require aggregating both tabular and textual information; without either source, a question cannot be answered.
Supported Tasks and Leaderboards
[More information to be added]
Language
The dataset is in English.
Dataset Structure
Data Instances
A typical data instance includes the following fields:
- question_id (string)
- question (string)
- table_id (string)
- answer_text (string)
- question_postag (string)
- table (dictionary):
  - url (string)
  - title (string)
  - header (list of strings)
  - data (list of dictionaries):
    - value (string)
    - urls (list of dictionaries):
      - url (string)
      - summary (string)
  - section_title (string)
  - section_text (string)
  - uid (string)
  - intro (string)
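To make the nesting of the fields above concrete, here is a minimal Python sketch that walks one instance and collects the passages linked from its table cells. The helper name `linked_passages` and every value in `example` are hypothetical and invented for illustration; they are not taken from the dataset. The sample also omits the section_title, section_text, uid, and intro fields for brevity.

```python
from typing import Any, Dict, List, Tuple


def linked_passages(instance: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Collect (cell value, passage url, passage summary) triples from one instance."""
    triples = []
    # Each entry of table["data"] is a cell with a text value and a list of links.
    for cell in instance["table"]["data"]:
        for link in cell["urls"]:
            triples.append((cell["value"], link["url"], link["summary"]))
    return triples


# Hypothetical instance: all values below are made up to mirror the field layout.
example = {
    "question_id": "q-0001",
    "question": "Which river runs through the city listed first ?",
    "table_id": "example_table_0",
    "answer_text": "Example River",
    "question_postag": "WDT NN VBZ IN DT NN VBN RB .",
    "table": {
        "url": "https://en.wikipedia.org/wiki/Example",
        "title": "Example table",
        "header": ["City", "Country"],
        "data": [
            {
                "value": "Example City",
                "urls": [
                    {
                        "url": "/wiki/Example_City",
                        "summary": "Example City lies on the Example River.",
                    }
                ],
            },
            # A cell without any linked passage has an empty urls list.
            {"value": "Exampleland", "urls": []},
        ],
    },
}

for value, url, summary in linked_passages(example):
    print(f"{value} -> {url}: {summary}")
```

In this made-up instance the answer appears only in the linked passage, while the table is needed to find which cell to follow. That is the kind of combined table and text reasoning the dataset summary describes.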
Data Splits
- Training set: 62,682 instances
- Validation set: 3,466 instances
- Test set: 3,463 instances
Dataset Creation
Rationale
[More information to be added]
Source Data
[More information to be added]
Annotation
[More information to be added]
Usage Considerations
Societal Impact
[More information to be added]
Bias Discussion
[More information to be added]
Known Limitations
[More information to be added]
Additional Information
Curators
[More information to be added]
License
The dataset is licensed under the Creative Commons Attribution 4.0 International License.
Citation
@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi‑Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}
Contributions
Thanks to @patil-suraj for adding this dataset.