Dataset asset · Open Source Community · Multi-hop Question Answering · Heterogeneous Information Reasoning

wenhu/hybrid_qa

HybridQA is a large‑scale question‑answering dataset that requires reasoning over heterogeneous information. Each question is linked to a Wikipedia table and multiple free‑form passages that are aligned with entities in the table. The questions are designed to need both tabular and textual information; missing either source makes the question unanswerable. The dataset includes a training set (62,682 instances), a validation set (3,466), and a test set (3,463). The dataset is in English and is released under the CC‑BY‑4.0 license.

Source

hugging_face

Created

Nov 28, 2025

Updated

Dec 18, 2023

Dataset Overview

Dataset Description

Dataset Summary

HybridQA is a large‑scale QA dataset that requires reasoning over heterogeneous information. Each question is associated with a Wikipedia table and several free‑text passages linked to entities in the table. Answering a question requires aggregating both tabular and textual information; without either source, the question cannot be answered.

Supported Tasks and Leaderboards

[More information to be added]

Language

The dataset is in English.

Dataset Structure

Data Instances

A typical data instance includes the following fields:

question_id (string)
question (string)
table_id (string)
answer_text (string)
question_postag (string)
table (dictionary):
- url (string)
- title (string)
- header (list of strings)
- data (list of dictionaries):
  - value (string)
  - urls (list of dictionaries):
    - url (string)
    - summary (string)
- section_title (string)
- section_text (string)
- uid (string)
- intro (string)
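
As a rough illustration of how the nested table field might be traversed, the sketch below collects every passage linked from a table cell. The sample instance and the helper name `linked_passages` are hypothetical; only the field names come from the list above, and all values are made up.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical instance mirroring the field list above; every value is
# invented for illustration and does not come from the dataset.
sample: Dict[str, Any] = {
    "question_id": "example-0",
    "question": "Placeholder question?",
    "table_id": "example_table_0",
    "answer_text": "placeholder answer",
    "question_postag": "NN NN .",
    "table": {
        "url": "https://example.org/table",
        "title": "Example table",
        "header": ["Name", "Year"],
        "data": [
            {
                "value": "Alpha",
                "urls": [{"url": "/wiki/Alpha", "summary": "Passage about Alpha."}],
            },
            {"value": "1999", "urls": []},  # a cell with no linked passage
        ],
    },
}


def linked_passages(instance: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (cell value, passage url, passage summary) for every linked cell."""
    passages = []
    for cell in instance["table"]["data"]:
        for link in cell["urls"]:
            passages.append((cell["value"], link["url"], link["summary"]))
    return passages


for value, url, summary in linked_passages(sample):
    print(f"{value} -> {url}: {summary}")
```

Depending on how the dataset is loaded, nested fields may arrive in a different container shape, so the access pattern here should be treated as an assumption rather than a guaranteed interface.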

Data Splits

Training set: 62,682 instances
Validation set: 3,466 instances
Test set: 3,463 instances
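
The split sizes above work out to roughly a 90/5/5 division. A small sketch of that arithmetic, using only the counts listed in this section:

```python
# Split sizes as listed in this section.
splits = {"train": 62_682, "validation": 3_466, "test": 3_463}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total instances: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```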

Dataset Creation

Rationale

[More information to be added]

Source Data

[More information to be added]

Annotation

[More information to be added]

Usage Considerations

Societal Impact

[More information to be added]

Bias Discussion

[More information to be added]

Known Limitations

[More information to be added]

Additional Information

Curators

[More information to be added]

License

The dataset is licensed under the Creative Commons Attribution 4.0 International License.

Citation

@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}

Contributions

Thanks to @patil-suraj for adding this dataset.
