Dataset asset | Open Source Community | Multi-hop Question Answering | Heterogeneous Information Reasoning

wenhu/hybrid_qa

HybridQA is a large-scale question-answering dataset that requires reasoning over heterogeneous information. Each question is linked to a Wikipedia table and to multiple free-form passages that are aligned with entities in the table. The questions are designed to need both tabular and textual information; without either source, a question cannot be answered. The dataset includes a training set (62,682 instances), a validation set (3,466 instances), and a test set (3,463 instances). The dataset is in English and is released under the CC-BY-4.0 license.

Source: hugging_face
Created: Nov 28, 2025
Updated: Dec 18, 2023
Overview

Dataset Description

Dataset Summary

HybridQA is a large-scale QA dataset that requires reasoning over heterogeneous information. Each question is associated with a Wikipedia table and several free-text passages linked to entities in the table. The questions are designed so that answering them requires aggregating both table and text information; without either source, a question cannot be answered.
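To make the "both sources are needed" idea concrete, here is a minimal toy sketch of a two-hop lookup. The table rows, passages, question type, and the helper name `answer_founding_year` are all invented for illustration and are not taken from the dataset; the sketch only shows why neither the table nor the passages alone would be enough.

```python
import re
from typing import Optional

# Invented toy data: the table knows which club a player belongs to,
# but only the linked passage states when that club was founded.
table = [
    {"Player": "Player A", "Club": "Club X"},
    {"Player": "Player B", "Club": "Club Y"},
]
passages = {
    "Club X": "Club X is a fictional club founded in 1901.",
    "Club Y": "Club Y is a fictional club founded in 1950.",
}


def answer_founding_year(player: str) -> Optional[str]:
    """Answer 'When was the club of <player> founded?' in two hops."""
    # Hop 1 (table): player -> club.
    club = next((row["Club"] for row in table if row["Player"] == player), None)
    if club is None or club not in passages:
        return None  # the table hop failed or no passage is linked
    # Hop 2 (text): club -> founding year mentioned in the passage.
    match = re.search(r"founded in (\d{4})", passages[club])
    return match.group(1) if match else None


print(answer_founding_year("Player A"))
```

Removing either `table` or `passages` leaves the question unanswerable, which is the property the dataset's questions are described as having.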

Supported Tasks and Leaderboards

[More information to be added]

Language

The dataset is in English.

Dataset Structure

Data Instances

A typical data instance includes the following fields:

  • question_id (string)
  • question (string)
  • table_id (string)
  • answer_text (string)
  • question_postag (string)
  • table (dictionary):
    • url (string)
    • title (string)
    • header (list of strings)
    • data (list of dictionaries):
      • value (string)
      • urls (list of dictionaries):
        • url (string)
        • summary (string)
  • section_title (string)
  • section_text (string)
  • uid (string)
  • intro (string)
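The nesting of the table fields above can be sketched as a plain Python dictionary. This is only an illustration under stated assumptions: every value is invented, only a subset of the listed fields is filled in, and `data` is assumed to be a flat list of cells, as the field list suggests. The helper `linked_passages` is a hypothetical name, not part of any library.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical instance: field names follow the list above, all values are invented.
example: Dict[str, Any] = {
    "question_id": "q-0001",
    "question": "When was the club of Player A founded?",
    "table_id": "Example_Table_0",
    "answer_text": "1901",
    "table": {
        "url": "https://example.org/wiki/Example_Table",
        "title": "Example Table",
        "header": ["Player", "Club"],
        "data": [
            {"value": "Player A", "urls": []},
            {
                "value": "Club X",
                "urls": [
                    {
                        "url": "/wiki/Club_X",
                        "summary": "Club X is a fictional club founded in 1901.",
                    }
                ],
            },
            {"value": "Player B", "urls": []},
            {
                "value": "Club Y",
                "urls": [
                    {
                        "url": "/wiki/Club_Y",
                        "summary": "Club Y is a fictional club founded in 1950.",
                    }
                ],
            },
        ],
    },
}


def linked_passages(instance: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Collect (cell value, link url, passage summary) for every linked cell."""
    found = []
    for cell in instance["table"]["data"]:
        for link in cell["urls"]:
            found.append((cell["value"], link["url"], link["summary"]))
    return found


for value, url, summary in linked_passages(example):
    print(f"{value} -> {url}: {summary}")
```

Walking `table.data` and then each cell's `urls` is how a reader would reach the passages that the questions depend on; cells without links simply contribute nothing.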

Data Splits

  • Training set: 62,682 instances
  • Validation set: 3,466 instances
  • Test set: 3,463 instances
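As a quick sanity check on the split sizes listed above, the following sketch computes the total and each split's share. The split keys (`train`, `validation`, `test`) are assumed names for illustration; only the counts come from the list.

```python
# Split sizes as listed above; the key names are assumed for illustration.
SPLIT_SIZES = {"train": 62_682, "validation": 3_466, "test": 3_463}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, count in SPLIT_SIZES.items():
    print(f"{name:<10} {count:>6,} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6,}")
```

The total is 69,611 instances, with roughly 90% in training and about 5% each in validation and test. The same dictionary can be compared against the lengths of the loaded splits to confirm that a download is complete.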

Dataset Creation

Rationale

[More information to be added]

Source Data

[More information to be added]

Annotation

[More information to be added]

Usage Considerations

Societal Impact

[More information to be added]

Bias Discussion

[More information to be added]

Known Limitations

[More information to be added]

Additional Information

Curators

[More information to be added]

License

The dataset is licensed under the Creative Commons Attribution 4.0 International License.

Citation

@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}

Contributions

Thanks to @patil-suraj for adding this dataset.
