wenhu/hybrid_qa

HybridQA is a large‑scale question‑answering dataset that requires reasoning over heterogeneous information. Each question is linked to a Wikipedia table and to multiple free‑form passages aligned with entities in that table. The questions are designed to need both tabular and textual information; without either source, a question cannot be answered. The dataset includes a training set (62,682 instances), a validation set (3,466), and a test set (3,463). It is in English and is released under a CC‑BY‑4.0 license.

Updated 12/18/2023
hugging_face

Description

Dataset Overview

Dataset Description

Dataset Summary

HybridQA is a large‑scale QA dataset that requires reasoning over heterogeneous information. Each question is associated with a Wikipedia table and several free‑text passages linked to entities in the table. Answering a question requires aggregating information from both the table and the text; without either source, the question cannot be answered.

Supported Tasks and Leaderboards

[More information to be added]

Language

The dataset is in English.

Dataset Structure

Data Instances

A typical data instance includes the following fields:

  • question_id (string)
  • question (string)
  • table_id (string)
  • answer_text (string)
  • question_postag (string)
  • table (dictionary):
    • url (string)
    • title (string)
    • header (list of strings)
    • data (list of dictionaries):
      • value (string)
      • urls (list of dictionaries):
        • url (string)
        • summary (string)
  • section_title (string)
  • section_text (string)
  • uid (string)
  • intro (string)
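
The nesting above (a table whose cells each carry a list of linked passages) can be easier to follow in code. The following is a minimal sketch, not the dataset's official loader: the `example` record is hand-built with made-up placeholder values, only the fields the helper reads are filled in, and the helper name `linked_passages` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical record that mirrors part of the field listing above.
# All values are invented placeholders, not real HybridQA data.
example: Dict[str, Any] = {
    "question_id": "q-0001",
    "question": "Which city hosts the club listed in the first cell?",
    "answer_text": "Example City",
    "table": {
        "url": "https://example.org/wiki/Example_Table",
        "title": "Example Table",
        "header": ["Club", "Founded"],
        "data": [
            {
                "value": "Example Club",
                "urls": [
                    {"url": "/wiki/Example_Club", "summary": "Example Club is based in Example City."},
                    {"url": "/wiki/Example_City", "summary": "Example City is a placeholder city."},
                ],
            },
            # A cell with no linked passage has an empty urls list.
            {"value": "1901", "urls": []},
        ],
    },
}


def linked_passages(instance: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Collect (cell value, passage url, passage summary) for every linked cell."""
    rows: List[Tuple[str, str, str]] = []
    for cell in instance["table"]["data"]:
        for link in cell["urls"]:
            rows.append((cell["value"], link["url"], link["summary"]))
    return rows


for value, url, summary in linked_passages(example):
    print(f"{value} -> {url}: {summary}")
```

A model that answers HybridQA questions typically needs both views: the cell values for the tabular hop and the summaries for the textual hop, which is why the helper keeps them paired.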

Data Splits

  • Training set: 62,682 instances
  • Validation set: 3,466 instances
  • Test set: 3,463 instances
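
As a quick sanity check on the split sizes listed above, the short sketch below sums them and prints each split's share of the total. The dictionary keys are illustrative names chosen here, not identifiers confirmed by the dataset.

```python
# Split sizes copied from the list above; the key names are illustrative.
split_sizes = {"train": 62_682, "validation": 3_466, "test": 3_463}

total = sum(split_sizes.values())

# Fraction of all instances that falls into each split.
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total instances: {total}")
for name, size in split_sizes.items():
    print(f"{name:<10} {size:>6}  {shares[name]:.1%}")
```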

Dataset Creation

Rationale

[More information to be added]

Source Data

[More information to be added]

Annotation

[More information to be added]

Usage Considerations

Societal Impact

[More information to be added]

Bias Discussion

[More information to be added]

Known Limitations

[More information to be added]

Additional Information

Curators

[More information to be added]

License

The dataset is licensed under the Creative Commons Attribution 4.0 International License.

Citation

@article{chen2020hybridqa,
  title={HybridQA: A Dataset of Multi‑Hop Question Answering over Tabular and Textual Data},
  author={Chen, Wenhu and Zha, Hanwen and Chen, Zhiyu and Xiong, Wenhan and Wang, Hong and Wang, William},
  journal={Findings of EMNLP 2020},
  year={2020}
}

Contributions

Thanks to @patil-suraj for adding this dataset.

Topics

Multi-hop Question Answering
Heterogeneous Information Reasoning

Source

Organization: hugging_face

Created: Unknown
