---
language:
- en
license: apache-2.0
---

# MiniMuSiQue by Morph Labs

**https://morph.so/blog/self-teaching/**

We describe two evaluation datasets derived from the MuSiQue multi-hop question-answering dataset: MiniMuSiQue-hard (filtered for questions answerable by GPT-4 but not GPT-3.5, where performance degrades significantly if the first pivot document is removed) and MiniMuSiQue-easy (a larger dataset of convoluted off-distribution single-hop question-answer pairs).

## Table of Contents

1. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#dataset-description" target="_blank">Dataset Description</a>**
2. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#uses" target="_blank">Uses</a>**
3. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#contact" target="_blank">Contact</a>**
4. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#blogpost-and-citation" target="_blank">Blogpost and Citation</a>**

### Dataset Description

We refined the MuSiQue dataset to focus on questions that demand complex multi-hop reasoning, selecting questions which (1) GPT-4 could answer but GPT-3.5 could not, and which (2) were not answerable without the context relevant to the first reasoning step (the "first hop pivot document"). Specifically, we selected 768 random examples from the MuSiQue training set and ranked them by a combined score of difficulty (measured by the difference in ROUGE-L recall between GPT-4 and GPT-3.5) and the necessity of multi-hop reasoning (assessed by the change in ROUGE-L recall when the first hop pivot document was removed). We refer to the top-ranked 128 examples as MiniMuSiQue, and obtain MiniMuSiQue-hard by associating the original difficult MuSiQue multi-hop question-answer pair with each example.
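The selection score above can be sketched in a few lines. This is an illustrative reconstruction, not the code used to build the dataset: the token-level ROUGE-L recall implementation, the equal weighting of the two terms, and all function names are assumptions.

```python
# Hypothetical sketch of the MiniMuSiQue ranking score:
# difficulty gap (GPT-4 vs GPT-3.5) plus first-hop-pivot necessity,
# both measured as differences in ROUGE-L recall against the gold answer.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference token count."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)

def ranking_score(gold_answer, gpt4_answer, gpt35_answer, gpt4_answer_no_pivot):
    """Combined score used to rank the 768 candidate examples."""
    r4 = rouge_l_recall(gold_answer, gpt4_answer)
    r35 = rouge_l_recall(gold_answer, gpt35_answer)
    r4_ablated = rouge_l_recall(gold_answer, gpt4_answer_no_pivot)
    difficulty = r4 - r35            # GPT-4 vs GPT-3.5 recall gap
    hop_necessity = r4 - r4_ablated  # recall drop without the pivot document
    return difficulty + hop_necessity
```

Under this sketch, the 768 candidates would be sorted by `ranking_score` in descending order and the top 128 kept as MiniMuSiQue.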
To additionally test off-distribution single-hop factual recall, for each example we synthesized convoluted off-distribution single-hop question-answer pairs for up to five entities per document in MiniMuSiQue, resulting in the much larger single-hop dataset MiniMuSiQue-easy. Each MiniMuSiQue example consists of twenty documents sampled from different Wikipedia articles, to which we associate one hard MuSiQue multi-hop reasoning question for MiniMuSiQue-hard and many single-hop questions for MiniMuSiQue-easy.

- **Developed by:** **<a href="https://www.morph.so" target="_blank">Morph Labs</a>**
- **Refined from:** **<a href="https://arxiv.org/abs/2108.00573" target="_blank">MuSiQue</a>**
- **Language(s):** English
- **License:** **<a href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank">Apache 2.0</a>**

## Uses

Multi-hop questions, which require a series of interconnected reasoning steps over multiple documents, have historically been particularly challenging for models. However, creating multi-hop questions that truly necessitate knowledge-based reasoning is itself difficult: early benchmarks such as HotpotQA were found to be largely solvable through shortcuts. Constructing questions and contexts that avoid such shortcuts, and verifying their effectiveness, requires a comprehensive dataset development process. The MuSiQue dataset addresses many weaknesses of prior work and contains difficult multi-hop questions that are less susceptible to shortcuts. We derive MiniMuSiQue from MuSiQue to better assess model capabilities on multi-hop questions that truly necessitate knowledge-based reasoning.

## Contact

hello@morph.so

## Blogpost and Citation

**https://morph.so/blog/self-teaching/**

    @misc{MiniMuSiQue,
      title={MiniMuSiQue},
      author={Morph Labs, Jesse Michael Han, Eric Yu, Bentley Long, Pranav Mital, Brando Miranda},
      year={2023}
    }