---
language:
- en
license: apache-2.0
---

# MiniMuSiQue by Morph Labs

**https://morph.so/blog/self-teaching/**

We describe two evaluation datasets derived from the MuSiQue multi-hop question-answering dataset: MiniMuSiQue-hard (filtered for questions answerable by GPT-4 but not GPT-3.5, where performance degrades significantly if the first pivot document is removed) and MiniMuSiQue-easy (a larger dataset of convoluted off-distribution single-hop question-answer pairs).

## Table of Contents

1. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#dataset-description" target="_blank">Dataset Description</a>**
2. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#uses" target="_blank">Uses</a>**
3. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#contact" target="_blank">Contact</a>**
4. **<a href="https://huggingface.co/morph-labs/MiniMuSiQue#blogpost-and-citation" target="_blank">Blogpost and Citation</a>**

### Dataset Description

We refined the MuSiQue dataset to focus on questions that demand complex multi-hop reasoning, selecting questions which (1) GPT-4 could answer but GPT-3.5 could not, and which (2) were not answerable without the context relevant to the first reasoning step (the "first hop pivot document"). Specifically, we selected 768 random examples from the MuSiQue training set and ranked them by a combined score of difficulty (measured by the difference in ROUGE-L recall between GPT-4 and GPT-3.5) and the necessity of multi-hop reasoning (assessed by the change in ROUGE-L recall when the first hop pivot document was removed). We refer to the top-ranked 128 examples as MiniMuSiQue, and obtain MiniMuSiQue-hard by associating the original difficult MuSiQue multi-hop question-answer pair with each example.
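The selection score above can be sketched in a few lines. This is an illustrative reconstruction, not the code used to build the dataset: the token-level ROUGE-L recall implementation, the equal weighting of the two terms, and all function names are assumptions.

```python
# Hypothetical sketch of the MiniMuSiQue ranking score:
# difficulty gap (GPT-4 vs GPT-3.5) plus first-hop-pivot necessity,
# both measured as differences in ROUGE-L recall against the gold answer.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference token count."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)

def ranking_score(gold_answer, gpt4_answer, gpt35_answer, gpt4_answer_no_pivot):
    """Combined score used to rank the 768 candidate examples."""
    r4 = rouge_l_recall(gold_answer, gpt4_answer)
    r35 = rouge_l_recall(gold_answer, gpt35_answer)
    r4_ablated = rouge_l_recall(gold_answer, gpt4_answer_no_pivot)
    difficulty = r4 - r35            # GPT-4 vs GPT-3.5 recall gap
    hop_necessity = r4 - r4_ablated  # recall drop without the pivot document
    return difficulty + hop_necessity
```

Under this sketch, the 768 candidates would be sorted by `ranking_score` in descending order and the top 128 kept as MiniMuSiQue.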
To additionally test off-distribution single-hop factual recall, for each example we synthesized convoluted off-distribution single-hop question-answer pairs for up to five entities per document in MiniMuSiQue, resulting in the much larger single-hop dataset MiniMuSiQue-easy. Each MiniMuSiQue example consists of twenty documents sampled from different Wikipedia articles, to which we associate one hard MuSiQue multi-hop reasoning question for MiniMuSiQue-hard and many single-hop questions for MiniMuSiQue-easy.

- **Developed by:** **<a href="https://www.morph.so" target="_blank">Morph Labs</a>**
- **Refined from:** **<a href="https://arxiv.org/abs/2108.00573" target="_blank">MuSiQue</a>**
- **Language(s):** English
- **License:** **<a href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank">Apache 2.0</a>**

## Uses

Multi-hop questions, which require a series of interconnected reasoning steps over multiple documents, have historically been particularly challenging for models. However, creating multi-hop questions that truly necessitate knowledge-based reasoning is itself difficult: early benchmarks such as HotpotQA were found to be largely solvable through shortcuts. Constructing questions and contexts that avoid such shortcuts, and verifying their effectiveness, requires a comprehensive dataset development process. The MuSiQue dataset addresses many weaknesses of prior work and contains difficult multi-hop questions that are less susceptible to shortcuts. We derive MiniMuSiQue from MuSiQue to better assess model capabilities on multi-hop questions that truly necessitate knowledge-based reasoning.

## Contact

hello@morph.so

## Blogpost and Citation

**https://morph.so/blog/self-teaching/**

    @misc{MiniMuSiQue,
      title={MiniMuSiQue},
      author={Morph Labs, Jesse Michael Han, Eric Yu, Bentley Long, Pranav Mital, Brando Miranda},
      year={2023}
    }