OpenGVLab/MM-NIAH
Needle In A Multimodal Haystack (MM‑NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) in understanding long multimodal documents. The benchmark requires models to answer specific questions based on key information scattered throughout multimodal documents. MM‑NIAH's evaluation data comprises three tasks: retrieval, counting, and reasoning. Key information (called “needles”) is embedded in the document's text or images; those inserted into text are referred to as text needles, and those inserted into images as image needles. Experimental results indicate that current MLLMs perform poorly when handling image‑based key information.
Description
Dataset Overview
Basic Information
- License: MIT
- Task Category: Question Answering
- Language: English
- Size: 10K < n < 100K
Configuration Details
- Config Name: val
- Data Files:
  - Split: val
    - Path: mm_niah_val/annotations/reasoning-text.jsonl
  - Split: test
    - Path: mm_niah_test/annotations/reasoning-text.jsonl
Dataset Introduction
- Name: Needle In A Multimodal Haystack (MM‑NIAH)
- Purpose: Evaluate the ability of existing multimodal large language models (MLLMs) to understand long multimodal documents.
- Task Types: Three tasks, namely retrieval, counting, and reasoning.
- Data Structure: The "needles" inserted into documents can be either text or images, referred to as text needles and image needles respectively.
Main Findings
- State‑of‑the‑art MLLMs (e.g., Gemini‑1.5) still struggle with understanding multimodal documents.
- All MLLMs perform poorly on image needles.
- MLLMs cannot accurately identify the number of images within a document.
- Models pretrained on image‑text interleaved data do not exhibit superior performance.
- Training on background documents does not improve performance on MM‑NIAH.
- A "lost in the middle" issue exists within MLLMs: performance degrades when needles are placed in the middle of the context.
- Long‑context capability of LLMs is not retained in MLLMs.
- Retrieval‑augmented generation (RAG) improves performance on text‑needle retrieval but not on image‑needle retrieval.
- Placing the question before the context does not boost model performance.
- Human performance on MM‑NIAH is near‑perfect.
Experimental Results
- Evaluation Metrics:
- Retrieval and Reasoning Tasks: Accuracy.
- Counting Task: Soft Accuracy, defined as $\frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{M_i}$, where $m_i$ is the number of positions at which the predicted list matches the ground-truth list for sample $i$, $M_i$ is the number of elements in the ground-truth list for sample $i$, and $N$ is the number of samples.
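The Soft Accuracy formula above can be sketched directly in Python. This is a minimal illustration of the metric as defined here, not the official scoring code; edge-case handling in calculate_scores.py may differ.

```python
def soft_accuracy(preds, gts):
    """Soft Accuracy for the counting task: for each sample i, the fraction
    m_i / M_i of positions where the predicted list matches the ground-truth
    list, averaged over all N samples."""
    total = 0.0
    for pred, gt in zip(preds, gts):
        # m_i: positionwise matches between prediction and ground truth
        matched = sum(p == g for p, g in zip(pred, gt))
        # M_i: number of elements in the ground-truth list
        total += matched / len(gt)
    return total / len(preds)

# First sample matches fully (2/2), second matches at 2 of 3 positions.
score = soft_accuracy([[2, 3], [1, 1, 4]], [[2, 3], [1, 2, 4]])
print(score)
```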
Evaluation Procedure
- Score Computation:
- Prepare model responses in JSONL format, then run the script calculate_scores.py to obtain heatmaps and scores.
- Example command: python calculate_scores.py --outputs-dir /path/to/your/responses
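Writing the responses file is straightforward: one JSON dictionary per line. The field names below (question_id, answer) are an assumption for illustration; check the official repository for the exact schema calculate_scores.py expects.

```python
import json
import os

# Hypothetical model responses; field names are assumptions, not the
# confirmed schema of calculate_scores.py.
responses = [
    {"question_id": 0, "answer": "B"},
    {"question_id": 1, "answer": [2, 0, 1]},
]

os.makedirs("responses", exist_ok=True)
with open("responses/my-model.jsonl", "w") as f:
    for r in responses:
        f.write(json.dumps(r) + "\n")  # JSONL: one dictionary per line
```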
Data Format
- Data Structure:
- id: Integer starting from 0; each task type has an independent ID.
- images_list: List of length N, each element being the relative path of an image.
- context: Multimodal document, using <image> as an image placeholder.
- question: The question.
- answer: Ground‑truth answer, which may be a string, integer, or list.
- meta: Records various statistics, including insertion depth, context length, token counts for text and images, number of images, inserted needles, candidate textual and visual answers, etc.
Notes
- Note 1: The number of <image> tokens in the context and question equals the length of images_list.
- Note 2: Save as JSONL files, each line being a dictionary.
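The data format and the notes above can be illustrated with a single hypothetical record. All values here are invented for illustration (the real annotation files use their own IDs, paths, and meta fields); the snippet also checks the placeholder-count invariant from Note 1.

```python
import json

# One invented JSONL line following the documented field layout.
line = json.dumps({
    "id": 0,
    "images_list": ["images/0001.jpg", "images/0002.jpg"],
    "context": "Intro text <image> middle text <image> end.",
    "question": "Which image contains the inserted needle?",
    "answer": "A",
    "meta": {"context_length": 12345, "num_images": 2},  # illustrative keys only
})

sample = json.loads(line)
# Note 1: <image> placeholders in context + question must match images_list.
n_placeholders = (sample["context"].count("<image>")
                  + sample["question"].count("<image>"))
assert n_placeholders == len(sample["images_list"])
```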
Source
- Organization: hugging_face