OpenGVLab/MM-NIAH
Needle In A Multimodal Haystack (MM‑NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) in understanding long multimodal documents. The benchmark requires models to answer specific questions based on key information scattered throughout multimodal documents. MM‑NIAH's evaluation data comprises three tasks: retrieval, counting, and reasoning. Key information (called “needles”) is embedded in the document's text or images; those inserted into text are referred to as text needles, and those inserted into images as image needles. Experimental results indicate that current MLLMs perform poorly when handling image‑based key information.
Description
Dataset Overview
Basic Information
- License: MIT
- Task Category: Question Answering
- Language: English
- Size: 10K < n < 100K
Configuration Details
- Config Name: val
- Data Files:
  - Split: val
    - Path: mm_niah_val/annotations/reasoning-text.jsonl
  - Split: test
    - Path: mm_niah_test/annotations/reasoning-text.jsonl
Dataset Introduction
- Name: Needle In A Multimodal Haystack (MM‑NIAH)
- Purpose: Evaluate the ability of existing multimodal large language models (MLLMs) to understand long multimodal documents.
- Task Types: Three tasks, namely retrieval, counting, and reasoning.
- Data Structure: The "needles" inserted into documents can be either text or images, referred to as text needles and image needles respectively.
Main Findings
- State‑of‑the‑art MLLMs (e.g., Gemini‑1.5) still struggle with understanding multimodal documents.
- All MLLMs perform poorly on image needles.
- MLLMs cannot accurately identify the number of images within a document.
- Models pretrained on image‑text interleaved data do not exhibit superior performance.
- Training on background documents does not improve performance on MM‑NIAH.
- A "lost in the middle" issue exists within MLLMs: performance degrades when needles are placed in the middle of the context.
- Long‑context capability of LLMs is not retained in MLLMs.
- Retrieval‑augmented generation (RAG) improves performance on text‑needle retrieval but not on image‑needle retrieval.
- Placing the question before the context does not boost model performance.
- Human performance on MM‑NIAH is near‑perfect.
Experimental Results
- Evaluation Metrics:
- Retrieval and Reasoning Tasks: Accuracy.
- Counting Task: Soft Accuracy, defined as $\frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{M_i}$, where $m_i$ is the number of positions at which the predicted list matches the ground-truth list for sample $i$, $M_i$ is the number of elements in the ground-truth list for sample $i$, and $N$ is the number of samples.
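The Soft Accuracy formula above can be sketched directly in Python. This is a minimal illustration of the metric as defined here, not the official scoring code; edge-case handling in calculate_scores.py may differ.

```python
def soft_accuracy(preds, gts):
    """Soft Accuracy for the counting task: for each sample i, the fraction
    m_i / M_i of positions where the predicted list matches the ground-truth
    list, averaged over all N samples."""
    total = 0.0
    for pred, gt in zip(preds, gts):
        # m_i: positionwise matches between prediction and ground truth
        matched = sum(p == g for p, g in zip(pred, gt))
        # M_i: number of elements in the ground-truth list
        total += matched / len(gt)
    return total / len(preds)

# First sample matches fully (2/2), second matches at 2 of 3 positions.
score = soft_accuracy([[2, 3], [1, 1, 4]], [[2, 3], [1, 2, 4]])
print(score)
```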
Evaluation Procedure
- Score Computation:
- Prepare model responses in JSONL format, then run the script calculate_scores.py to obtain heatmaps and scores.
- Example command: python calculate_scores.py --outputs-dir /path/to/your/responses
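Writing the responses file is straightforward: one JSON dictionary per line. The field names below (question_id, answer) are an assumption for illustration; check the official repository for the exact schema calculate_scores.py expects.

```python
import json
import os

# Hypothetical model responses; field names are assumptions, not the
# confirmed schema of calculate_scores.py.
responses = [
    {"question_id": 0, "answer": "B"},
    {"question_id": 1, "answer": [2, 0, 1]},
]

os.makedirs("responses", exist_ok=True)
with open("responses/my-model.jsonl", "w") as f:
    for r in responses:
        f.write(json.dumps(r) + "\n")  # JSONL: one dictionary per line
```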
Data Format
- Data Structure:
- id: Integer starting from 0; each task type has an independent ID.
- images_list: List of length N, each element being the relative path of an image.
- context: Multimodal document, using <image> as an image placeholder.
- question: The question.
- answer: Ground‑truth answer, which may be a string, integer, or list.
- meta: Records various statistics, including insertion depth, context length, token counts for text and images, number of images, inserted needles, candidate textual and visual answers, etc.
Notes
- Note 1: The number of <image> tokens in the context and question equals the length of images_list.
- Note 2: Save as JSONL files, each line being a dictionary.
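The data format and the notes above can be illustrated with a single hypothetical record. All values here are invented for illustration (the real annotation files use their own IDs, paths, and meta fields); the snippet also checks the placeholder-count invariant from Note 1.

```python
import json

# One invented JSONL line following the documented field layout.
line = json.dumps({
    "id": 0,
    "images_list": ["images/0001.jpg", "images/0002.jpg"],
    "context": "Intro text <image> middle text <image> end.",
    "question": "Which image contains the inserted needle?",
    "answer": "A",
    "meta": {"context_length": 12345, "num_images": 2},  # illustrative keys only
})

sample = json.loads(line)
# Note 1: <image> placeholders in context + question must match images_list.
n_placeholders = (sample["context"].count("<image>")
                  + sample["question"].count("<image>"))
assert n_placeholders == len(sample["images_list"])
```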
Source
- Organization: hugging_face