
OpenGVLab/MM-NIAH

Needle In A Multimodal Haystack (MM‑NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) in understanding long multimodal documents. The benchmark requires models to answer specific questions based on key information scattered throughout multimodal documents. MM‑NIAH's evaluation data comprises three tasks: retrieval, counting, and reasoning. Key information (called “needles”) is embedded in the document's text or images; those inserted into text are referred to as text needles, and those inserted into images as image needles. Experimental results indicate that current MLLMs perform poorly when handling image‑based key information.

Updated 6/17/2024

Description

Dataset Overview

Basic Information

  • License: MIT
  • Task Category: Question Answering
  • Language: English
  • Size: 10K < n < 100K

Configuration Details

  • Config Name: val
  • Data Files:
    • Split: val
      • Path: mm_niah_val/annotations/reasoning-text.jsonl
    • Split: test
      • Path: mm_niah_test/annotations/reasoning-text.jsonl

Dataset Introduction

  • Name: Needle In A Multimodal Haystack (MM‑NIAH)
  • Purpose: Evaluate the ability of existing multimodal large language models (MLLMs) to understand long multimodal documents.
  • Task Types: Three tasks: retrieval, counting, and reasoning.
  • Data Structure: The “needles” inserted into documents can be either text or images, referred to as text needles and image needles respectively.

Main Findings

  • State‑of‑the‑art MLLMs (e.g., Gemini‑1.5) still struggle with understanding multimodal documents.
  • All MLLMs perform poorly on image needles.
  • MLLMs cannot accurately identify the number of images within a document.
  • Models pretrained on image‑text interleaved data do not exhibit superior performance.
  • Training on background documents does not improve performance on MM‑NIAH.
  • A “lost in the middle” problem exists within MLLMs.
  • Long‑context capability of LLMs is not retained in MLLMs.
  • Retrieval‑augmented generation (RAG) improves performance on text‑needle retrieval but not on image‑needle retrieval.
  • Placing the question before the context does not boost model performance.
  • Human performance on MM‑NIAH is near‑perfect.

Experimental Results

  • Evaluation Metrics:
    • Retrieval and Reasoning Tasks: Accuracy.
    • Counting Task: Soft Accuracy, defined as $\frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{M_i}$, where $m_i$ is the number of elements in the predicted list that match the ground‑truth list at the corresponding positions, and $M_i$ is the number of elements in the ground‑truth list for sample $i$.
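The soft accuracy metric above can be sketched as follows; this assumes that "matched" means exact equality at corresponding list positions:

```python
def soft_accuracy(predictions, ground_truths):
    """Average over samples of m_i / M_i, where m_i counts positions at
    which the predicted list matches the ground-truth list, and M_i is
    the length of the ground-truth list."""
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        # m_i: positions where the prediction agrees with the ground truth
        matched = sum(1 for p, g in zip(pred, gt) if p == g)
        # M_i: number of elements in the ground-truth list
        total += matched / len(gt)
    return total / len(predictions)
```

Note that a prediction shorter than the ground truth is penalized automatically, since unmatched trailing positions contribute nothing to $m_i$.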

Evaluation Procedure

  • Score Computation:
    • Prepare model responses in JSONL format, then run the script calculate_scores.py to obtain heatmaps and scores.
    • Example command: python calculate_scores.py --outputs-dir /path/to/your/responses
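Before running the script, model responses can be dumped to JSONL along these lines. The field names below are illustrative assumptions, not confirmed keys; check what calculate_scores.py actually expects before producing files in this shape:

```python
import json

# Illustrative only: the exact keys expected by calculate_scores.py may differ.
responses = [
    {"id": 0, "response": "A"},
    {"id": 1, "response": "2"},
]

# JSONL: one JSON dictionary per line.
with open("responses.jsonl", "w") as f:
    for r in responses:
        f.write(json.dumps(r) + "\n")
```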

Data Format

  • Data Structure:
    • id: Integer starting from 0; each task type has an independent ID.
    • images_list: List of length N, each element being the relative path of an image.
    • context: Multimodal document, using <image> as an image placeholder.
    • question: The question.
    • answer: Ground‑truth answer, which may be a string, integer, or list.
    • meta: Records various statistics, including insertion depth, context length, token counts for text and images, number of images, inserted needles, candidate textual and visual answers, etc.

Notes

  • Note 1: The number of <image> tokens in the context and question equals the length of images_list.
  • Note 2: Files are saved in JSONL format, one dictionary per line.
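The invariant in Note 1 can be verified per sample with a small check, using the field names from the data structure above:

```python
def check_image_placeholders(sample):
    """Return True if the number of <image> placeholders in the context
    and question equals the number of entries in images_list (Note 1)."""
    count = (sample["context"].count("<image>")
             + sample["question"].count("<image>"))
    return count == len(sample["images_list"])
```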

Topics

Multimodal Learning
Question Answering Systems

Source

Organization: hugging_face
