MM-Vet v2
The MM‑Vet v2 dataset was jointly created by the National University of Singapore, Microsoft, and Advanced Micro Devices to evaluate the comprehensive capabilities of large multimodal models. It comprises 517 high‑quality evaluation samples covering a wide range of scenarios from everyday life to professional/industrial applications. The creation process involved researchers designing questions and collecting reference answers, ensuring high quality and broad applicability. MM‑Vet v2 specifically introduces an "image‑text sequence understanding" ability to assess a model's capacity to handle combined image and text‑sequence data, addressing complex task handling in real‑world multimodal applications.
Description
MM‑Vet Dataset Overview
Dataset Introduction
MM‑Vet is used to evaluate the integrated capabilities of large multimodal models, covering core visual‑language abilities such as recognition, OCR, knowledge, language generation, spatial perception, and mathematics.
Dataset Versions
- MM‑Vet v2: Extends MM‑Vet with the added "image‑text sequence understanding" ability, enlarges the evaluation set while maintaining high quality.
Dataset Download
The dataset can be downloaded from the following link: Download Dataset
Dataset Evaluation
Evaluation Steps
- Install dependencies: run `pip install "openai>=1"` to install the OpenAI package, and obtain access to the GPT‑4/GPT‑3.5 API.
- Download dataset: download and unzip the dataset from the link above.
- Model inference: Run the provided inference script and save results in JSON format.
- Evaluate model: Use the supplied evaluation script to assess model outputs.
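The inference step above saves model outputs as a JSON file that the evaluation script consumes. As a minimal sketch, assuming the result file maps question IDs to answer strings (the exact ID format and schema are defined by the official scripts, not confirmed here):

```python
import json
import os

# Hypothetical result format: question ID -> the model's answer string.
# The actual schema is defined by the official MM-Vet v2 evaluation script.
results = {
    "v2_0": "There are 5 tomatoes.",
    "v2_1": "This meme expresses shock or surprise.",
}

os.makedirs("results", exist_ok=True)
with open("results/my_model.json", "w") as f:
    json.dump(results, f, indent=2)
```

The resulting file path would then be passed to the evaluator via `--result_file`.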
Inference Script Examples
image_detail=high # or auto, low (see https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding)
python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}
python inference/gemini_vision.py --mmvet_path /path/to/mm-vet
Evaluation Script Example
python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json
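The evaluator scores each sample with an LLM judge and reports averages. As a hedged sketch of the aggregation step only (the per-sample fields and scores below are illustrative assumptions, not the official output format):

```python
# Illustrative per-sample judge scores in [0.0, 1.0]; field names are assumed.
samples = [
    {"id": "v2_0", "score": 1.0, "capabilities": ["rec"]},
    {"id": "v2_1", "score": 0.5, "capabilities": ["rec", "know", "gen"]},
]

# Overall score: mean per-sample score, reported as a percentage.
overall = 100.0 * sum(s["score"] for s in samples) / len(samples)

# Per-capability breakdown: average over the samples requiring that capability.
per_cap: dict[str, list[float]] = {}
for s in samples:
    for cap in s["capabilities"]:
        per_cap.setdefault(cap, []).append(s["score"])
capability_scores = {cap: 100.0 * sum(v) / len(v) for cap, v in per_cap.items()}
```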
Dataset Samples
The dataset contains multiple samples, each comprising a question, a ground‑truth answer, and the visual‑language capabilities required to solve it. A few examples:
Sample 1
Q: What occasions would someone use this meme?
GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear.
Required capabilities: Recognition, knowledge, language generation
Sample 2
Q: How many tomatoes are there?
GT: 5
Required capabilities: Recognition
... (additional samples omitted for brevity) ...
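Samples like the ones above can be filtered by required capability. A minimal sketch, assuming the metadata layout of the original MM‑Vet release (ID to fields, with a comma‑separated `capability` string); the field names here are assumptions for v2:

```python
# Assumed metadata layout: question ID -> fields, with capabilities as a
# comma-separated string (mirrors the original MM-Vet release).
metadata = {
    "v2_0": {"question": "How many tomatoes are there?",
             "answer": "5", "capability": "rec"},
    "v2_1": {"question": "What occasions would someone use this meme?",
             "answer": "Shock, surprise, or fear.",
             "capability": "rec,know,gen"},
}

def samples_with(cap: str) -> list[str]:
    """Return IDs of samples whose required capabilities include `cap`."""
    return [qid for qid, meta in metadata.items()
            if cap in meta["capability"].split(",")]
```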
Source
Organization: arXiv
Created: 8/2/2024