MM-Vet v2
The MM‑Vet v2 dataset was jointly created by the National University of Singapore, Microsoft, and Advanced Micro Devices to evaluate the comprehensive capabilities of large multimodal models. It comprises 517 high‑quality evaluation samples covering a wide range of scenarios from everyday life to professional/industrial applications. The creation process involved researchers designing questions and collecting reference answers, ensuring high quality and broad applicability. MM‑Vet v2 specifically introduces an "image‑text sequence understanding" ability to assess a model's capacity to handle combined image and text‑sequence data, addressing complex task handling in real‑world multimodal applications.
Description
MM‑Vet Dataset Overview
Dataset Introduction
MM‑Vet is used to evaluate the integrated capabilities of large multimodal models, covering core visual‑language abilities such as recognition, OCR, knowledge, language generation, spatial perception, and mathematics.
Dataset Versions
- MM‑Vet v2: Extends MM‑Vet with the added "image‑text sequence understanding" ability, enlarges the evaluation set while maintaining high quality.
Dataset Download
The dataset can be downloaded from the following link: Download Dataset
Dataset Evaluation
Evaluation Steps
- Install dependencies: run `pip install "openai>=1"` to install the OpenAI package, and obtain access to the GPT‑4/GPT‑3.5 API.
- Download dataset: download and unzip the dataset from the link above.
- Model inference: Run the provided inference script and save results in JSON format.
- Evaluate model: Use the supplied evaluation script to assess model outputs.
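The inference step above saves model outputs as a JSON file that the evaluation script consumes. As a minimal sketch, assuming the result file maps question IDs to answer strings (the exact ID format and schema are defined by the official scripts, not confirmed here):

```python
import json
import os

# Hypothetical result format: question ID -> the model's answer string.
# The actual schema is defined by the official MM-Vet v2 evaluation script.
results = {
    "v2_0": "There are 5 tomatoes.",
    "v2_1": "This meme expresses shock or surprise.",
}

os.makedirs("results", exist_ok=True)
with open("results/my_model.json", "w") as f:
    json.dump(results, f, indent=2)
```

The resulting file path would then be passed to the evaluator via `--result_file`.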
Inference Script Examples
image_detail=high # or auto, low (see https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding)
python inference/gpt4v.py --mmvet_path /path/to/mm-vet --image_detail ${image_detail}
python inference/gemini_vision.py --mmvet_path /path/to/mm-vet
Evaluation Script Example
python mm-vet_evaluator.py --mmvet_path /path/to/mm-vet --result_file results/llava_llama2_13b_chat.json
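The evaluator scores each sample with an LLM judge and reports averages. As a hedged sketch of the aggregation step only (the per-sample fields and scores below are illustrative assumptions, not the official output format):

```python
# Illustrative per-sample judge scores in [0.0, 1.0]; field names are assumed.
samples = [
    {"id": "v2_0", "score": 1.0, "capabilities": ["rec"]},
    {"id": "v2_1", "score": 0.5, "capabilities": ["rec", "know", "gen"]},
]

# Overall score: mean per-sample score, reported as a percentage.
overall = 100.0 * sum(s["score"] for s in samples) / len(samples)

# Per-capability breakdown: average over the samples requiring that capability.
per_cap: dict[str, list[float]] = {}
for s in samples:
    for cap in s["capabilities"]:
        per_cap.setdefault(cap, []).append(s["score"])
capability_scores = {cap: 100.0 * sum(v) / len(v) for cap, v in per_cap.items()}
```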
Dataset Samples
The dataset contains multiple samples, each comprising a question, a ground‑truth answer, and the visual‑language capabilities required to solve it. A few examples:
Sample 1
Q: What occasions would someone use this meme?
GT: This meme, commonly known as "Screaming Panda," is typically used to express shock, surprise, or fear.
Required capabilities: Recognition, knowledge, language generation
Sample 2
Q: How many tomatoes are there?
GT: 5
Required capabilities: Recognition
... (additional samples omitted for brevity) ...
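Samples like the ones above can be filtered by required capability. A minimal sketch, assuming the metadata layout of the original MM‑Vet release (ID to fields, with a comma‑separated `capability` string); the field names here are assumptions for v2:

```python
# Assumed metadata layout: question ID -> fields, with capabilities as a
# comma-separated string (mirrors the original MM-Vet release).
metadata = {
    "v2_0": {"question": "How many tomatoes are there?",
             "answer": "5", "capability": "rec"},
    "v2_1": {"question": "What occasions would someone use this meme?",
             "answer": "Shock, surprise, or fear.",
             "capability": "rec,know,gen"},
}

def samples_with(cap: str) -> list[str]:
    """Return IDs of samples whose required capabilities include `cap`."""
    return [qid for qid, meta in metadata.items()
            if cap in meta["capability"].split(",")]
```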
Source
Organization: arXiv
Created: 8/2/2024