MATH-Vision
Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Dataset Overview
Dataset Name
- MATH-Vision (MATH-V) Dataset
Description
- The MATH-Vision (MATH-V) dataset is a collection of 3,040 high-quality mathematics problems with visual contexts, sourced from real competition problems. The dataset spans 16 mathematical domains and is stratified into five difficulty levels, providing a comprehensive benchmark for evaluating large multimodal models (LMMs) on mathematical reasoning.
Features
- Multimodal Mathematical Reasoning: Designed to assess models’ ability to reason mathematically with visual inputs.
- Broad Topic Coverage: Includes 16 domains such as analytic geometry, topology, and graph theory.
- Multiple Difficulty Levels: Problems are categorized into five levels from easy to hard.
Usage
- Model Evaluation: Used to evaluate models such as GPT-4, GPT-4V, and Gemini on mathematical reasoning tasks.
- Research Tool: Provides evaluation code and data to support further research in multimodal mathematical reasoning.
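A typical evaluation loop compares a model's free-form response against the problem's ground-truth answer. The sketch below is a minimal illustration of that step, not the repository's official evaluation script; the extraction heuristic is an assumption made for the example.

```python
import re


def extract_answer(response: str) -> str:
    """Pull a final answer out of a free-form model response.

    Heuristic (illustrative assumption, not MATH-V's official extractor):
    take whatever follows the last "answer is" phrase; otherwise fall back
    to the last number or multiple-choice letter in the text.
    """
    matches = re.findall(r"answer is\s*:?\s*([A-Ea-e]|-?\d+(?:\.\d+)?)", response)
    if matches:
        return matches[-1].strip().upper()
    tokens = re.findall(r"-?\d+(?:\.\d+)?|\b[A-E]\b", response)
    return tokens[-1].upper() if tokens else ""


def is_correct(response: str, ground_truth: str) -> bool:
    """Score a response by exact match on the extracted answer."""
    return extract_answer(response) == ground_truth.strip().upper()
```

In practice the published evaluation scripts handle more answer formats (fractions, expressions, units); this sketch only shows the shape of the extract-then-compare step.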
Access
- Dataset Link: Available via HuggingFace.
Related Work
- Paper: Details of dataset construction and evaluation can be found on arXiv.
Example
- Sample Content: Includes specific problems from fields such as analytic geometry, topology, and graph theory. Detailed examples are provided in Appendix D.3 of the paper.
Evaluation & Results
- Model Performance: As of the latest update, GPT-4o scores 30.39% on MATH-V, while human performance is around 70%.
- Evaluation Tools: Scripts are provided to compute accuracy and performance across disciplines and difficulty levels.
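Reporting accuracy across disciplines and difficulty levels amounts to grouping graded results by their metadata fields. A minimal sketch of that aggregation, assuming each graded record carries `subject`, `level`, and a boolean `correct` field (these field names are illustrative, not the repository's actual schema):

```python
from collections import defaultdict


def accuracy_breakdown(results, key):
    """Compute accuracy grouped by a metadata field such as 'subject' or 'level'."""
    hits = defaultdict(int)     # correct answers per group
    totals = defaultdict(int)   # attempted problems per group
    for record in results:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical graded results for illustration only.
results = [
    {"subject": "analytic geometry", "level": 2, "correct": True},
    {"subject": "analytic geometry", "level": 4, "correct": False},
    {"subject": "graph theory", "level": 1, "correct": True},
]
by_subject = accuracy_breakdown(results, "subject")  # e.g. {"analytic geometry": 0.5, ...}
by_level = accuracy_breakdown(results, "level")
```

The same pattern extends to any of the 16 disciplines or 5 difficulty levels once real graded results are available.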
Citation
- BibTeX:
@misc{wang2024measuring,
      title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
      author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2024},
      eprint={2402.14804},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Source
Organization: GitHub
Created: 2/17/2024