Dataset asset · Open Source Community · Multimodal Learning · Mathematical Reasoning

mathvision

Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.

Source: GitHub
Created: Feb 17, 2024
Updated: Feb 24, 2024
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • MATH‑Vision (MATH‑V) Dataset

Description

  • The MATH‑Vision (MATH‑V) dataset is a meticulously curated collection of 3,040 high‑quality mathematics problems with visual contexts, sourced from real math competitions. It spans 16 distinct mathematical disciplines and is graded across five difficulty levels, providing a comprehensive benchmark for evaluating large multimodal models (LMMs) on mathematical reasoning.

Features

  • Multimodal Mathematical Reasoning: Designed to assess models’ ability to reason mathematically with visual inputs.
  • Broad Topic Coverage: Includes 16 domains such as analytic geometry, topology, and graph theory.
  • Multiple Difficulty Levels: Problems are categorized into five levels from easy to hard.

Usage

  • Model Evaluation: Used to benchmark models such as GPT‑4, GPT‑4V, and Gemini on visual mathematical reasoning tasks (a loading sketch follows this list).
  • Research Tool: Provides evaluation code and data to support further research in multimodal mathematical reasoning.
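
A minimal loading sketch, assuming the dataset is mirrored on the Hugging Face Hub under the identifier MathLLMs/MathVision; both the identifier and the field names below are assumptions, so verify them against the linked GitHub repository:

    # Minimal sketch: load MATH-V and iterate over problems for evaluation.
    # Assumes a Hugging Face Hub mirror named "MathLLMs/MathVision" and the
    # field names below -- verify both against the official repository.
    from datasets import load_dataset

    ds = load_dataset("MathLLMs/MathVision", split="test")

    for sample in ds.select(range(3)):
        print(sample["question"])  # problem statement (assumed field name)
        print(sample["answer"])    # ground-truth answer (assumed field name)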

Access

  • Code & Data: The problems and evaluation code are available through the linked GitHub source.

Related Work

  • Paper: Details of dataset construction and evaluation can be found on arXiv (arXiv:2402.14804).

Example

  • Sample Content: Includes specific problems from fields such as analytic geometry, topology, and graph theory. Detailed examples are provided in Appendix D.3 of the paper; a hypothetical record shape is sketched below.
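
For illustration, a single record might be shaped like the following. This is a hypothetical layout (every key name here is an assumption), so consult Appendix D.3 and the repository's data files for the real schema:

    # Hypothetical shape of one MATH-V record, for illustration only.
    example = {
        "id": "1",
        "question": "As shown in the figure, what is the area of the shaded region?",
        "image": "images/1.jpg",               # path to the accompanying figure
        "options": ["A", "B", "C", "D", "E"],  # present for multiple-choice items
        "answer": "C",                         # ground-truth label or value
        "subject": "analytic geometry",        # one of the 16 disciplines
        "level": 2,                            # difficulty from 1 (easy) to 5 (hard)
    }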

Evaluation & Results

  • Model Performance: As of the latest update, GPT‑4o scores 30.39% on MATH‑V, while human performance is around 70%.
  • Evaluation Tools: Scripts are provided to compute accuracy and performance across disciplines and difficulty levels; a sketch of the core computation follows this list.
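
The repository ships its own scoring scripts; the snippet below is only a sketch of the computation they expose (exact-match accuracy grouped by discipline or difficulty level), not the official implementation, and the record fields are assumptions:

    # Sketch: exact-match accuracy per group (e.g. subject or level).
    # `records` is an iterable of dataset rows; `predictions` maps
    # problem id -> model answer. Field names are assumptions.
    from collections import defaultdict

    def accuracy_by_group(records, predictions, key):
        correct, total = defaultdict(int), defaultdict(int)
        for r in records:
            group = r[key]  # e.g. key="subject" or key="level"
            total[group] += 1
            if predictions.get(r["id"], "").strip() == str(r["answer"]).strip():
                correct[group] += 1
        return {g: correct[g] / total[g] for g in total}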

Citation

  • BibTeX:
    @misc{wang2024measuring,
          title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
          author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
          year={2024},
          eprint={2402.14804},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }
    