MATH-Vision
Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.
Dataset Overview
Dataset Name
- MATH-Vision (MATH-V) Dataset
Description
- The MATH-Vision (MATH-V) dataset is a collection of 3,040 high-quality mathematics problems with visual contexts, sourced from real competition problems. The dataset spans 16 mathematical domains and is stratified into five difficulty levels, providing a comprehensive benchmark for evaluating large multimodal models (LMMs) on mathematical reasoning.
Features
- Multimodal Mathematical Reasoning: Designed to assess models’ ability to reason mathematically with visual inputs.
- Broad Topic Coverage: Includes 16 domains such as analytic geometry, topology, and graph theory.
- Multiple Difficulty Levels: Problems are categorized into five levels from easy to hard.
Usage
- Model Evaluation: Used to evaluate models such as GPT-4, GPT-4V, and Gemini on mathematical reasoning tasks.
- Research Tool: Provides evaluation code and data to support further research in multimodal mathematical reasoning.
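A typical evaluation loop compares a model's free-form response against the problem's ground-truth answer. The sketch below is a minimal illustration of that step, not the repository's official evaluation script; the extraction heuristic is an assumption made for the example.

```python
import re


def extract_answer(response: str) -> str:
    """Pull a final answer out of a free-form model response.

    Heuristic (illustrative assumption, not MATH-V's official extractor):
    take whatever follows the last "answer is" phrase; otherwise fall back
    to the last number or multiple-choice letter in the text.
    """
    matches = re.findall(r"answer is\s*:?\s*([A-Ea-e]|-?\d+(?:\.\d+)?)", response)
    if matches:
        return matches[-1].strip().upper()
    tokens = re.findall(r"-?\d+(?:\.\d+)?|\b[A-E]\b", response)
    return tokens[-1].upper() if tokens else ""


def is_correct(response: str, ground_truth: str) -> bool:
    """Score a response by exact match on the extracted answer."""
    return extract_answer(response) == ground_truth.strip().upper()
```

In practice the published evaluation scripts handle more answer formats (fractions, expressions, units); this sketch only shows the shape of the extract-then-compare step.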
Access
- Dataset Link: Available via HuggingFace.
Related Work
- Paper: Details of dataset construction and evaluation can be found on arXiv.
Example
- Sample Content: Includes specific problems from fields such as analytic geometry, topology, and graph theory. Detailed examples are provided in Appendix D.3 of the paper.
Evaluation & Results
- Model Performance: As of the latest update, GPT-4o scores 30.39% on MATH-V, while human performance is around 70%.
- Evaluation Tools: Scripts are provided to compute accuracy and performance across disciplines and difficulty levels.
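Reporting accuracy across disciplines and difficulty levels amounts to grouping graded results by their metadata fields. A minimal sketch of that aggregation, assuming each graded record carries `subject`, `level`, and a boolean `correct` field (these field names are illustrative, not the repository's actual schema):

```python
from collections import defaultdict


def accuracy_breakdown(results, key):
    """Compute accuracy grouped by a metadata field such as 'subject' or 'level'."""
    hits = defaultdict(int)     # correct answers per group
    totals = defaultdict(int)   # attempted problems per group
    for record in results:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical graded results for illustration only.
results = [
    {"subject": "analytic geometry", "level": 2, "correct": True},
    {"subject": "analytic geometry", "level": 4, "correct": False},
    {"subject": "graph theory", "level": 1, "correct": True},
]
by_subject = accuracy_breakdown(results, "subject")  # e.g. {"analytic geometry": 0.5, ...}
by_level = accuracy_breakdown(results, "level")
```

The same pattern extends to any of the 16 disciplines or 5 difficulty levels once real graded results are available.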
Citation
- BibTeX:
@misc{wang2024measuring,
      title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
      author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2024},
      eprint={2402.14804},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Source
Organization: GitHub
Created: 2/17/2024