
allenai/math_qa

We introduce MathQA, a large‑scale dataset of math word problems. It was built by annotating the AQuA‑RAT dataset with fully specified operational programs written in a new representation language. AQuA‑RAT supplies the problems, answer options, rationales, and correct answers.

Source

hugging_face

Created

Nov 28, 2025

Updated

Jan 18, 2024

Dataset Overview

Dataset Summary

Name: MathQA
Language: English
Creator: Crowdsourced and expert generated
License: Apache‑2.0
Multilinguality: Monolingual
Size: 10 K < n < 100 K
Source Dataset: Extended from AQuA‑RAT
Task Type: Question Answering
Task ID: Multiple‑choice QA
Paper ID: mathqa

Data Structure

Data Instances

An example from the training set:

{
    "Problem": "a multiple choice test consists of 4 questions , and each question has 5 answer choices . in how many r ways can the test be completed if every question is unanswered ?",
    "Rationale": "\"5 choices for each of the 4 questions , thus total r of 5 * 5 * 5 * 5 = 5 ^ 4 = 625 ways to answer all of them . answer : c .\"",
    "annotated_formula": "power(5, 4)",
    "category": "general",
    "correct": "c",
    "linear_formula": "power(n1,n0)|",
    "options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024"
}
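
The options field packs every choice into a single string. The following is a minimal sketch of splitting it into a label-to-text mapping, assuming all records follow the "a ) ... , b ) ..." layout of the example above. The helper name parse_options is hypothetical and is not part of the dataset or of any library.

```python
import re

def parse_options(options: str) -> dict:
    """Split an options string into a {label: text} dict.

    Assumes labels are the single letters a-e followed by ')', as in the
    example record; other layouts are not handled.
    """
    # Split only on commas that are followed by the next label, so a comma
    # inside an option's text is less likely to start a new option.
    parts = re.split(r"\s*,\s*(?=[a-e]\s*\))", options.strip())
    parsed = {}
    for part in parts:
        match = re.match(r"([a-e])\s*\)\s*(.*)", part)
        if match is None:
            raise ValueError(f"unrecognized option: {part!r}")
        parsed[match.group(1)] = match.group(2).strip()
    return parsed

# Values copied from the example record above.
record = {
    "options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024",
    "correct": "c",
}
choices = parse_options(record["options"])
answer_text = choices[record["correct"]]  # "625"
```

Keeping the labels as dictionary keys lets the correct field be used directly as a lookup key, without converting letters to positions.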

Data Fields

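Problem: text of the word problem (string)
Rationale: free-text reasoning that leads to the answer (string)
options: all answer options in a single string, labeled a through e (string)
correct: label of the correct option, e.g. c (string)
annotated_formula: solution program as a nested operation expression over the problem's numbers, e.g. power(5, 4) (string)
linear_formula: the same program in linearized form with operands written as placeholders such as n0 and n1, e.g. power(n1,n0)| (string)
category: category label, e.g. general (string)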
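
To illustrate how an annotated_formula value relates to the answer, here is a small evaluator sketch. It assumes the formula is a nested call expression over numeric literals, as in the example power(5, 4). The operation table below is an illustrative subset chosen for this sketch, not the dataset's full operation set, and evaluate_formula is a hypothetical helper.

```python
import ast
import operator

# Assumed subset of operation names, for illustration only.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "power": operator.pow,
}

def _eval_node(node):
    # Numeric literals evaluate to themselves.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    # Calls such as power(5, 4) are dispatched through the OPS table.
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        name = node.func.id
        if name not in OPS:
            raise ValueError(f"unsupported operation: {name}")
        args = [_eval_node(arg) for arg in node.args]
        return OPS[name](*args)
    raise ValueError("unsupported expression")

def evaluate_formula(formula: str):
    """Evaluate a nested operation expression without using eval()."""
    tree = ast.parse(formula.strip(), mode="eval")
    return _eval_node(tree.body)

result = evaluate_formula("power(5, 4)")  # 625
```

For the example record this yields 625, which matches option c. Walking the syntax tree instead of calling eval() keeps the evaluator restricted to the listed operations.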

Data Splits

Split	Train	Validation	Test
Size	29,837	4,475	2,985

Dataset Creation

Dataset Information

Download Size: 7,302,821 bytes
Dataset Size: 22,965,979 bytes

Split Details

Test Set: 1,844,184 bytes, 2,985 samples
Train Set: 18,368,826 bytes, 29,837 samples
Validation Set: 2,752,969 bytes, 4,475 samples
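
The figures above can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers listed in this card; the dictionary layout is an illustrative choice.

```python
# Sample counts and byte sizes copied from the split details above.
splits = {
    "train": {"samples": 29837, "bytes": 18368826},
    "validation": {"samples": 4475, "bytes": 2752969},
    "test": {"samples": 2985, "bytes": 1844184},
}

total_samples = sum(s["samples"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

# The per-split byte sizes should add up to the reported dataset size.
assert total_bytes == 22965979

for name, s in splits.items():
    share = s["samples"] / total_samples
    print(f"{name}: {share:.1%} of {total_samples} samples")
```

The three splits total 37,297 samples, divided roughly 80/12/8 between train, validation, and test.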

License Information

The dataset follows the Apache License, Version 2.0.

Citation

@inproceedings{amini-etal-2019-mathqa,
    title = "{M}ath{QA}: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms",
    author = "Amini, Aida  and
      Gabriel, Saadia  and
      Lin, Shanchuan  and
      Koncel-Kedziorski, Rik  and
      Choi, Yejin  and
      Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1245",
    doi = "10.18653/v1/N19-1245",
    pages = "2357--2367",
}
