Dataset asset, Open Source Community, Natural Language Processing, Math Problem Solving

allenai/math_qa

We introduce a large‑scale dataset of mathematical word problems. By annotating the AQuA‑RAT dataset with a novel representation language, we generate fully specified procedural programs. AQuA‑RAT provides the problem, options, rationale, and correct answer.

  • Source: hugging_face
  • Created: Nov 28, 2025
  • Updated: Jan 18, 2024

Dataset Overview

Dataset Summary

  • Name: MathQA
  • Language: English
  • Creator: Crowdsourced and expert generated
  • License: Apache‑2.0
  • Multilinguality: Monolingual
  • Size: 10 K < n < 100 K
  • Source Dataset: Extended from AQuA‑RAT
  • Task Type: Question Answering
  • Task ID: Multiple‑choice QA
  • Paper ID: mathqa

Data Structure

Data Instances

An example from the training set:

{
    "Problem": "a multiple choice test consists of 4 questions , and each question has 5 answer choices . in how many r ways can the test be completed if every question is unanswered ?",
    "Rationale": "\"5 choices for each of the 4 questions , thus total r of 5 * 5 * 5 * 5 = 5 ^ 4 = 625 ways to answer all of them . answer : c .\"",
    "annotated_formula": "power(5, 4)",
    "category": "general",
    "correct": "c",
    "linear_formula": "power(n1,n0)|",
    "options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024"
}
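
The `linear_formula` field in the instance above can be read as a small program. The sketch below evaluates it under assumptions that this card does not state: `n0`, `n1`, ... are taken to be the numbers in the problem text in order of appearance, `#k` is taken to be the result of step `k`, steps are separated by `|`, and `const_X` is a literal constant with `_` as the decimal separator. The helper names (`extract_numbers`, `resolve`, `run_linear_formula`) are hypothetical, and only a few operations are implemented.

```python
import re

# Hypothetical subset of operations; the dataset's real operation set is larger.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "power": lambda a, b: a ** b,
}

def extract_numbers(problem):
    """Numbers in the problem text, in order of appearance (assumed to be n0, n1, ...)."""
    return [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", problem)]

def resolve(arg, numbers, results):
    """Map one argument token to a value, following the assumptions in the lead-in."""
    arg = arg.strip()
    if arg.startswith("const_"):
        return float(arg[len("const_"):].replace("_", "."))
    if arg.startswith("n"):
        return numbers[int(arg[1:])]
    if arg.startswith("#"):
        return results[int(arg[1:])]
    raise ValueError(f"unrecognized argument: {arg!r}")

def run_linear_formula(formula, numbers):
    """Evaluate the '|'-separated steps in order and return the last result."""
    results = []
    for step in filter(None, formula.split("|")):
        name, raw_args = re.fullmatch(r"(\w+)\((.*)\)", step.strip()).groups()
        args = [resolve(a, numbers, results) for a in raw_args.split(",")]
        results.append(OPS[name](*args))
    return results[-1]

problem = ("a multiple choice test consists of 4 questions , and each question "
           "has 5 answer choices . in how many r ways can the test be completed "
           "if every question is unanswered ?")
answer = run_linear_formula("power(n1,n0)|", extract_numbers(problem))
print(answer)  # 625.0, which matches option c in the instance above
```

Under these assumptions `n0` is 4 and `n1` is 5, so `power(n1,n0)` evaluates to 5 ^ 4 = 625, in agreement with the rationale.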

Data Fields

  • Problem: problem description (string)
  • Rationale: reasoning process (string)
  • options: answer options (string)
  • correct: correct answer label (string)
  • annotated_formula: annotated formula (string)
  • linear_formula: linear formula (string)
  • category: category label (string)
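
Because `options` is stored as a single string rather than a list, a consumer usually has to split it before comparing against `correct`. The following is a minimal sketch, assuming every record follows the `a ) ... , b ) ...` layout of the example instance; `parse_options` is a hypothetical helper and may need adjusting for records that deviate from that layout.

```python
import re

# Matches a label (a-e), its closing parenthesis, and the text up to the next label or the end.
OPTION_PATTERN = re.compile(r"([a-e])\s*\)\s*(.*?)\s*(?=,\s*[a-e]\s*\)|$)")

def parse_options(options):
    """Split an options string into a {label: text} mapping."""
    return {label: text for label, text in OPTION_PATTERN.findall(options)}

record = {
    "options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024",
    "correct": "c",
}
choices = parse_options(record["options"])
correct_text = choices[record["correct"]]
print(choices)
print(correct_text)  # 625
```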

Data Splits

Split    Train     Validation    Test
Size     29,837    4,475         2,985

Dataset Creation

Dataset Information

  • Download Size: 7,302,821 bytes
  • Dataset Size: 22,965,979 bytes

Split Details

  • Test Set: 1,844,184 bytes, 2,985 samples
  • Train Set: 18,368,826 bytes, 29,837 samples
  • Validation Set: 2,752,969 bytes, 4,475 samples
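
As a quick consistency check, the per-split figures above can be summed and compared with the reported dataset size. This sketch only uses numbers listed in this card; the variable names are illustrative.

```python
# Figures copied from the "Split Details" and "Dataset Information" lists.
splits = {
    "train": {"bytes": 18_368_826, "samples": 29_837},
    "validation": {"bytes": 2_752_969, "samples": 4_475},
    "test": {"bytes": 1_844_184, "samples": 2_985},
}
reported_dataset_bytes = 22_965_979

total_bytes = sum(s["bytes"] for s in splits.values())
total_samples = sum(s["samples"] for s in splits.values())

# The per-split byte counts should add up to the reported dataset size.
sizes_agree = total_bytes == reported_dataset_bytes

for name, s in splits.items():
    share = 100 * s["samples"] / total_samples
    print(f"{name:<10} {s['samples']:>6,} samples  {share:5.1f}%")
print(f"total      {total_samples:>6,} samples, sizes agree: {sizes_agree}")
```

The three splits total 37,297 samples, which is consistent with the 10 K < n < 100 K size category listed in the summary.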

License Information

The dataset is released under the Apache License, Version 2.0.

Citation

@inproceedings{amini-etal-2019-mathqa,
    title = "{M}ath{QA}: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms",
    author = "Amini, Aida  and
      Gabriel, Saadia  and
      Lin, Shanchuan  and
      Koncel-Kedziorski, Rik  and
      Choi, Yejin  and
      Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1245",
    doi = "10.18653/v1/N19-1245",
    pages = "2357--2367",
}