Dataset asset | Open Source Community | Natural Language Processing | Math Problem Solving
allenai/math_qa
We introduce a large-scale dataset of math word problems. The dataset was built by annotating AQuA-RAT with a new representation language, which gives each problem a fully specified operational program. AQuA-RAT supplies the problems, answer options, rationales, and correct answers.
Source
hugging_face
Created
Nov 28, 2025
Updated
Jan 18, 2024
Dataset Overview
Dataset Summary
- Name: MathQA
- Language: English
- Creator: Crowdsourced and expert generated
- License: Apache‑2.0
- Multilinguality: Monolingual
- Size: 10 K < n < 100 K
- Source Dataset: Extended from AQuA‑RAT
- Task Type: Question Answering
- Task ID: Multiple‑choice QA
- Paper ID: mathqa
Data Structure
Data Instances
An example from the training set:
{
"Problem": "a multiple choice test consists of 4 questions , and each question has 5 answer choices . in how many r ways can the test be completed if every question is unanswered ?",
"Rationale": "\"5 choices for each of the 4 questions , thus total r of 5 * 5 * 5 * 5 = 5 ^ 4 = 625 ways to answer all of them . answer : c .\"",
"annotated_formula": "power(5, 4)",
"category": "general",
"correct": "c",
"linear_formula": "power(n1,n0)|",
"options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024"
}
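Because `options` is stored as one flat string, a consumer usually has to split it before it can look up the answer named in `correct`. The following is a minimal sketch of that step. It assumes every record uses the `a ) ... , b ) ...` layout seen in the example above; `parse_options` is a hypothetical helper, not part of the dataset or any library.

```python
import re


def parse_options(options):
    """Split 'a ) 24 , b ) 120' into {'a': '24', 'b': '120'}.

    Assumes each option starts with a single letter a-e followed by ')'.
    """
    # Split only on commas that are directly followed by a new option label,
    # so commas inside an option's text are left alone.
    parts = re.split(r"\s*,\s*(?=[a-e]\s*\))", options.strip())
    parsed = {}
    for part in parts:
        label, _, text = part.partition(")")
        parsed[label.strip()] = text.strip()
    return parsed


# Values copied from the training example above.
record = {
    "options": "a ) 24 , b ) 120 , c ) 625 , d ) 720 , e ) 1024",
    "correct": "c",
}
choices = parse_options(record["options"])
answer_text = choices[record["correct"]]
print(answer_text)
```

Records whose option text deviates from this layout would need extra handling, so it is worth validating the parsed labels before relying on them.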
Data Fields
- Problem: problem description (string)
- Rationale: reasoning process (string)
- options: answer options (string)
- correct: correct answer label (string)
- annotated_formula: annotated formula (string)
- linear_formula: linear formula (string)
- category: category label (string)
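The `annotated_formula` field holds a nested operation expression such as `power(5, 4)`, which can be evaluated to check it against the labeled answer. The sketch below parses the expression with Python's `ast` module instead of calling `eval`. Only `power` is confirmed by the example on this page; the other operation names in the table, and the helper `evaluate_formula` itself, are assumptions made for illustration. Formulas that use operations or named constants outside this table raise an error.

```python
import ast
import operator

# Assumed operation table: only `power` appears in the example on this page;
# the remaining names are illustrative and may not match the dataset exactly.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "power": operator.pow,
}


def _eval_node(node):
    # Numeric literal, e.g. 5 or 2.5.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    # Operation call, e.g. power(5, 4); arguments may themselves be calls.
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        func = OPERATIONS.get(node.func.id)
        if func is None:
            raise ValueError(f"unsupported operation: {node.func.id}")
        return func(*[_eval_node(arg) for arg in node.args])
    raise ValueError("unsupported expression in formula")


def evaluate_formula(formula):
    """Evaluate a nested formula string without using eval()."""
    tree = ast.parse(formula.strip(), mode="eval")
    return _eval_node(tree.body)


print(evaluate_formula("power(5, 4)"))  # 625, matching option c in the example
```

Walking the syntax tree keeps evaluation restricted to the listed operations, which matters when formulas come from an external dataset.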
Data Splits
| Split | Train | Validation | Test |
|---|---|---|---|
| Size | 29,837 | 4,475 | 2,985 |
Dataset Creation
Dataset Information
- Download Size: 7,302,821 bytes
- Dataset Size: 22,965,979 bytes
Split Details
- Test Set: 1,844,184 bytes, 2,985 samples
- Train Set: 18,368,826 bytes, 29,837 samples
- Validation Set: 2,752,969 bytes, 4,475 samples
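The split figures can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers listed on this page; it sums the per-split sizes and sample counts and reports each split's share of the data. The variable names are illustrative.

```python
# Figures copied from the Split Details list above.
splits = {
    "train": {"samples": 29_837, "bytes": 18_368_826},
    "validation": {"samples": 4_475, "bytes": 2_752_969},
    "test": {"samples": 2_985, "bytes": 1_844_184},
}

total_samples = sum(info["samples"] for info in splits.values())
total_bytes = sum(info["bytes"] for info in splits.values())

# The byte total should equal the listed Dataset Size of 22,965,979 bytes.
print(f"total samples: {total_samples}, total bytes: {total_bytes}")

for name, info in splits.items():
    share = info["samples"] / total_samples
    avg_bytes = info["bytes"] / info["samples"]
    print(f"{name}: {share:.1%} of samples, about {avg_bytes:.0f} bytes per sample")
```

The sample total of 37,297 falls inside the stated 10 K < n < 100 K size range, and the shares work out to roughly 80% train, 12% validation, and 8% test.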
License Information
The dataset is released under the Apache License, Version 2.0.
Citation
@inproceedings{amini-etal-2019-mathqa,
title = "{M}ath{QA}: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms",
author = "Amini, Aida and
Gabriel, Saadia and
Lin, Shanchuan and
Koncel-Kedziorski, Rik and
Choi, Yejin and
Hajishirzi, Hannaneh",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1245",
doi = "10.18653/v1/N19-1245",
pages = "2357--2367",
}