
agicorp/MathInstruct

MathInstruct is a curated, lightweight instruction‑tuning dataset for mathematical reasoning. It aggregates 13 math reasoning datasets, six of which were newly curated by the authors. The dataset combines chain‑of‑thought (CoT) and program‑of‑thought (PoT) rationales and covers a broad range of mathematical domains. It targets English text generation and falls in the 100 k–1 M examples size category. The MAmmoTH models trained on it are based on Llama‑2 and Code Llama and range from 7 B to 70 B parameters. Licenses for the individual subsets are listed under License Details.

Source: hugging_face

Created: Nov 28, 2025

Updated: Mar 23, 2024

Dataset Overview

Name: MathInstruct

License: MIT

Task Category: Text Generation

Language: English

Size Category: 100 k–1 M examples

Tags: Mathematics

Dataset Details

  • Source: MathInstruct aggregates 13 math reasoning datasets, six of which are newly compiled in this work.
  • Features: Emphasizes a hybrid of chain‑of‑thought (CoT) and program‑of‑thought (PoT) reasoning, covering a wide range of mathematical fields.
  • Models:
    • Base Models: Llama‑2 and Code Llama
    • Model Variants:
      • 7B: MAmmoTH‑7B, MAmmoTH‑Coder‑7B
      • 13B: MAmmoTH‑13B, MAmmoTH‑Coder‑13B
      • 34B: MAmmoTH‑Coder‑34B
      • 70B: MAmmoTH‑70B
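
To make the CoT/PoT distinction concrete, the sketch below contrasts the two rationale styles on one toy problem. Both records are hypothetical and written for illustration; they are not taken from the dataset, and the `instruction`/`output` field names are an assumption rather than a confirmed schema. A CoT rationale states the answer in prose, while a PoT rationale is a program whose printed result is the answer.

```python
import contextlib
import io

# Hypothetical records for illustration only; not drawn from MathInstruct.
# The "instruction"/"output" field names are assumed, not a confirmed schema.
cot_example = {
    "instruction": "A box holds 3 red and 5 blue marbles. How many marbles are in the box?",
    "output": "There are 3 red and 5 blue marbles, so 3 + 5 = 8. The answer is 8.",
}
pot_example = {
    "instruction": "A box holds 3 red and 5 blue marbles. How many marbles are in the box?",
    "output": "red = 3\nblue = 5\nprint(red + blue)",
}


def run_pot(program: str) -> str:
    """Execute a PoT-style program and return whatever it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # exec runs arbitrary code: only use it on programs you trust,
        # or run it inside a sandbox.
        exec(program, {})
    return buffer.getvalue().strip()


if __name__ == "__main__":
    print("CoT answer text:", cot_example["output"])
    print("PoT executed result:", run_pot(pot_example["output"]))
```

In a PoT setup the arithmetic is delegated to the interpreter, so the model only has to produce a correct program; a CoT rationale carries out the computation in the text itself.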

License Details

  • GSM8K: MIT
  • GSM8K‑RFT: Not listed
  • AQuA‑RAT: Apache‑2.0
  • MATH: MIT
  • TheoremQA: MIT
  • Camel‑Math: CC BY‑NC 4.0 (Attribution‑NonCommercial 4.0 International)
  • NumGLUE: Apache‑2.0
  • MathQA: Apache‑2.0
  • Our Curated: MIT
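
Because the subsets carry different licenses, it can help to select subsets by license before training. The following is a minimal sketch that encodes the list above as a dictionary and keeps only subsets under MIT or Apache‑2.0. The allow-list, the `permissive_subsets` helper, and the dictionary layout are assumptions of this sketch, not part of the dataset; subset names use plain ASCII hyphens.

```python
# Licenses transcribed from the License Details list above.
SUBSET_LICENSES = {
    "GSM8K": "MIT",
    "GSM8K-RFT": None,  # license not listed
    "AQuA-RAT": "Apache-2.0",
    "MATH": "MIT",
    "TheoremQA": "MIT",
    "Camel-Math": "CC BY-NC 4.0",  # non-commercial
    "NumGLUE": "Apache-2.0",
    "MathQA": "Apache-2.0",
    "Our Curated": "MIT",
}

# Assumed allow-list for this sketch; adjust to your own requirements.
PERMISSIVE = {"MIT", "Apache-2.0"}


def permissive_subsets(licenses):
    """Return the sorted names of subsets whose license is in the allow-list."""
    return sorted(name for name, lic in licenses.items() if lic in PERMISSIVE)


if __name__ == "__main__":
    for name in permissive_subsets(SUBSET_LICENSES):
        print(name, "->", SUBSET_LICENSES[name])
```

Subsets with an unlisted license (GSM8K‑RFT) or a non-commercial license (Camel‑Math) are excluded by this filter and would need a separate review.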

Citation

@article{yue2023mammoth,
  title={MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning},
  author={Xiang Yue and Xingwei Qu and Ge Zhang and Yao Fu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  journal={arXiv preprint arXiv:2309.05653},
  year={2023}
}