MathInstruct is a lightweight yet versatile instruction-tuning dataset for mathematical reasoning. It is compiled from 13 math reasoning datasets, six of which were newly curated by its authors. Its distinguishing feature is a hybrid of two rationale styles: chain-of-thought (CoT), where the solution is reasoned out step by step in natural language, and program-of-thought (PoT), where the solution is written as a program whose execution produces the answer. Together they give broad coverage across mathematical domains. The dataset targets text generation, primarily in English, and contains between 100K and 1M examples. It is associated with models built on Llama-2 and Code Llama, ranging from 7B to 70B parameters. License information is provided for each subset.
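To make the CoT/PoT distinction concrete, here is a minimal Python sketch. The two records, their field names (`instruction`, `output`), the "The answer is" marker, and the helpers `run_pot` and `cot_answer` are all hypothetical and chosen for illustration; they are not taken from the dataset itself. The sketch only shows the general idea: a CoT rationale states its answer in prose, while a PoT rationale has to be executed to obtain one.

```python
import contextlib
import io

# Hypothetical records: field names and contents are illustrative only,
# not copied from MathInstruct.
cot_example = {
    "instruction": "A train travels at 60 km/h for 2.5 hours. How far does it go?",
    "output": "Distance = speed x time = 60 x 2.5 = 150. The answer is 150.",
}
pot_example = {
    "instruction": "A train travels at 60 km/h for 2.5 hours. How far does it go? "
                   "Write a program.",
    "output": "speed = 60\nhours = 2.5\nprint(speed * hours)",
}


def cot_answer(rationale: str) -> str:
    """Pull the final answer out of a CoT rationale (assumes a fixed marker)."""
    marker = "The answer is"
    return rationale.rsplit(marker, 1)[1].strip().rstrip(".")


def run_pot(program: str) -> str:
    """Execute a PoT rationale and return whatever it prints.

    exec() is acceptable here only because the input is a trusted toy string;
    untrusted programs would need a sandbox.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue().strip()


print(cot_answer(cot_example["output"]))  # answer read from the prose
print(run_pot(pot_example["output"]))     # answer computed by running the program
```

Both styles arrive at the same value here, but the PoT route delegates the arithmetic to the interpreter, which is the usual motivation for mixing the two styles in one training set.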