Dataset asset · Open Source Community · Model Evaluation · Multidisciplinary Learning
MMLU
MMLU is a multiple‑choice benchmark for measuring language‑model knowledge across 57 subjects spanning the humanities, social sciences, STEM, and other fields. It is commonly used to evaluate both few‑shot pretrained models and models fine‑tuned on multiple tasks.
Source
arXiv
Created
Nov 28, 2025
Updated
Apr 28, 2026
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
Basic Information
- Dataset Name: Measuring Massive Multitask Language Understanding
- Release Year: 2021
- Conference: International Conference on Learning Representations (ICLR)
- Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
- Paper: Measuring Massive Multitask Language Understanding
- Download: https://people.eecs.berkeley.edu/~hendrycks/data.tar
Content
- Purpose: Evaluate large‑scale multitask language understanding capabilities.
- Includes: evaluation code (querying models through the OpenAI API) and the test data; a minimal loading sketch follows below.
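The download archive above unpacks into per‑subject CSV files. The following is a minimal loading sketch, not the official evaluation script; it assumes the extracted layout is `data/{dev,val,test}/{subject}_{split}.csv` with headerless rows of question, four answer choices, and the answer letter, which should be checked against the actual archive.

```python
# Minimal loading sketch (not the official evaluation code).
# Assumed layout: data/{dev,val,test}/{subject}_{split}.csv, headerless rows of
# question, choice A, choice B, choice C, choice D, answer letter.
import csv
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"

def download_and_extract(dest: Path = Path(".")) -> Path:
    """Fetch the MMLU archive (if missing) and unpack it under dest."""
    tar_path = dest / "data.tar"
    if not tar_path.exists():
        urllib.request.urlretrieve(DATA_URL, tar_path)
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest)
    return dest / "data"

def load_subject(data_dir: Path, subject: str, split: str = "test"):
    """Return a list of (question, choices, answer) tuples for one subject."""
    rows = []
    path = data_dir / split / f"{subject}_{split}.csv"
    with open(path, newline="", encoding="utf-8") as f:
        for question, a, b, c, d, answer in csv.reader(f):
            rows.append((question, [a, b, c, d], answer))
    return rows

if __name__ == "__main__":
    data_dir = download_and_extract()
    examples = load_subject(data_dir, "abstract_algebra")  # any subject name from the archive
    print(len(examples), examples[0])
```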
Test Results
- Models and scores (accuracy, %); an evaluation sketch follows the table:
| Model | Reference | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|---|
| Chinchilla (70B, few‑shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
| Gopher (280B, few‑shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
| GPT‑3 (175B, fine‑tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
| flan‑T5‑xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
| UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| GPT‑3 (175B, few‑shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| GPT‑3 (6.7B, fine‑tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
| flan‑T5‑large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
| flan‑T5‑base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
| GPT‑2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| flan‑T5‑small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
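The scores above are accuracy on four‑way multiple‑choice questions, which is why the random baseline sits at 25.0. The sketch below illustrates a few‑shot evaluation loop in this style; the prompt header follows the format used by the original evaluation code, but shot selection, answer extraction, and per‑category weighting should be taken from the official repository, and `model_answer` is a hypothetical stand‑in for whatever model call you use.

```python
# Hedged evaluation sketch: builds a k-shot prompt in the style of the original
# MMLU harness and scores predictions by exact-match accuracy on the answer letter.
# The model call is a placeholder; plug in your own API or local model.
from typing import Callable, List, Sequence, Tuple

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_example(question: str, choices: Sequence[str], answer: str | None = None) -> str:
    """Render one question as the question text, lettered choices, and an 'Answer:' line."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, shots: List[Tuple[str, Sequence[str], str]],
                 question: str, choices: Sequence[str]) -> str:
    """k-shot prompt: instruction header, k worked examples, then the target question."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    body = "\n\n".join(format_example(q, c, a) for q, c, a in shots)
    return header + body + "\n\n" + format_example(question, choices)

def accuracy(examples: List[Tuple[str, Sequence[str], str]],
             shots: List[Tuple[str, Sequence[str], str]],
             subject: str,
             model_answer: Callable[[str], str]) -> float:
    """Fraction of questions where the model's predicted letter matches the gold answer."""
    correct = 0
    for question, choices, gold in examples:
        prompt = build_prompt(subject, shots, question, choices)
        if model_answer(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(examples)
```

Per‑subject accuracies from a loop like this are then aggregated within each category to produce the Humanities, Social Sciences, STEM, and Other columns; consult the official code for exactly how subjects are weighted.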
Citation Information
- Primary Dataset Citation:
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
- Related Dataset Citation:
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}