Dataset asset · Open Source Community · Model Evaluation · Multidisciplinary Learning
MMLU
MMLU is a multiple‑choice benchmark for measuring language‑model knowledge across 57 subjects spanning the humanities, social sciences, STEM, and other fields. It is commonly used to evaluate both few‑shot pretrained models and models fine‑tuned on multiple tasks.
Source
arXiv
Created
Nov 28, 2025
Updated
Apr 28, 2026
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
Basic Information
- Dataset Name: Measuring Massive Multitask Language Understanding
- Release Year: 2021
- Conference: International Conference on Learning Representations (ICLR)
- Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
- Paper: Measuring Massive Multitask Language Understanding
- Download: https://people.eecs.berkeley.edu/~hendrycks/data.tar
Content
- Purpose: Evaluate large‑scale multitask language understanding capabilities.
- Includes: evaluation code (querying models through the OpenAI API) and the test data; a minimal loading sketch follows below.
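The download archive above unpacks into per‑subject CSV files. The following is a minimal loading sketch, not the official evaluation script; it assumes the extracted layout is `data/{dev,val,test}/{subject}_{split}.csv` with headerless rows of question, four answer choices, and the answer letter, which should be checked against the actual archive.

```python
# Minimal loading sketch (not the official evaluation code).
# Assumed layout: data/{dev,val,test}/{subject}_{split}.csv, headerless rows of
# question, choice A, choice B, choice C, choice D, answer letter.
import csv
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"

def download_and_extract(dest: Path = Path(".")) -> Path:
    """Fetch the MMLU archive (if missing) and unpack it under dest."""
    tar_path = dest / "data.tar"
    if not tar_path.exists():
        urllib.request.urlretrieve(DATA_URL, tar_path)
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest)
    return dest / "data"

def load_subject(data_dir: Path, subject: str, split: str = "test"):
    """Return a list of (question, choices, answer) tuples for one subject."""
    rows = []
    path = data_dir / split / f"{subject}_{split}.csv"
    with open(path, newline="", encoding="utf-8") as f:
        for question, a, b, c, d, answer in csv.reader(f):
            rows.append((question, [a, b, c, d], answer))
    return rows

if __name__ == "__main__":
    data_dir = download_and_extract()
    examples = load_subject(data_dir, "abstract_algebra")  # any subject name from the archive
    print(len(examples), examples[0])
```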
Test Results
- Models and scores (accuracy, %); an evaluation sketch follows the table:
| Model | Reference | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|---|
| Chinchilla (70B, few‑shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
| Gopher (280B, few‑shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
| GPT‑3 (175B, fine‑tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
| flan‑T5‑xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
| UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| GPT‑3 (175B, few‑shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| GPT‑3 (6.7B, fine‑tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
| flan‑T5‑large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
| flan‑T5‑base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
| GPT‑2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| flan‑T5‑small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
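The scores above are accuracy on four‑way multiple‑choice questions, which is why the random baseline sits at 25.0. The sketch below illustrates a few‑shot evaluation loop in this style; the prompt header follows the format used by the original evaluation code, but shot selection, answer extraction, and per‑category weighting should be taken from the official repository, and `model_answer` is a hypothetical stand‑in for whatever model call you use.

```python
# Hedged evaluation sketch: builds a k-shot prompt in the style of the original
# MMLU harness and scores predictions by exact-match accuracy on the answer letter.
# The model call is a placeholder; plug in your own API or local model.
from typing import Callable, List, Sequence, Tuple

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_example(question: str, choices: Sequence[str], answer: str | None = None) -> str:
    """Render one question as the question text, lettered choices, and an 'Answer:' line."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, shots: List[Tuple[str, Sequence[str], str]],
                 question: str, choices: Sequence[str]) -> str:
    """k-shot prompt: instruction header, k worked examples, then the target question."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    body = "\n\n".join(format_example(q, c, a) for q, c, a in shots)
    return header + body + "\n\n" + format_example(question, choices)

def accuracy(examples: List[Tuple[str, Sequence[str], str]],
             shots: List[Tuple[str, Sequence[str], str]],
             subject: str,
             model_answer: Callable[[str], str]) -> float:
    """Fraction of questions where the model's predicted letter matches the gold answer."""
    correct = 0
    for question, choices, gold in examples:
        prompt = build_prompt(subject, shots, question, choices)
        if model_answer(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(examples)
```

Per‑subject accuracies from a loop like this are then aggregated within each category to produce the Humanities, Social Sciences, STEM, and Other columns; consult the official code for exactly how subjects are weighted.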
Citation Information
- Primary Dataset Citation:
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
- Related Dataset Citation:
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}