Dataset asset · Open Source Community · Model Evaluation · Multidisciplinary Learning

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language-model performance on multiple-choice questions drawn from 57 subjects spanning the humanities, social sciences, STEM, and other areas. It is also widely used to assess models that have been fine-tuned on multiple tasks.

Source
arXiv
Created
Nov 28, 2025
Updated
Apr 28, 2026
Availability
Linked source ready
Overview

Content

  • Purpose: Evaluate large‑scale multitask language understanding capabilities.
  • Includes: OpenAI API evaluation code and test data.
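The benchmark scores models on four-way multiple-choice questions. As a rough illustration of the idea only (not the repository's actual evaluation code), a prompt-formatting and accuracy-scoring sketch might look like the following; the sample question and scoring loop are invented for this example:

```python
# Illustrative sketch of MMLU-style evaluation: format a four-way
# multiple-choice item as a prompt, then score letter predictions.
# The sample item below is made up, not drawn from the real test set.

CHOICES = ["A", "B", "C", "D"]

def format_question(question, options, answer=None):
    """Render one multiple-choice item; include the answer for few-shot demos."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def accuracy(predictions, answers):
    """Fraction of predicted answer letters that match the gold letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical item: question text, four options, gold answer letter.
prompt = format_question("What is 2 + 2?", ["3", "4", "5", "6"])
print(prompt)                       # ends with an unanswered "Answer:" line
print(accuracy(["B"], ["B"]))       # → 1.0
```

In few-shot evaluation, several solved items (with their answer letters filled in) are concatenated before the unanswered test item in the same format.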

Test Results

  • Models and Scores:
| Model | Authors | Humanities | Social Sciences | STEM | Others | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Chinchilla (70B, few-shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
| Gopher (280B, few-shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
| GPT-3 (175B, fine-tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
| flan-T5-xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
| UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| GPT-3 (175B, few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| GPT-3 (6.7B, fine-tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
| flan-T5-large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
| flan-T5-base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
| GPT-2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| flan-T5-small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
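Note that the Average column is not the simple mean of the four category scores shown (e.g., Chinchilla's four scores average to about 67.9, not 67.5), consistent with subjects contributing different numbers of questions to the overall score. A minimal sketch of count-weighted aggregation, with entirely made-up accuracies and question counts:

```python
# Sketch of aggregating per-subject accuracies into an overall score,
# weighting each subject by its question count. The weighting convention
# and all numbers below are illustrative assumptions, not MMLU's data.

def weighted_average(scores, counts):
    """Question-count-weighted mean of per-subject accuracies."""
    total = sum(counts)
    return sum(s * c for s, c in zip(scores, counts)) / total

subject_acc   = [0.70, 0.55, 0.80]   # hypothetical per-subject accuracy
subject_sizes = [100, 250, 150]      # hypothetical question counts

print(round(weighted_average(subject_acc, subject_sizes), 3))  # → 0.655
```

With equal counts this reduces to the plain mean; unequal counts pull the overall score toward the larger subjects, which is why category means and the reported average can differ.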

Citation Information

  • Primary Dataset Citation:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
  • Related Dataset Citation:
@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}