MMLU
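MMLU (Measuring Massive Multitask Language Understanding) is a multiple-choice benchmark that evaluates language models across 57 subjects grouped into humanities, social sciences, STEM, and other areas. It is used to assess both few-shot models and models fine-tuned on multiple tasks.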
Description
Dataset Overview
Basic Information
- Dataset Name: Measuring Massive Multitask Language Understanding
- Release Year: 2021
- Conference: International Conference on Learning Representations (ICLR)
- Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
- Paper: Measuring Massive Multitask Language Understanding
- Download: https://people.eecs.berkeley.edu/~hendrycks/data.tar
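As a rough illustration of working with the downloaded data, the sketch below parses question rows from CSV text. It assumes each extracted file is a header-less CSV with six columns (question, four options, answer letter); this page does not state that layout, so check it against the actual files. The `Question` type, the `parse_questions` helper, and the sample rows are hypothetical and made up for this example.

```python
import csv
import io
from typing import NamedTuple


class Question(NamedTuple):
    prompt: str
    choices: list[str]  # four options, in A-D order
    answer: str         # one of "A", "B", "C", "D"


def parse_questions(csv_text: str) -> list[Question]:
    """Parse header-less rows of the assumed form: question, A, B, C, D, answer."""
    questions = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 6:
            continue  # skip blank or malformed rows
        questions.append(Question(row[0], row[1:5], row[5].strip()))
    return questions


# Made-up sample rows, not taken from the dataset.
sample = (
    "What is 2 + 2?,3,4,5,6,B\n"
    '"Which gas, by volume, dominates air?",Oxygen,Nitrogen,Argon,Helium,B\n'
)
parsed = parse_questions(sample)
```

Using the `csv` module rather than splitting on commas keeps quoted questions that contain commas intact, as in the second sample row.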
Content
- Purpose: Evaluate large‑scale multitask language understanding capabilities.
- Includes: OpenAI API evaluation code and test data.
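Few-shot evaluation of this kind typically prepends a handful of solved questions to the one being asked. The sketch below shows one way to build such a prompt. The header wording, the `Answer:` convention, and the `format_example` and `build_prompt` helpers are assumptions for illustration and are not taken from the evaluation code mentioned above.

```python
CHOICE_LABELS = ("A", "B", "C", "D")


def format_example(question, choices, answer=None):
    """Render one question; leave the answer blank when the model must fill it in."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, question, choices):
    """Join a header, the solved examples (shots), and the open question."""
    # Assumed header wording; not confirmed by this page.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [header]
    blocks += [format_example(q, c, a) for q, c, a in shots]
    blocks.append(format_example(question, choices))
    return "\n\n".join(blocks)


# Made-up questions, not taken from the dataset.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt("elementary mathematics", shots, "What is 3 + 3?", ["5", "6", "7", "8"])
```

The prompt ends with an open `Answer:` so that the model's next token can be compared against the four option letters.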
Test Results
- Models and Scores (accuracy, %):
| Model | Authors | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|---|
| Chinchilla (70B, few‑shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
| Gopher (280B, few‑shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
| GPT‑3 (175B, fine‑tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
| flan‑T5‑xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
| UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| GPT‑3 (175B, few‑shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| GPT‑3 (6.7B, fine‑tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
| flan‑T5‑large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
| flan‑T5‑base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
| GPT‑2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| flan‑T5‑small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
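The 25.0 random baseline is what uniform guessing scores when every question has four options. The Average column is not the plain mean of the four category columns (for Chinchilla that mean would be about 67.9, not 67.5), which is consistent with averaging over all questions so that larger categories weigh more. The sketch below illustrates that distinction on made-up predictions; the `accuracy` and `summarize` helpers are hypothetical and are not the benchmark's scoring code.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def summarize(results):
    """results maps category name -> (predictions, answers).

    Returns per-category accuracy and the accuracy over all questions pooled.
    """
    per_category = {name: accuracy(p, a) for name, (p, a) in results.items()}
    all_preds = [p for preds, _ in results.values() for p in preds]
    all_answers = [a for _, answers in results.values() for a in answers]
    return per_category, accuracy(all_preds, all_answers)


# Made-up predictions for two categories of different sizes.
results = {
    "STEM": (["A", "B", "C", "D"], ["A", "B", "D", "D"]),
    "Humanities": (["A", "A"], ["A", "B"]),
}
per_category, overall = summarize(results)
plain_mean = sum(per_category.values()) / len(per_category)
random_baseline = 100.0 / 4  # uniform guessing over four options
```

Here the pooled accuracy (4 of 6 correct) differs from the plain mean of the two category scores, because the larger category contributes more questions.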
Citation Information
- Primary Dataset Citation:
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
- Related Dataset Citation:
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
Topics
Model Evaluation
Multidisciplinary Learning
Source
Organization: arXiv
Created: Unknown