
MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language models on multiple-choice questions spanning many subjects, grouped into humanities, social sciences, STEM, and other areas. It is used to assess both few-shot and fine-tuned models.


Description

Dataset Overview

Basic Information

Content

  • Purpose: Evaluate large‑scale multitask language understanding capabilities.
  • Includes: OpenAI API evaluation code and test data.
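As an illustration of how a test item might be turned into a few-shot prompt, here is a minimal sketch. The item layout (a question, four options, and an answer letter), the header wording, and the function names `format_question` and `build_prompt` are assumptions made for this example; they are not taken from this page and may differ from the evaluation code that ships with the dataset.

```python
CHOICES = ["A", "B", "C", "D"]  # assumed: four answer options per question


def format_question(question, options, answer=None):
    """Render one item; include the answer letter only for few-shot examples."""
    lines = [question]
    for letter, option in zip(CHOICES, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, examples, target):
    """Build a prompt from solved examples followed by one unsolved question.

    examples: list of (question, options, answer_letter)
    target:   (question, options)
    """
    # Hypothetical header wording, chosen for this sketch.
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = "\n\n".join(format_question(q, opts, ans) for q, opts, ans in examples)
    query = format_question(target[0], target[1])
    return header + (shots + "\n\n" if shots else "") + query


prompt = build_prompt(
    "astronomy",
    [("Which planet is closest to the Sun?", ["Venus", "Mercury", "Mars", "Earth"], "B")],
    ("Which planet has the most mass?", ["Jupiter", "Saturn", "Earth", "Neptune"]),
)
print(prompt)
```

The prompt ends with a bare "Answer:" so that the model's next token can be compared against the four letters.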

Test Results

  • Models and Scores (accuracy, %):

Model                       Authors                Humanities  Social Sciences  STEM  Others  Average
Chinchilla (70B, few‑shot)  Hoffmann et al., 2022  63.6        79.3             54.9  73.9    67.5
Gopher (280B, few‑shot)     Rae et al., 2021       56.2        71.9             47.4  66.1    60.0
GPT‑3 (175B, fine‑tuned)    Brown et al., 2020     52.5        63.9             41.4  57.9    53.9
flan‑T5‑xl                  Chung et al., 2022     46.3        57.7             39.0  55.1    49.3
UnifiedQA                   Khashabi et al., 2020  45.6        56.6             40.2  54.6    48.9
GPT‑3 (175B, few‑shot)      Brown et al., 2020     40.8        50.4             36.7  48.8    43.9
GPT‑3 (6.7B, fine‑tuned)    Brown et al., 2020     42.1        49.2             35.1  46.9    43.2
flan‑T5‑large               Chung et al., 2022     39.1        49.1             33.2  47.4    41.9
flan‑T5‑base                Chung et al., 2022     34.0        38.1             27.6  37.0    34.2
GPT‑2                       Radford et al., 2019   32.8        33.3             30.2  33.1    32.4
flan‑T5‑small               Chung et al., 2022     29.9        30.9             27.5  29.7    29.5
Random Baseline             N/A                    25.0        25.0             25.0  25.0    25.0
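The Average column is not the plain mean of the four category scores: for Chinchilla, (63.6 + 79.3 + 54.9 + 73.9) / 4 is about 67.9, not 67.5. One plausible explanation, stated here as an assumption, is that the average is taken over all questions, so categories with more questions weigh more. The sketch below computes per-category accuracy and a question-weighted overall accuracy; the record layout and the function name `accuracy_report` are hypothetical, and the toy records are not real benchmark data.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return (per-category accuracy in %, question-weighted overall accuracy in %).

    records: iterable of (category, predicted_letter, gold_letter) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    # Overall accuracy counts every question once, so larger categories weigh more.
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall


# Toy records for illustration only.
records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "C"),
    ("STEM", "D", "D"),
    ("Humanities", "C", "C"),
]
per_category, overall = accuracy_report(records)
unweighted_mean = sum(per_category.values()) / len(per_category)
print(per_category, overall, unweighted_mean)
```

In the toy data the question-weighted accuracy (3 of 4 correct) is lower than the unweighted mean of the two category scores, the same direction of gap seen between the Average column and the plain category mean.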

Citation Information

  • Primary Dataset Citation:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
  • Related Dataset Citation:
@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}


Topics

Model Evaluation
Multidisciplinary Learning

Source

Organization: arXiv

Created: Unknown
