MMLU
Model Evaluation · Multidisciplinary Learning
MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language‑model performance across a broad range of subjects using multiple‑choice questions. It is also commonly used to assess models that have been fine‑tuned on multiple tasks.
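As a sketch of how MMLU‑style evaluation is typically scored (exact‑match accuracy over multiple‑choice answer letters), assuming hypothetical question data rather than the real benchmark items:

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# The answer letters below are hypothetical examples, not real MMLU items.

def score(predictions, answers):
    """Exact-match accuracy over choice letters (A-D)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

gold = ["B", "D", "A", "C"]    # reference answer letters
preds = ["B", "D", "C", "C"]   # a model's predicted letters

accuracy = score(preds, gold)  # 3 of 4 correct -> 0.75
```

In practice a harness extracts the model's chosen letter from its output before scoring, but the final metric reduces to this kind of exact‑match accuracy.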
Source: arXiv · Updated Apr 28, 2026