A-Eval
Model Evaluation · Chat Large Language Models
A-Eval is a benchmark for evaluating chat large language models (LLMs) of various scales from an application-driven perspective. The dataset contains 678 question-answer pairs spanning 5 categories, 27 sub-categories, and 3 difficulty levels. A-Eval aims to give practical, empirically grounded guidance for selecting the best-suited model for a given real-world application.
Source: GitHub · Updated Aug 9, 2024