A-Eval
A-Eval is a benchmark for evaluating chat large language models (LLMs) of various scales from an application-driven perspective. The dataset contains 678 question-answer pairs spanning 5 categories, 27 sub-categories, and 3 difficulty levels. A-Eval provides clear empirical and engineering guidelines for selecting the "best" model for real-world applications.
Application‑Driven Task Classification
Figure: application-driven task classification of the 678 question-answer pairs into 5 categories, 27 sub-categories, and 3 difficulty levels.
Evaluation Results
We designed an automatic evaluation method based on Qwen1.5-72B-Chat to assess eight models of varying scales. An additional expert evaluation validated the reliability of the automatic method.
Average Accuracy
We present the average accuracy of models of different scales on A‑Eval.
- (a) Average accuracy of models of different scales across all tasks and difficulty levels. Dashed lines show expert evaluation results; solid lines show automatic evaluation results under different scoring thresholds T.
- (b) Average accuracy of models of different scales on easy, medium, and hard data. Dashed lines show expert evaluation results; solid lines show automatic evaluation results using thresholds 90 and 60.
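The threshold-based scoring described above can be sketched as follows. This is an illustrative reading of the method, not A-Eval's actual code: we assume the judge model assigns each answer a score in [0, 100], and an answer counts as correct when its score reaches the threshold T.

```python
def accuracy(scores, threshold):
    """Fraction of judge scores at or above the threshold T.

    scores: list of per-answer judge scores in [0, 100] (assumed scale).
    """
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# The same judge scores yield different accuracies under T = 60 vs T = 90,
# which is why the plots report both thresholds. Scores here are made up.
scores = [95, 72, 60, 40, 88]
acc_60 = accuracy(scores, 60)  # 4 of 5 answers score >= 60 -> 0.8
acc_90 = accuracy(scores, 90)  # 1 of 5 answers score >= 90 -> 0.2
```

A stricter threshold lowers measured accuracy for every model, but the relative ordering of models typically changes little, which is what the dashed-versus-solid comparison in the plots checks.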
Task‑wise Accuracy
For each specific task and its sub‑tasks, we show the average accuracy of models of different scales.
- (a) Accuracy when T = 60.
- (b) Accuracy when T = 90.
Model Selection
The best model is defined as the smallest‑scale model that meets the required accuracy. Using the evaluation results, users can easily identify the best model by drawing a horizontal line on the performance chart.
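The "horizontal line" selection rule above can be sketched in a few lines. This is a minimal illustration, assuming per-model accuracies are already available; the model names, scales, and numbers below are invented for the example.

```python
def best_model(results, required_accuracy):
    """Return the name of the smallest-scale model meeting the requirement.

    results: list of (name, scale_in_billions, accuracy) tuples.
    Returns None if no model reaches required_accuracy.
    """
    eligible = [r for r in results if r[2] >= required_accuracy]
    if not eligible:
        return None
    # Among models that clear the horizontal line, pick the smallest scale.
    return min(eligible, key=lambda r: r[1])[0]

# Hypothetical evaluation results (not from A-Eval):
results = [
    ("chat-7b", 7, 0.62),
    ("chat-14b", 14, 0.71),
    ("chat-72b", 72, 0.83),
]
best_model(results, 0.70)  # -> "chat-14b"
```

Drawing the horizontal line at the required accuracy and taking the leftmost (smallest) model that crosses it is exactly this `min` over the eligible set.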
Citation
If you use our benchmark or dataset in your research, please cite our paper.
```bibtex
@misc{lian2024best,
  title={What is the best model? Application-driven Evaluation for Large Language Models},
  author={Shiguo Lian and Kaikai Zhao and Xinhui Liu and Xuejiao Lei and Bikun Yang and Wenjing Zhang and Kai Wang and Zhaoxiang Liu},
  year={2024},
  eprint={2406.10307},
  archivePrefix={arXiv},
}
```
Source
Organization: github
Created: 8/9/2024