A-Eval
Introduction
A‑Eval is a benchmark that evaluates chat LLMs of different scales from an application‑driven perspective. The dataset contains 678 question‑answer pairs covering 5 categories, 27 sub‑categories, and 3 difficulty levels. A‑Eval offers explicit empirical and engineering guidance for choosing the most suitable model for practical use.
Application‑Driven Task Classification
(Figure: application-driven task classification — 678 question‑answer pairs across 5 categories, 27 sub‑categories, and 3 difficulty levels.)
Evaluation Results
Based on Qwen1.5-72B-Chat, we designed an automatic evaluation method to assess eight models of varying scales. An additional expert evaluation validated the reliability of the automatic method.
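A minimal sketch of what such judge-based scoring can look like is given below. The prompt format, the local OpenAI-compatible endpoint, and the 0–100 scoring convention are assumptions for illustration; they are not the exact A‑Eval judge setup.

```python
# Sketch of automatic evaluation with an LLM judge (hypothetical prompt and
# endpoint; not the official A-Eval judge prompt).
import re
from openai import OpenAI

# Assumes Qwen1.5-72B-Chat is served behind an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

JUDGE_PROMPT = (
    "You are a strict grader. Given a question, a reference answer, and a "
    "candidate answer, output a single integer score from 0 to 100.\n"
    "Question: {question}\nReference: {reference}\nCandidate: {candidate}\nScore:"
)

def judge_score(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model for a 0-100 score of the candidate answer."""
    resp = client.chat.completions.create(
        model="Qwen1.5-72B-Chat",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    match = re.search(r"\d+", resp.choices[0].message.content)
    return int(match.group()) if match else 0

def is_correct(score: int, threshold: int = 60) -> bool:
    """An answer counts as correct if its score reaches the threshold T."""
    return score >= threshold
```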
Average Accuracy
We present the average accuracy of models of different scales on A‑Eval.
- (a) Average accuracy of models of different scales across all tasks and difficulty levels. Dashed lines show expert evaluation results; solid lines show automatic evaluation results with different scoring thresholds T.
- (b) Average accuracy of models of different scales on easy, medium, and hard data. Dashed lines show expert evaluation results; solid lines show automatic evaluation results with thresholds T = 90 and T = 60.
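As a rough illustration, these averages can be reproduced by pooling per-question judge scores and counting answers at or above a threshold. The column names below ("model", "difficulty", "score") are assumed, not the official schema of the released data.

```python
# Sketch: average accuracy per model and difficulty level at thresholds T=60 and T=90.
# Column names are illustrative assumptions, not the official A-Eval schema.
import pandas as pd

def accuracy_table(scores: pd.DataFrame, thresholds=(60, 90)) -> pd.DataFrame:
    """`scores` has one row per (model, question) with a 0-100 judge score."""
    out = {}
    for t in thresholds:
        correct = scores.assign(correct=scores["score"] >= t)
        out[f"T={t}"] = correct.groupby(["model", "difficulty"])["correct"].mean()
    return pd.DataFrame(out)  # rows: (model, difficulty); columns: thresholds
```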
Task‑wise Accuracy
For each specific task and its sub‑tasks, we show the average accuracy of models of different scales.
- (a) Accuracy when T = 60.
- (b) Accuracy when T = 90.
Model Selection
The best model is defined as the smallest‑scale model that meets the required accuracy. Using the evaluation results, users can identify the best model by drawing a horizontal line at the required accuracy on the performance chart and choosing the smallest model whose curve lies above it, as in the sketch below.
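In code, this selection rule reduces to filtering models by the required accuracy and taking the smallest one. The model entries and numbers below are purely illustrative, not A‑Eval results.

```python
# Sketch of the selection rule: the "best" model is the smallest one whose
# accuracy meets the requirement. Entries are illustrative only.
from typing import NamedTuple, Optional

class ModelResult(NamedTuple):
    name: str
    params_b: float   # parameter count in billions
    accuracy: float   # accuracy on the target task, in percent

def best_model(results: list[ModelResult], required_accuracy: float) -> Optional[ModelResult]:
    """Return the smallest model at or above the required accuracy, if any."""
    qualified = [r for r in results if r.accuracy >= required_accuracy]
    return min(qualified, key=lambda r: r.params_b) if qualified else None

# Example: drawing the "horizontal line" at 80% accuracy.
results = [
    ModelResult("model-7B", 7, 74.0),    # made-up numbers for illustration
    ModelResult("model-14B", 14, 82.5),
    ModelResult("model-72B", 72, 88.0),
]
print(best_model(results, required_accuracy=80.0))  # -> model-14B
```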
Citation
If you use our benchmark or dataset in your research, please cite our paper.
```bibtex
@misc{lian2024best,
      title={What is the best model? Application-driven Evaluation for Large Language Models},
      author={Shiguo Lian and Kaikai Zhao and Xinhui Liu and Xuejiao Lei and Bikun Yang and Wenjing Zhang and Kai Wang and Zhaoxiang Liu},
      year={2024},
      eprint={2406.10307},
      archivePrefix={arXiv},
}
```