Dataset asset · Open Source Community · Model Evaluation · Chat Large Language Models

A-Eval

A-Eval is a benchmark for evaluating chat large language models (LLMs) of various scales from an application-driven perspective. The dataset contains 678 question-answer pairs spanning 5 categories, 27 sub-categories, and 3 difficulty levels. A-Eval provides clear empirical and engineering guidelines for selecting the "best" model for real-world applications.

Source
GitHub
Created
Aug 9, 2024
Updated
Aug 9, 2024
Overview

Application‑Driven Task Classification

  • 678 question‑answer pairs
  • 5 categories, 27 sub‑categories
  • 3 difficulty levels
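The classification above suggests that each question-answer pair carries a category, sub-category, and difficulty label. A minimal sketch of filtering such records, assuming a hypothetical field layout (the field names here are illustrative, not the dataset's official schema):

```python
# Hypothetical A-Eval record layout; field names are assumed, not official.
records = [
    {"question": "Summarize this article.", "answer": "reference summary",
     "category": "Text Generation", "sub_category": "Summarization",
     "difficulty": "easy"},
    {"question": "Solve the equation.", "answer": "reference solution",
     "category": "Logical Reasoning", "sub_category": "Math",
     "difficulty": "hard"},
]

def filter_records(records, category=None, difficulty=None):
    """Select QA pairs by category and/or difficulty level."""
    return [
        r for r in records
        if (category is None or r["category"] == category)
        and (difficulty is None or r["difficulty"] == difficulty)
    ]

hard_only = filter_records(records, difficulty="hard")
```

Slicing by category and difficulty like this is how the per-task and per-difficulty accuracies in the following sections would be computed.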

Evaluation Results

Based on Qwen1.5-72B-Chat, we designed an automatic evaluation method to assess eight models of varying scales. Additional expert evaluation validated the reliability of the automatic method.

Average Accuracy

We present the average accuracy of models of different scales on A‑Eval.

  • (a) Average accuracy of models of different scales across all tasks and difficulty levels. Dashed lines represent expert evaluation results, solid lines represent automatic evaluation results with different scoring thresholds T.
  • (b) Average accuracy of models of different scales on easy, medium, and hard data. Dashed lines represent expert evaluation results, solid lines represent automatic evaluation results using thresholds 90 and 60.
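The thresholds T referenced above imply a score-based notion of correctness: a judge model assigns each answer a score, and the answer counts as correct when the score meets the threshold. A minimal sketch of this accuracy computation, assuming a 0-100 judge score scale (the scoring details are an assumption, not the paper's exact protocol):

```python
def accuracy_at_threshold(scores, T):
    """Fraction of answers whose judge score meets threshold T (0-100 scale)."""
    if not scores:
        return 0.0
    return sum(s >= T for s in scores) / len(scores)

# Illustrative judge scores for five answers from one model.
scores = [95, 72, 61, 40, 88]
acc_60 = accuracy_at_threshold(scores, 60)  # 4 of 5 answers pass
acc_90 = accuracy_at_threshold(scores, 90)  # 1 of 5 answers passes
```

A higher threshold is a stricter definition of "correct", which is why the solid lines for T = 90 sit below those for T = 60.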

Task‑wise Accuracy

For each specific task and its sub‑tasks, we show the average accuracy of models of different scales.

  • (a) Accuracy when T = 60.
  • (b) Accuracy when T = 90.

Model Selection

The best model is defined as the smallest‑scale model that meets the required accuracy. Using the evaluation results, users can easily identify the best model by drawing a horizontal line on the performance chart.
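The "horizontal line" selection rule above can be sketched directly: among all models whose accuracy meets the requirement, pick the one with the fewest parameters. The model names and numbers below are illustrative placeholders, not A-Eval results:

```python
def best_model(results, required_accuracy):
    """Smallest-scale model that meets the required accuracy.

    results: list of (model_name, params_in_billions, accuracy).
    Returns the model name, or None if no model qualifies.
    """
    qualifying = [r for r in results if r[2] >= required_accuracy]
    if not qualifying:
        return None
    return min(qualifying, key=lambda r: r[1])[0]

# Illustrative (model, size, accuracy) triples -- not actual A-Eval numbers.
results = [("model-7B", 7, 0.62), ("model-14B", 14, 0.71), ("model-72B", 72, 0.83)]
choice = best_model(results, 0.70)  # smallest model at or above 70% accuracy
```

This captures the engineering trade-off the benchmark targets: the cheapest model that is good enough, rather than the most accurate model overall.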

Citation

If you use our benchmark or dataset in your research, please cite our paper.

@misc{lian2024best,
  title={What is the best model? Application-driven Evaluation for Large Language Models},
  author={Shiguo Lian and Kaikai Zhao and Xinhui Liu and Xuejiao Lei and Bikun Yang and Wenjing Zhang and Kai Wang and Zhaoxiang Liu},
  year={2024},
  eprint={2406.10307},
  archivePrefix={arXiv},
}
