
A-Eval

A-Eval is a benchmark for evaluating chat large language models (LLMs) of various scales from an application-driven perspective. The dataset contains 678 question-answer pairs spanning 5 categories, 27 sub-categories, and 3 difficulty levels. A-Eval provides clear empirical and engineering guidelines for selecting the "best" model for real-world applications.

Updated 8/9/2024
github

Description


Application‑Driven Task Classification

  • 678 question‑answer pairs
  • 5 categories, 27 sub‑categories
  • 3 difficulty levels
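The page does not document the dataset's file format, but each A‑Eval item presumably pairs a question with a reference answer plus category and difficulty labels. A minimal sketch of what a record and a simple grouping helper might look like (all field names here are illustrative assumptions, not the dataset's actual schema):

```python
# Hypothetical A-Eval record layout; field names are illustrative
# assumptions, not the dataset's documented schema.
record = {
    "question": "Summarize the following news article in one sentence.",
    "reference_answer": "...",
    "category": "Text Generation",    # one of the 5 categories
    "sub_category": "Summarization",  # one of the 27 sub-categories
    "difficulty": "medium",           # easy / medium / hard
}

def group_by_difficulty(records):
    """Bucket records by their difficulty label."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for r in records:
        buckets[r["difficulty"]].append(r)
    return buckets

buckets = group_by_difficulty([record])
print(len(buckets["medium"]))  # 1
```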

Evaluation Results

Using Qwen1.5-72B-Chat as the automatic judge, we designed an automatic evaluation method to assess eight models of varying scales. An additional expert evaluation validated the reliability of the automatic method.

Average Accuracy

We present the average accuracy of models of different scales on A‑Eval.

  • (a) Average accuracy of models of different scales across all tasks and difficulty levels. Dashed lines represent expert evaluation results; solid lines represent automatic evaluation results with different scoring thresholds T.
  • (b) Average accuracy of models of different scales on easy, medium, and hard data. Dashed lines represent expert evaluation results; solid lines represent automatic evaluation results using thresholds 90 and 60.

Task‑wise Accuracy

For each specific task and its sub‑tasks, we show the average accuracy of models of different scales.

  • (a) Accuracy when T = 60.
  • (b) Accuracy when T = 90.

Model Selection

The best model is defined as the smallest-scale model that meets the required accuracy. Given the evaluation results, users can identify it by drawing a horizontal line at the required accuracy on the performance chart and picking the smallest model whose curve lies above that line.
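The "horizontal line" rule translates directly into code: among the models whose accuracy clears the requirement, pick the one with the fewest parameters. A minimal sketch, with made-up model names and accuracy numbers rather than A‑Eval's measured results:

```python
def best_model(models, required_accuracy):
    """Return the smallest model whose accuracy meets the requirement.

    `models` maps model name -> (parameter count in billions, accuracy).
    Returns None when no model clears the bar.
    """
    eligible = [
        (params, name)
        for name, (params, acc) in models.items()
        if acc >= required_accuracy
    ]
    return min(eligible)[1] if eligible else None

# Illustrative numbers only -- not A-Eval's measured results.
models = {
    "chat-7b": (7, 0.62),
    "chat-14b": (14, 0.74),
    "chat-72b": (72, 0.81),
}
print(best_model(models, 0.70))  # chat-14b
print(best_model(models, 0.90))  # None
```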

Citation

If you use our benchmark or dataset in your research, please cite our paper.

```bibtex
@misc{lian2024best,
  title={What is the best model? Application-driven Evaluation for Large Language Models},
  author={Shiguo Lian and Kaikai Zhao and Xinhui Liu and Xuejiao Lei and Bikun Yang and Wenjing Zhang and Kai Wang and Zhaoxiang Liu},
  year={2024},
  eprint={2406.10307},
  archivePrefix={arXiv},
}
```



Topics

Model Evaluation
Chat Large Language Models

Source

Organization: github

Created: 8/9/2024
