Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category: Language Model Evaluation (4 datasets)

AlignBench

Language Model Evaluation · Evaluation Dataset

A comprehensive multidimensional benchmark for evaluating the alignment of large language models in Chinese. It employs a human‑in‑the‑loop data collection pipeline and uses rule‑calibrated multidimensional LLM‑as‑Judge with chain‑of‑thought reasoning to generate explanations and final ratings, ensuring high reliability and interpretability.

Source: arXiv · Updated: Dec 6, 2023 · 432 views

Chinese-SimpleQA

Natural Language Processing · Language Model Evaluation

Chinese SimpleQA is a comprehensive Chinese benchmark for evaluating factual correctness of language models on short questions. It features five characteristics: Chinese language, diversity, high quality, static references, and ease of evaluation. The dataset covers six major topics with 99 fine‑grained sub‑topics, spanning humanities to science and engineering, containing 3,000 high‑quality questions to help developers assess factual accuracy in Chinese and support algorithm research.

Source: huggingface · Updated: Nov 17, 2024 · 1,305 views
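Static references and ease of evaluation suggest grading by comparing a model's short answer against a fixed reference string. A toy sketch of that idea; the normalization scheme and function names are assumptions, not the benchmark's actual grading code:

```python
def normalized_match(prediction: str, reference: str) -> bool:
    """Compare a short answer against a static reference after stripping
    whitespace and lowercasing. A hypothetical stand-in for the benchmark's
    real grading, which this catalog entry does not specify."""
    norm = lambda s: "".join(s.split()).lower()
    return norm(prediction) == norm(reference)

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions whose normalized prediction matches its reference."""
    assert len(predictions) == len(references)
    matches = sum(normalized_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)
```

Because the references are static, the same scoring run is reproducible across models and over time, which is what makes a benchmark of this kind cheap to evaluate.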

Shopping MMLU

Online Shopping · Language Model Evaluation

Shopping MMLU is a large‑scale multi‑task online‑shopping benchmark created by Amazon to comprehensively evaluate large language models (LLMs) on shopping‑related tasks. It comprises 57 tasks covering four core shopping skills (concept understanding, knowledge reasoning, user‑behavior alignment, and multilingual capability), totaling 20,799 questions. The tasks were constructed from authentic Amazon data and reformulated as text generation to suit LLM‑based solutions. Shopping MMLU primarily targets LLM‑based online‑shopping assistants, which aim to improve the shopping experience by reducing task‑specific engineering effort and enabling interactive user dialogue.

Source: arXiv · Updated: Oct 28, 2024 · 306 views
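With 57 tasks grouped under four skills, reporting usually involves aggregating per-task scores within each skill. A small sketch of macro-averaging by skill; the task names and skill grouping below are hypothetical, not the benchmark's real task list:

```python
from collections import defaultdict

def skill_macro_average(task_scores: dict[str, float],
                        task_skill: dict[str, str]) -> dict[str, float]:
    """Average per-task scores within each shopping skill.
    task_scores maps task name -> score; task_skill maps task name -> skill.
    Both mappings here are illustrative assumptions."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for task, score in task_scores.items():
        buckets[task_skill[task]].append(score)
    return {skill: sum(vals) / len(vals) for skill, vals in buckets.items()}
```

Macro-averaging within a skill keeps a skill with many easy tasks from dominating the headline number, since each skill contributes one aggregate regardless of its task count.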

PediaBench

Pediatric Medicine · Language Model Evaluation

PediaBench is a Chinese dataset specifically designed to evaluate large language models (LLMs) on pediatric question‑answering tasks. Created by research teams at Guizhou University and East China Normal University, it contains 4,565 objective questions and 1,632 subjective questions covering 12 pediatric diseases. Sources include the Chinese National Medical Licensing Examination, university final exams, and pediatric diagnostic and treatment standards. The dataset was built by collecting questions from multiple reliable sources and applying comprehensive scoring criteria to assess LLMs in instruction following, knowledge understanding, and clinical case analysis. PediaBench addresses the lack of pediatric coverage in existing medical QA datasets, providing a thorough benchmark for LLMs in the pediatric domain.

Source: arXiv · Updated: Dec 9, 2024 · 212 views