JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.



AlignBench

Language Model Evaluation
Evaluation Dataset

A comprehensive multidimensional benchmark for evaluating the alignment of large language models in Chinese. It employs a human‑in‑the‑loop data collection pipeline and uses rule‑calibrated multidimensional LLM‑as‑Judge with chain‑of‑thought reasoning to generate explanations and final ratings, ensuring high reliability and interpretability.

Source: arXiv

Chinese-SimpleQA

Natural Language Processing
Language Model Evaluation

Chinese SimpleQA is a comprehensive Chinese benchmark for evaluating the factual correctness of language models on short questions. It has five key characteristics: Chinese-language focus, diversity, high quality, static reference answers, and ease of evaluation. The dataset covers six major topics with 99 fine‑grained sub‑topics, spanning the humanities through science and engineering, and contains 3,000 high‑quality questions to help developers assess factual accuracy in Chinese and support algorithm research.

Source: Hugging Face

Shopping MMLU

Online Shopping
Language Model Evaluation

Shopping MMLU is a large‑scale multi‑task online‑shopping benchmark created by Amazon to comprehensively evaluate large language models (LLMs) on shopping‑related tasks. It comprises 57 tasks covering four core shopping skills (concept understanding, knowledge reasoning, user‑behavior alignment, and multilingual ability), for a total of 20,799 questions. The tasks were constructed from authentic Amazon data and reformulated as text generation so they can be solved by LLMs. Shopping MMLU primarily targets online‑shopping assistants, aiming to improve the shopping experience by reducing task‑specific engineering effort and enabling interactive dialogue with users.

Source: arXiv

PediaBench

Pediatric Medicine
Language Model Evaluation

PediaBench is a Chinese dataset specifically designed to evaluate large language models (LLMs) on pediatric question‑answering tasks. Created by research teams at Guizhou University and East China Normal University, it contains 4,565 objective questions and 1,632 subjective questions covering 12 pediatric diseases. Sources include the Chinese National Medical Licensing Examination, university final exams, and pediatric diagnostic and treatment standards. The dataset was built by collecting questions from multiple reliable sources and applying comprehensive scoring criteria to assess LLMs on instruction following, knowledge understanding, and clinical case analysis. PediaBench addresses the lack of pediatric coverage in existing medical QA datasets, providing a thorough benchmark for LLMs in the pediatric domain.

Source: arXiv