Explore high-quality datasets for your AI and machine learning projects.
A comprehensive multidimensional benchmark for evaluating the alignment of large language models in Chinese. It collects data through a human‑in‑the‑loop pipeline and scores responses with a rule‑calibrated, multidimensional LLM‑as‑Judge that uses chain‑of‑thought reasoning to produce an explanation alongside each final rating, with the aim of making the ratings reliable and interpretable.
OlympiadBench is an Olympiad‑level bilingual multimodal scientific benchmark containing 8,476 questions drawn from Olympiad‑level mathematics and physics competitions, as well as the Chinese Gaokao (college entrance exam). Each question comes with expert‑level, step‑by‑step reasoning annotations. The dataset aims to challenge and advance AGI development through complex scientific problems. The best‑performing model evaluated, GPT‑4V, achieved an average score of only 17.97% on this benchmark, falling to 10.74% on the physics questions, which highlights both the benchmark's rigor and the difficulty of physical reasoning.