LiveBench
LiveBench is a large‑language‑model (LLM) benchmark created jointly by Abacus.AI, NYU, Nvidia, UMD, and USC. It contains 18 tasks spanning mathematics, programming, reasoning, language understanding, instruction following, and data analysis. LiveBench's questions are sourced from up‑to‑date materials such as recent math competitions, arXiv papers, news articles, and datasets, and answers are scored automatically against objective ground truth, eliminating the need for LLM or human judges. The benchmark is designed to mitigate the test‑set contamination that undermines static evaluations, so that scores reflect genuine model capability.
LiveBench Dataset Overview
Dataset Introduction
LiveBench is a benchmark specifically designed for large language models (LLMs) to avoid test‑set contamination and enable objective evaluation. Its key characteristics are:
- Regular Updates: New questions are released monthly, based on recent datasets, arXiv papers, news articles, and IMDb movie summaries.
- Objective Scoring: Every question has a verifiable, objective correct answer, allowing automatic accurate scoring without LLM judges.
- Diversity: Currently includes 18 distinct tasks across 6 categories, with more challenging tasks to be added regularly.
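Because every question carries a single verifiable answer, grading reduces to a normalized comparison against ground truth. The sketch below illustrates this judge‑free scoring idea; the `Question` class and `score` function are hypothetical helpers for illustration, not LiveBench's actual grading code.

```python
# Minimal sketch of objective, judge-free scoring: each question stores one
# verifiable ground-truth answer, so grading is a normalized exact-match check.
# (Question, normalize, and score are illustrative names, not LiveBench APIs.)
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    ground_truth: str  # one objective, verifiable answer


def normalize(answer: str) -> str:
    """Strip surrounding whitespace, trailing periods, and case so that
    formatting differences (e.g. ' 42. ' vs '42') do not affect the score."""
    return answer.strip().strip(".").lower()


def score(question: Question, model_answer: str) -> int:
    """Return 1 for a correct answer, 0 otherwise -- no LLM or human judge."""
    return int(normalize(model_answer) == normalize(question.ground_truth))


q = Question(prompt="What is 6 * 7?", ground_truth="42")
print(score(q, " 42. "))  # matches after normalization -> 1
print(score(q, "41"))     # wrong answer -> 0
```

Real tasks use richer per‑task parsers (e.g. extracting a boxed final answer), but the principle is the same: the score is fully determined by the stored ground truth.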
Dataset Content
LiveBench comprises multiple tasks covering the following categories:
- Reasoning
- Programming
- Mathematics
- Data Analysis
- Language
- Instruction Following
Dataset Usage
Users can evaluate their models by submitting an issue on GitHub or emailing livebench.ai@gmail.com.
Dataset Origin
LiveBench was developed collaboratively by:
- Abacus.AI: Colin White, Samuel Dooley, Manley Roberts, Arka Pal
- NYU: Ben Feuer, Ravid Shwartz‑Ziv, Chinmay Hegde, Yann LeCun, Micah Goldblum
- Nvidia: Siddhartha Jain
- UMD: Tom Goldstein
- USC: Willie Neiswanger
Citation
To cite the LiveBench dataset, use the following BibTeX entry:
@article{livebench,
  author  = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title   = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  journal = {arXiv preprint arXiv:2406.19314},
  year    = {2024},
}