SimpleQA-Bench
SimpleQA‑Bench combines the SimpleQA and Chinese‑SimpleQA datasets into a multiple‑choice question (MCQ) format. The original datasets contain a large amount of long‑tail, niche knowledge, so direct short‑answer accuracy is low. To make factuality evaluation easier, GPT‑4o generated three plausible but incorrect options for each question, converting the QA pairs into MCQ format. In total, 7,324 samples were converted; each record includes the dataset name, metadata, question, answer, messages, options, and the correct option ID.
Basic Information
- Language: English (en)
- License: MIT
- Tags: factuality, EN, ZH, short-form-answer, human-label
- Copyright: © 2024 alibaba‑pai
Data Sources
- SimpleQA: Blog & Paper / Data & simple‑evals Project
- Chinese‑SimpleQA: Blog & Paper, Data@HF
Dataset Description
- Format: Multiple‑choice question (MCQ)
- Processing: Merged the SimpleQA and Chinese‑SimpleQA collections, then converted each item to MCQ format; GPT‑4o generated three plausible incorrect options per question.
- Sample Count: 4,326 (SimpleQA) + 2,998 (Chinese‑SimpleQA) = 7,324 samples
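The conversion step can be sketched as follows. This is illustrative only: the card does not publish the actual conversion script, and the `build_mcq` helper and its shuffling policy are assumptions.

```python
import random

def build_mcq(question, correct_answer, distractors, rng=None):
    """Shuffle the human-verified answer in among three GPT-4o-generated
    distractors and record which option ID (A-D) is correct.
    Illustrative sketch; field names follow the dataset card."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    options = distractors + [correct_answer]
    rng.shuffle(options)
    answer_option = "ABCD"[options.index(correct_answer)]
    return {"question": question, "options": options, "answer_option": answer_option}

sample = build_mcq(
    "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "Michio Sugeno",
    ["Lotfi Zadeh", "John McCarthy", "Stephen Grossberg"],
)
print(sample["answer_option"])  # ID of the shuffled-in correct answer
```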
Data Fields
| Field | Description | SimpleQA Example | Chinese‑SimpleQA Example |
|---|---|---|---|
| dataset (str) | Dataset name | openai/SimpleQA | OpenStellarTeam/Chinese‑SimpleQA |
| metadata (str) | Metadata such as topics, source URLs | {"topic": "Science and technology", "answer_type": "Person", "urls": ["https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://ieeexplore.ieee.org/author/37271220500", "https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://www.nxtbook.com/nxtbooks/ieee/awards_2010/index.php?startid=21#/p/20"]} | {"id": "6fd2645ad3994c89a01acae98cf04f90", "primary_category": "Natural and Physical Sciences", "secondary_category": "Information Science", "urls": ["https://zh.wikipedia.org/wiki/%E8%92%99%E7%89%B9%E5%8D%A1%E6%B4%9B%E6%A0%91%E6%90%9C%E7%B4%A2"]} |
| question (str) | Question text | Who received the IEEE Frank Rosenblatt Award in 2010? | Which researcher first explored Monte‑Carlo tree search in their 1987 PhD thesis and first presented its key characteristics? |
| answer (str) | Human‑verified short answer | Michio Sugeno | Bruce Abramson |
| messages (List[Dict]) | Standard OpenAI messages used for answering the MCQ | [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "# Objective ... Answers: "}] | same |
| options (List[str]) | All answer options with IDs A/B/C/D | ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"] | ["Bruce Abramson", "Lennart Batsch‑Fischer", "Chris Watkins", "Martin Hansen"] |
| answer_option (str) | Correct option ID (A/B/C/D) | B | A |
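As a minimal illustration of how the options and answer_option fields fit together, the sketch below renders a record's options with their A–D IDs. The `format_options` helper is hypothetical, not part of the dataset.

```python
def format_options(options):
    # Label each option with its ID (A-D), matching the answer_option field.
    return "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))

record = {
    "question": "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "options": ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"],
    "answer_option": "B",
}
prompt = record["question"] + "\n" + format_options(record["options"])
print(prompt)
# The line labeled with record["answer_option"] ("B. Michio Sugeno") is the correct one.
```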
Prompts
- GEN_WA_PROMPT: Prompt used to generate MCQs, requesting three plausible incorrect answers.
- ANSWER_MCQ_PROMPT: Prompt used to answer MCQs, directing the model to select the correct option.
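When scoring responses to ANSWER_MCQ_PROMPT, the selected option ID has to be parsed out of the model reply before comparing it to answer_option. One possible parser (an assumption; the card does not specify how replies are graded):

```python
import re

def extract_choice(reply):
    """Return the first standalone option ID (A-D) in a model reply,
    or None if no option ID is found. Hypothetical helper; the exact
    grading logic is not published on the card."""
    m = re.search(r"\b([ABCD])\b", reply)
    return m.group(1) if m else None

print(extract_choice("The answer is B. Michio Sugeno"))  # -> B
```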
Performance Comparison
Accuracy (%) on the original short‑answer task versus the MCQ version; parentheses show correct/attempted counts.
| LLM | SimpleQA (4,326) | SimpleQA‑MCQ | Chinese‑SimpleQA (2,998) | Chinese‑SimpleQA‑MCQ |
|---|---|---|---|---|
| gpt‑4o‑mini‑2024‑07‑18 | 9.5 | 41.2 (1,781/4,326) | 37.6 | 52.9 (1,586/2,997) |
| qwen‑max | / | 52.5 (2,256/4,300) | 54.1 | 72.7 (2,177/2,996) |
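The MCQ accuracies in the table follow directly from the raw counts; a quick check:

```python
# correct/attempted counts taken from the MCQ columns of the table above
results = {
    "gpt-4o-mini-2024-07-18, SimpleQA-MCQ": (1781, 4326),
    "gpt-4o-mini-2024-07-18, Chinese-SimpleQA-MCQ": (1586, 2997),
    "qwen-max, SimpleQA-MCQ": (2256, 4300),
    "qwen-max, Chinese-SimpleQA-MCQ": (2177, 2996),
}
for name, (correct, total) in results.items():
    print(f"{name}: {100 * correct / total:.1f}")
# prints 41.2, 52.9, 52.5, 72.7 -- matching the table
```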
Source
- Organization: huggingface
- Created: 12/6/2024