SimpleQA-Bench
SimpleQA‑Bench combines the SimpleQA and Chinese‑SimpleQA datasets into a multiple‑choice question (MCQ) format. The original datasets contain a large amount of long‑tail, niche knowledge, so direct short‑answer accuracy is low. To make factuality evaluation easier, GPT‑4o generated three plausible but incorrect options for each question, converting the QA pairs into MCQ format. In total, 7,324 samples were converted; each record includes the dataset name, metadata, question, answer, messages, options, and the correct option ID.
Basic Information
- Language: English (en)
- License: MIT
- Tags: factuality, EN, ZH, short-form-answer, human-label
- Copyright: © 2024 alibaba‑pai
Data Sources
- SimpleQA: Blog & Paper / Data & simple‑evals Project
- Chinese‑SimpleQA: Blog & Paper, Data@HF
Dataset Description
- Format: Multiple‑choice question (MCQ)
- Processing: Merged the SimpleQA and Chinese‑SimpleQA collections, then converted each item to MCQ format; GPT‑4o generated three plausible incorrect options per question.
- Sample Count: 4,326 (SimpleQA) + 2,998 (Chinese‑SimpleQA) = 7,324 samples
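The conversion step can be sketched as follows. This is illustrative only: the card does not publish the actual conversion script, and the `build_mcq` helper and its shuffling policy are assumptions.

```python
import random

def build_mcq(question, correct_answer, distractors, rng=None):
    """Shuffle the human-verified answer in among three GPT-4o-generated
    distractors and record which option ID (A-D) is correct.
    Illustrative sketch; field names follow the dataset card."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    options = distractors + [correct_answer]
    rng.shuffle(options)
    answer_option = "ABCD"[options.index(correct_answer)]
    return {"question": question, "options": options, "answer_option": answer_option}

sample = build_mcq(
    "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "Michio Sugeno",
    ["Lotfi Zadeh", "John McCarthy", "Stephen Grossberg"],
)
print(sample["answer_option"])  # ID of the shuffled-in correct answer
```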
Data Fields
| Field | Description | SimpleQA Example | Chinese‑SimpleQA Example |
|---|---|---|---|
| dataset (str) | Dataset name | openai/SimpleQA | OpenStellarTeam/Chinese‑SimpleQA |
| metadata (str) | Metadata such as topics, source URLs | {"topic": "Science and technology", "answer_type": "Person", "urls": ["https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://ieeexplore.ieee.org/author/37271220500", "https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://www.nxtbook.com/nxtbooks/ieee/awards_2010/index.php?startid=21#/p/20"]} | {"id": "6fd2645ad3994c89a01acae98cf04f90", "primary_category": "Natural and Physical Sciences", "secondary_category": "Information Science", "urls": ["https://zh.wikipedia.org/wiki/%E8%92%99%E7%89%B9%E5%8D%A1%E6%B4%9B%E6%A0%91%E6%90%9C%E7%B4%A2"]} |
| question (str) | Question text | Who received the IEEE Frank Rosenblatt Award in 2010? | Which researcher first explored Monte‑Carlo tree search in their 1987 PhD thesis and first presented its key characteristics? |
| answer (str) | Human‑verified short answer | Michio Sugeno | Bruce Abramson |
| messages (List[Dict]) | Standard OpenAI messages used for answering the MCQ | [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "# Objective ... Answers: "}] | same |
| options (List[str]) | All answer options with IDs A/B/C/D | ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"] | ["Bruce Abramson", "Lennart Batsch‑Fischer", "Chris Watkins", "Martin Hansen"] |
| answer_option (str) | Correct option ID (A/B/C/D) | B | A |
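As a minimal illustration of how the options and answer_option fields fit together, the sketch below renders a record's options with their A–D IDs. The `format_options` helper is hypothetical, not part of the dataset.

```python
def format_options(options):
    # Label each option with its ID (A-D), matching the answer_option field.
    return "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))

record = {
    "question": "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "options": ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"],
    "answer_option": "B",
}
prompt = record["question"] + "\n" + format_options(record["options"])
print(prompt)
# The line labeled with record["answer_option"] ("B. Michio Sugeno") is the correct one.
```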
Prompts
- GEN_WA_PROMPT: Prompt used to generate MCQs, requesting three plausible incorrect answers.
- ANSWER_MCQ_PROMPT: Prompt used to answer MCQs, directing the model to select the correct option.
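When scoring responses to ANSWER_MCQ_PROMPT, the selected option ID has to be parsed out of the model reply before comparing it to answer_option. One possible parser (an assumption; the card does not specify how replies are graded):

```python
import re

def extract_choice(reply):
    """Return the first standalone option ID (A-D) in a model reply,
    or None if no option ID is found. Hypothetical helper; the exact
    grading logic is not published on the card."""
    m = re.search(r"\b([ABCD])\b", reply)
    return m.group(1) if m else None

print(extract_choice("The answer is B. Michio Sugeno"))  # -> B
```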
Performance Comparison
Accuracy (%) on the original short‑answer task versus the MCQ version; parentheses show correct/attempted counts.
| LLM | SimpleQA (4,326) | SimpleQA‑MCQ | Chinese‑SimpleQA (2,998) | Chinese‑SimpleQA‑MCQ |
|---|---|---|---|---|
| gpt‑4o‑mini‑2024‑07‑18 | 9.5 | 41.2 (1,781/4,326) | 37.6 | 52.9 (1,586/2,997) |
| qwen‑max | / | 52.5 (2,256/4,300) | 54.1 | 72.7 (2,177/2,996) |
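The MCQ accuracies in the table follow directly from the raw counts; a quick check:

```python
# correct/attempted counts taken from the MCQ columns of the table above
results = {
    "gpt-4o-mini-2024-07-18, SimpleQA-MCQ": (1781, 4326),
    "gpt-4o-mini-2024-07-18, Chinese-SimpleQA-MCQ": (1586, 2997),
    "qwen-max, SimpleQA-MCQ": (2256, 4300),
    "qwen-max, Chinese-SimpleQA-MCQ": (2177, 2996),
}
for name, (correct, total) in results.items():
    print(f"{name}: {100 * correct / total:.1f}")
# prints 41.2, 52.9, 52.5, 72.7 -- matching the table
```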
Source
- Organization: huggingface
- Created: 12/6/2024