SimpleQA-Bench
SimpleQA‑Bench combines the SimpleQA and Chinese‑SimpleQA datasets into a multiple‑choice question (MCQ) format. The original datasets contain a large amount of long‑tail and niche knowledge, yielding low direct answer accuracy. To facilitate factuality evaluation, GPT‑4o generated three plausible yet incorrect options for each question, converting the QA pairs into MCQ format. A total of 7,324 samples were transformed, with fields including dataset name, metadata, question, answer, messages, options, and the correct option ID.
Source: huggingface
Created: Dec 6, 2024
Updated: Dec 17, 2024
Basic Information
- Language: English (en)
- License: MIT
- Tags: factuality, EN, ZH, short-form-answer, human-label
- Copyright: © 2024 alibaba‑pai
Data Sources
- SimpleQA: Blog & Paper / Data & simple‑evals Project
- Chinese‑SimpleQA: Blog & Paper, Data@HF
Dataset Description
- Format: Multiple‑choice question (MCQ)
- Processing: Merged the SimpleQA and Chinese‑SimpleQA collections, then converted each QA pair to MCQ format, with GPT‑4o generating three plausible but incorrect options per item.
- Sample Count: 4,326 (SimpleQA) + 2,998 (Chinese‑SimpleQA) = 7,324 samples
Data Fields
| Field | Description | SimpleQA Example | Chinese‑SimpleQA Example |
|---|---|---|---|
| dataset (str) | Dataset name | openai/SimpleQA | OpenStellarTeam/Chinese‑SimpleQA |
| metadata (str) | Metadata such as topics, answer type, and source URLs | {"topic": "Science and technology", "answer_type": "Person", "urls": ["https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://ieeexplore.ieee.org/author/37271220500", "https://en.wikipedia.org/wiki/IEEE_Frank_Rosenblatt_Award", "https://www.nxtbook.com/nxtbooks/ieee/awards_2010/index.php?startid=21#/p/20"]} | {"id": "6fd2645ad3994c89a01acae98cf04f90", "primary_category": "Natural and Physical Sciences", "secondary_category": "Information Science", "urls": ["https://zh.wikipedia.org/wiki/%E8%92%99%E7%89%B9%E5%8D%A1%E6%B4%9B%E6%A0%91%E6%90%9C%E7%B4%A2"]} |
| question (str) | Question text | Who received the IEEE Frank Rosenblatt Award in 2010? | Which researcher first explored Monte‑Carlo tree search in their 1987 PhD thesis and first presented its key characteristics? |
| answer (str) | Human‑verified short answer | Michio Sugeno | Bruce Abramson |
| messages (List[Dict]) | OpenAI‑style messages used for answering the MCQ | [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "# Objective ... Answers: "}] | same format |
| options (List[str]) | All answer options, indexed by IDs A/B/C/D | ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"] | ["Bruce Abramson", "Lennart Batsch‑Fischer", "Chris Watkins", "Martin Hansen"] |
| answer_option (str) | Correct option ID (A/B/C/D) | B | A |
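A minimal sketch of one record, built from the example values in the field table above. The field names come from the card; representing a record as a plain dict is an assumption for illustration.

```python
# Example SimpleQA-Bench record (values taken from the field table above;
# the dict representation is an assumption for illustration).
sample = {
    "dataset": "openai/SimpleQA",
    "question": "Who received the IEEE Frank Rosenblatt Award in 2010?",
    "answer": "Michio Sugeno",
    "options": ["Lotfi Zadeh", "Michio Sugeno", "John McCarthy", "Stephen Grossberg"],
    "answer_option": "B",
}

def option_text(record: dict) -> str:
    """Map the correct option ID (A-D) back to its answer text."""
    index = ord(record["answer_option"]) - ord("A")  # "B" -> 1
    return record["options"][index]

print(option_text(sample))  # Michio Sugeno
```

Note that `answer_option` always resolves to the same string as `answer`, which gives a cheap consistency check when iterating over the dataset.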
Prompts
- GEN_WA_PROMPT: Prompt used to generate MCQs, requesting three plausible incorrect answers.
- ANSWER_MCQ_PROMPT: Prompt used to answer MCQs, directing the model to select the correct option.
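The answering prompt can be sketched as a function that renders the options into an OpenAI-style `messages` list, as stored in the `messages` field. The exact wording of ANSWER_MCQ_PROMPT is not shown on this card, so the prompt text below is a hypothetical stand-in.

```python
# Hypothetical rendering of a question and its options into OpenAI-style
# messages. The actual ANSWER_MCQ_PROMPT wording is not shown on this card.
def build_mcq_messages(question: str, options: list[str]) -> list[dict]:
    # Label options A, B, C, D in order.
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    user_content = (
        "Answer the following multiple-choice question by replying with "
        "the single letter of the correct option.\n\n"
        f"Question: {question}\nOptions:\n{lettered}\nAnswer: "
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_content},
    ]
```

The model's reply is then compared against the `answer_option` field to score the item.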
Performance Comparison
Accuracy (%) on the original short‑answer datasets versus their MCQ versions; parenthesized values are correct/total counts.
| LLM | SimpleQA (4,326) | SimpleQA‑MCQ | Chinese‑SimpleQA (2,998) | Chinese‑SimpleQA‑MCQ |
|---|---|---|---|---|
| gpt‑4o‑mini‑2024‑07‑18 | 9.5 | 41.2 (1,781/4,326) | 37.6 | 52.9 (1,586/2,997) |
| qwen‑max | / | 52.5 (2,256/4,300) | 54.1 | 72.7 (2,177/2,996) |
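The percentages above follow directly from the parenthesized correct/total counts, rounded to one decimal place:

```python
# Recompute the MCQ accuracy percentages in the table from their raw
# correct/total counts (a sanity check of the reported figures).
def accuracy(correct: int, total: int) -> float:
    return round(100 * correct / total, 1)

print(accuracy(1781, 4326))  # 41.2  (gpt-4o-mini, SimpleQA-MCQ)
print(accuracy(1586, 2997))  # 52.9  (gpt-4o-mini, Chinese-SimpleQA-MCQ)
print(accuracy(2256, 4300))  # 52.5  (qwen-max, SimpleQA-MCQ)
print(accuracy(2177, 2996))  # 72.7  (qwen-max, Chinese-SimpleQA-MCQ)
```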