derek-thomas/ScienceQA

ScienceQA is a multimodal, multiple-choice science question-answering dataset. Its questions span natural science, language science, and social science, with tags covering chemistry, biology, physics, earth science, engineering, geography, history, civics, economics, global studies, grammar, writing, and vocabulary. Each instance provides an image, question, multiple-choice options, answer, hint, task description, grade level, subject, topic, category, skill, lecture, and solution. The dataset supports question answering (multiple-choice, closed-domain, open-domain), visual question answering, and multi-class classification. It was created to diagnose the multi-hop reasoning capability and explainability of AI systems on science questions. The language is English, and the dataset falls in the 10K to 100K size category, with 21,208 instances split into training, validation, and test sets.

Updated 2/25/2023

hugging_face

Description

Dataset Overview

Dataset Name: ScienceQA

Dataset Size: 27263474 bytes

Download Size: 0 bytes

Language: English

Multilinguality: Monolingual

License: CC-BY-SA-4.0

Task Categories:

Multiple Choice
Question Answering
Other
Visual Question Answering
Text Classification

Task IDs:

Multiple Choice QA
Closed-domain QA
Open-domain QA
Visual QA
Multi‑class Classification

Tags:

Multimodal QA
Science
Chemistry
Biology
Physics
Earth Science
Engineering
Geography
History
World History
Civics
Economics
Global Studies
Grammar
Writing
Vocabulary
Natural Science
Language Science
Social Science

Dataset Structure

Data Instances: Each instance contains the following fields:

image: Context image
question: Prompt relating to the lecture
choices: Multiple-choice options for the question, exactly one of which is correct
answer: Index of the correct option in choices
hint: Hint to help answer the question
task: Task description
grade: K‑12 grade level
subject: High‑level subject
topic: Natural Science, Social Science, or Language Science
category: Sub‑category of topic
skill: Description of the task requirement
lecture: Lecture from which the question was generated
solution: Explanation of how to solve the question
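
As a rough illustration of how these fields fit together, the sketch below renders one instance as a lettered multiple-choice prompt and looks up the text of the correct option. The sample record is invented for illustration and is not taken from the dataset, the helper names answer_text and format_prompt are hypothetical, and the code assumes that answer is a zero-based index into choices.

```python
from typing import Any, Dict

# Hypothetical record, invented for illustration; not a real dataset row.
sample: Dict[str, Any] = {
    "image": None,  # no context image for this made-up example
    "question": "Which of these is a solid?",
    "choices": ["water in a glass", "a rock", "air in a balloon"],
    "answer": 1,  # assumed to be a zero-based index into `choices`
    "hint": "",
}


def answer_text(instance: Dict[str, Any]) -> str:
    """Return the text of the correct option (assumes a zero-based index)."""
    return instance["choices"][instance["answer"]]


def format_prompt(instance: Dict[str, Any]) -> str:
    """Render the hint (if any), the question, and lettered options."""
    lines = []
    if instance.get("hint"):
        lines.append(f"Hint: {instance['hint']}")
    lines.append(f"Question: {instance['question']}")
    for i, choice in enumerate(instance["choices"]):
        lines.append(f"({chr(ord('A') + i)}) {choice}")
    return "\n".join(lines)


print(format_prompt(sample))
print("Correct:", answer_text(sample))
```

Keeping the answer as an index rather than as text means the options can be reordered or relabeled without touching the question itself, as long as the index is updated alongside them.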

Data Splits:

train: 12 726 instances, 16 416 902 bytes
validation: 4 241 instances, 5 404 896 bytes
test: 4 241 instances, 5 441 676 bytes
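
The split figures can be cross-checked with a few lines of arithmetic. The sketch below simply sums the counts listed above and reports each split's share of the instances; the dictionary layout is an arbitrary choice for illustration.

```python
# Split sizes as listed in this card.
splits = {
    "train": {"instances": 12_726, "bytes": 16_416_902},
    "validation": {"instances": 4_241, "bytes": 5_404_896},
    "test": {"instances": 4_241, "bytes": 5_441_676},
}

total_instances = sum(s["instances"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

for name, s in splits.items():
    share = s["instances"] / total_instances
    print(f"{name}: {s['instances']} instances ({share:.1%})")

print("total instances:", total_instances)
print("total bytes:", total_bytes)
```

The per-split byte counts add up to the 27263474 bytes reported as the dataset size, and the instance counts correspond to an approximately 60/20/20 split.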

Dataset Creation

Source Data: The dataset was collected from science curricula of elementary and secondary schools.

Annotation Process: Questions were sourced from open resources of IXL Learning, which are curated by experts in K-12 education. The dataset includes questions that align with California Common Core standards. The original science questions were downloaded, and their components (question, hint, image, options, answer, lecture, solution) were extracted using heuristic rules. Invalid questions, such as those with a single option, erroneous data, or duplicates, were manually removed to comply with fair-use and transformative-use legal requirements. If a question had multiple correct answers, only one was retained. Answer options were shuffled to avoid systematic patterns. Semi-automatic scripts reformatted the lectures and solutions so that special structures (tables, lists) in the text are distinguishable from plain paragraphs.
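
The filtering and shuffling steps described above can be sketched roughly as follows. This is not the authors' pipeline: the card states that invalid questions were removed manually, whereas this sketch automates comparable checks. The function name clean_and_shuffle, the duplicate criterion (same question text and same set of options), and the zero-based answer index are all assumptions made for illustration.

```python
import random
from typing import Any, Dict, List


def clean_and_shuffle(records: List[Dict[str, Any]], seed: int = 0) -> List[Dict[str, Any]]:
    """Drop invalid or duplicate questions, then shuffle each option list.

    Hypothetical helper: the validity and duplicate rules are assumptions.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    seen = set()
    cleaned = []
    for rec in records:
        choices = rec["choices"]
        # Skip questions with a single option (or none).
        if len(choices) < 2:
            continue
        # Skip erroneous data: an answer index that points outside the options.
        if not 0 <= rec["answer"] < len(choices):
            continue
        # Assumed duplicate rule: same question text and same set of options.
        key = (rec["question"], tuple(sorted(choices)))
        if key in seen:
            continue
        seen.add(key)
        # Shuffle a copy of the options and re-derive the answer index.
        correct = choices[rec["answer"]]
        shuffled = list(choices)
        rng.shuffle(shuffled)
        cleaned.append({**rec, "choices": shuffled, "answer": shuffled.index(correct)})
    return cleaned


# Made-up records for illustration only.
records = [
    {"question": "Q1", "choices": ["a", "b", "c"], "answer": 2},
    {"question": "Q1", "choices": ["c", "b", "a"], "answer": 0},  # duplicate of the first
    {"question": "Q2", "choices": ["only"], "answer": 0},  # single option
    {"question": "Q3", "choices": ["x", "y"], "answer": 5},  # index out of range
]
result = clean_and_shuffle(records)
print(len(result), result[0]["choices"][result[0]["answer"]])
```

Re-deriving the answer index from the option text after shuffling keeps the label aligned with the reordered options, which is the property that matters when options are shuffled to avoid positional patterns.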

Annotators: Expert‑generated and discovered.

Topics

Scientific QA

Multimodal Reasoning

Source

Organization: hugging_face

Created: Unknown
