derek-thomas/ScienceQA
The ScienceQA dataset is a multimodal science question‑answering collection covering chemistry, biology, physics, earth science, engineering, geography, history, civics, economics, global studies, grammar, writing, vocabulary, natural science, language science, and social science. Each record includes an image, a question, multiple‑choice options, the answer, a hint, a task description, and the grade level, subject, topic, category, skill, lecture, and solution. It is primarily intended for multimodal multiple‑choice tasks and supports question answering (multiple choice, closed‑domain, open‑domain), visual question answering, and multi‑class classification. The dataset was created to diagnose the multi‑hop reasoning ability and explainability of AI systems on science questions. It is in English, falls in the 10K to 100K instance size category, and is split into training, validation, and test sets.
Description
Dataset Overview
Dataset Name: ScienceQA
Dataset Size: 27263474 bytes
Download Size: 0 bytes
Language: English
Multilinguality: Monolingual
License: CC-BY-SA-4.0
Task Categories:
- Multiple Choice
- Question Answering
- Other
- Visual Question Answering
- Text Classification
Task IDs:
- Multiple Choice QA
- Closed-domain QA
- Open-domain QA
- Visual QA
- Multi‑class Classification
Tags:
- Multimodal QA
- Science
- Chemistry
- Biology
- Physics
- Earth Science
- Engineering
- Geography
- History
- World History
- Civics
- Economics
- Global Studies
- Grammar
- Writing
- Vocabulary
- Natural Science
- Language Science
- Social Science
Dataset Structure
Data Instances: Each instance contains the following fields:
- image: Context image
- question: Prompt related to lecture
- choices: Multiple‑choice answer options associated with question (one correct)
- answer: Index of the correct option
- hint: Hint to help answer question
- task: Task description
- grade: K‑12 grade level
- subject: High‑level subject
- topic: Natural Science, Social Science, or Language Science
- category: Sub‑category of topic
- skill: Description of the task requirement
- lecture: Lecture related to the generation of question
- solution: Explanation for solving question
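As a rough illustration of how these fields fit together, the sketch below renders one record as a multiple‑choice prompt and looks up the correct option. The record values are invented for illustration and are not taken from the dataset; the helper names `format_prompt` and `correct_choice` are hypothetical, and the code assumes that answer is a zero‑based index into choices and that image may be absent.

```python
# Hypothetical record shaped like the field list above.
# All values are made up for illustration; they are not from the dataset.
example = {
    "image": None,  # assumption: some records may have no image
    "question": "Which of these is a solid?",
    "choices": ["water in a glass", "a brick", "air in a balloon"],
    "answer": 1,  # assumption: zero-based index into "choices"
    "hint": "",
    "task": "closed choice",
    "grade": "grade2",
    "subject": "natural science",
    "topic": "physics",
    "category": "States of matter",
    "skill": "Identify solids, liquids, and gases",
    "lecture": "Solid, liquid, and gas are states of matter.",
    "solution": "A brick has a shape of its own, so it is a solid.",
}


def format_prompt(record):
    """Render the hint (if any), the question, and lettered options."""
    lines = []
    if record.get("hint"):
        lines.append(f"Hint: {record['hint']}")
    lines.append(f"Question: {record['question']}")
    for letter, choice in zip("ABCDEFGHIJ", record["choices"]):
        lines.append(f"({letter}) {choice}")
    return "\n".join(lines)


def correct_choice(record):
    """Return the text of the option that "answer" points to."""
    return record["choices"][record["answer"]]


print(format_prompt(example))
print("Correct:", correct_choice(example))
```

Keeping answer as an index rather than as text means the options can be reordered or relabeled without touching the option strings, as long as the index is updated alongside them.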
Data Splits:
- train: 12,726 instances, 16,416,902 bytes
- validation: 4,241 instances, 5,404,896 bytes
- test: 4,241 instances, 5,441,676 bytes
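The split figures can be cross‑checked with a few lines of arithmetic. The sketch below only uses the numbers listed on this page; it sums the per‑split instance counts and byte sizes and computes each split's share of the instances.

```python
# Per-split figures copied from the Data Splits list above.
splits = {
    "train": {"instances": 12_726, "bytes": 16_416_902},
    "validation": {"instances": 4_241, "bytes": 5_404_896},
    "test": {"instances": 4_241, "bytes": 5_441_676},
}

total_instances = sum(s["instances"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

# Share of instances per split, as a percentage rounded to one decimal.
shares = {
    name: round(100 * s["instances"] / total_instances, 1)
    for name, s in splits.items()
}

print(total_instances)  # 21208
print(total_bytes)      # 27263474, matching the listed Dataset Size
print(shares)           # {'train': 60.0, 'validation': 20.0, 'test': 20.0}
```

The byte total agrees with the Dataset Size reported in the overview, and the instance counts correspond to roughly a 60/20/20 split.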
Dataset Creation
Source Data: The dataset was collected from science curricula of elementary and secondary schools.
Annotation Process: Questions were sourced from open resources of IXL Learning and managed by experts in K‑12 education. The dataset includes questions that align with California Common Core standards. The original science questions were downloaded, and their components (question, hint, image, options, answer, lecture, solution) were extracted using heuristic rules. Invalid questions, such as those with a single option, erroneous data, or duplicates, were manually removed to comply with fair‑use and transformative‑use legal requirements. If a question had multiple correct answers, only one was retained. Answer options were shuffled to avoid systematic patterns. Semi‑automatic scripts re‑formatted lectures and solutions so that special structures (tables, lists) in the text were distinguishable from plain paragraphs.
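The cleaning steps described above can be sketched in code. This is not the authors' pipeline: the page says invalid questions were removed manually, and the actual scripts are not shown here. The function names (`is_valid`, `drop_duplicates`, `shuffle_choices`, `clean`), the duplicate criterion (same question text and same set of options), and the zero‑based answer index are all assumptions made for illustration. Retaining a single answer when several were correct is not modeled.

```python
import random


def is_valid(q):
    """Reject single-option questions and out-of-range answer indices."""
    n = len(q["choices"])
    return n >= 2 and 0 <= q["answer"] < n


def drop_duplicates(questions):
    """Keep the first question for each (text, set of options) pair.

    Assumption: two questions are duplicates when the question text and
    the options match, regardless of option order.
    """
    seen, kept = set(), []
    for q in questions:
        key = (q["question"], tuple(sorted(q["choices"])))
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept


def shuffle_choices(q, rng):
    """Shuffle the options and move the answer index along with them."""
    order = list(range(len(q["choices"])))
    rng.shuffle(order)  # order[new_position] == old_position
    return {
        **q,
        "choices": [q["choices"][i] for i in order],
        "answer": order.index(q["answer"]),
    }


def clean(questions, seed=0):
    """Filter invalid questions, drop duplicates, then shuffle options."""
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    valid = [q for q in questions if is_valid(q)]
    return [shuffle_choices(q, rng) for q in drop_duplicates(valid)]
```

Shuffling with an explicit permutation makes it straightforward to keep the answer index pointing at the same option text after the reorder.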
Annotators: Expert‑generated and discovered.
Source
Organization: hugging_face
Created: Unknown