derek-thomas/ScienceQA

The ScienceQA dataset is a multimodal, multiple-choice science question-answering collection spanning natural science, language science, and social science, with tags that include chemistry, biology, physics, earth science, engineering, geography, history, civics, economics, global studies, grammar, writing, and vocabulary. Each instance provides an image, a question, multiple-choice options, an answer, a hint, a task description, a grade level, a subject, a topic, a category, a skill, a lecture, and a solution. The dataset supports question answering (multiple-choice, closed-domain, and open-domain), visual question answering, and multi-class classification. It was created to diagnose the multi-hop reasoning ability and explainability of AI systems on science questions. The language is English, and the dataset falls in the 10K to 100K size category, with 21 208 instances in total across the training, validation, and test splits.

Updated 2/25/2023
hugging_face

Description

Dataset Overview

Dataset Name: ScienceQA

Dataset Size: 27 263 474 bytes

Download Size: 0 bytes

Language: English

Multilinguality: Monolingual

License: CC-BY-SA-4.0

Task Categories:

  • Multiple Choice
  • Question Answering
  • Other
  • Visual Question Answering
  • Text Classification

Task IDs:

  • Multiple Choice QA
  • Closed-domain QA
  • Open-domain QA
  • Visual QA
  • Multi‑class Classification

Tags:

  • Multimodal QA
  • Science
  • Chemistry
  • Biology
  • Physics
  • Earth Science
  • Engineering
  • Geography
  • History
  • World History
  • Civics
  • Economics
  • Global Studies
  • Grammar
  • Writing
  • Vocabulary
  • Natural Science
  • Language Science
  • Social Science

Dataset Structure

Data Instances: Each instance contains the following fields:

  • image: Context image
  • question: Prompt related to lecture
  • choices: Multiple‑choice answer options associated with question (one correct)
  • answer: Index of the correct option
  • hint: Hint to help answer question
  • task: Task description
  • grade: K‑12 grade level
  • subject: High‑level subject
  • topic: Natural Science, Social Science, or Language Science
  • category: Sub‑category of topic
  • skill: Description of the task requirement
  • lecture: Lecture related to the generation of question
  • solution: Explanation for solving question
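
The field list above can be made concrete with a short Python sketch. The record below is invented for illustration and is not taken from the dataset; it carries only a subset of the fields. The helper names `format_prompt` and `correct_choice` are hypothetical, and the sketch assumes that `answer` is a zero-based index into `choices`.

```python
from string import ascii_uppercase

# Hypothetical record, invented for illustration (not a real dataset row).
# Only a subset of the fields listed above is included.
example = {
    "question": "Which of these is a solid?",
    "choices": ["water in a glass", "a brick", "air in a balloon"],
    "answer": 1,  # assumed to be a zero-based index into "choices"
    "hint": "",
    "lecture": "A solid has a definite shape and volume.",
    "solution": "A brick keeps its own shape, so it is a solid.",
}


def format_prompt(record):
    """Render the hint (if any), the question, and lettered options as text."""
    lines = []
    if record.get("hint"):
        lines.append(f"Hint: {record['hint']}")
    lines.append(f"Question: {record['question']}")
    for letter, choice in zip(ascii_uppercase, record["choices"]):
        lines.append(f"({letter}) {choice}")
    return "\n".join(lines)


def correct_choice(record):
    """Look up the text of the correct option through the answer index."""
    return record["choices"][record["answer"]]


print(format_prompt(example))
print("Correct:", correct_choice(example))
```

Assuming the Hugging Face `datasets` library is installed, records of this shape would typically be obtained with `load_dataset("derek-thomas/ScienceQA")`; the hand-written record keeps the sketch runnable offline.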

Data Splits:

  • train: 12 726 instances, 16 416 902 bytes
  • validation: 4 241 instances, 5 404 896 bytes
  • test: 4 241 instances, 5 441 676 bytes
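
As a quick consistency check on the figures above, the following Python sketch sums the split sizes and computes each split's share of the instances. The numbers are copied from this page; the variable names are illustrative.

```python
# (instances, bytes) per split, as listed above.
splits = {
    "train": (12_726, 16_416_902),
    "validation": (4_241, 5_404_896),
    "test": (4_241, 5_441_676),
}

total_instances = sum(n for n, _ in splits.values())
total_bytes = sum(b for _, b in splits.values())

# Fraction of all instances that falls into each split.
shares = {name: n / total_instances for name, (n, _) in splits.items()}

print(total_instances, total_bytes)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits add up to 21 208 instances and 27 263 474 bytes, which matches the Dataset Size reported in the overview, and the proportions come out to roughly 60/20/20.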

Dataset Creation

Source Data: The dataset was collected from science curricula of elementary and secondary schools.

Annotation Process: Questions were sourced from open resources of IXL Learning, which are managed by experts in K-12 education, and they align with California Common Core standards. The original science questions were downloaded, and heuristic rules were used to extract their components (question, hint, image, options, answer, lecture, solution). Invalid questions, such as those with a single option, erroneous data, or duplicates, were removed manually to comply with fair use and transformative use under the law. Where a question had multiple correct answers, only one was retained. Answer options were shuffled to avoid systematic patterns. Semi-automatic scripts reformatted lectures and solutions so that special structures in the text (tables, lists) were distinguishable from plain paragraphs.
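
Some of the cleaning steps described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' pipeline: the source says invalid questions were removed manually, whereas the hypothetical helper `clean_and_shuffle` automates the single-option and duplicate checks, treats two records as duplicates when they share a question and the same set of options, and assumes `answer` is a zero-based index.

```python
import random


def clean_and_shuffle(records, seed=0):
    """Drop single-option and duplicate questions, then shuffle the options.

    Hypothetical helper for illustration; the answer index is remapped so it
    still points at the correct option after shuffling.
    """
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    seen = set()
    cleaned = []
    for rec in records:
        choices = list(rec["choices"])
        if len(choices) < 2:
            continue  # a question with a single option is invalid
        key = (rec["question"], tuple(sorted(choices)))
        if key in seen:
            continue  # same question and option set: treat as duplicate
        seen.add(key)
        order = list(range(len(choices)))
        rng.shuffle(order)  # order[new_position] == old_position
        cleaned.append({
            **rec,
            "choices": [choices[i] for i in order],
            "answer": order.index(rec["answer"]),
        })
    return cleaned


raw = [
    {"question": "Q1", "choices": ["a", "b", "c"], "answer": 2},
    {"question": "Q1", "choices": ["c", "a", "b"], "answer": 0},  # duplicate
    {"question": "Q2", "choices": ["only"], "answer": 0},  # single option
]
for rec in clean_and_shuffle(raw):
    print(rec["question"], rec["choices"], rec["choices"][rec["answer"]])
```

Shuffling a list of positions rather than the option texts keeps the remapping correct even if two options happen to have identical text.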

Annotators: Expert-generated and found (collected from existing sources).

Topics

Scientific QA
Multimodal Reasoning

Source

Organization: hugging_face

Created: Unknown
