commonsense_qa

CommonsenseQA is a multiple-choice question answering dataset whose questions require various types of commonsense knowledge to answer correctly. The dataset provides two main train/validation/test splits: the 'random split' and the 'question-label split' (see the paper for details). It contains a training set (9,741 samples), a validation set (1,221 samples), and a test set (1,140 samples). Each sample includes a unique ID, the question text, a question concept, the answer options (each with a label and text), and an answer key. The dataset is in English and is released under the MIT license.

Source

huggingface

Created

Jul 22, 2024

Updated

Aug 5, 2024

Dataset Overview

Dataset Description

Name: CommonsenseQA
Language: English (en)
License: MIT
Multilinguality: Monolingual
Size Category: 1K<n<10K
Source Dataset: Original data
Task Category: Question Answering
Task ID: Open‑domain QA
PapersWithCode ID: commonsenseqa
Alias: CommonsenseQA

Dataset Structure

Features

id (string): Unique ID
question (string): Question text
question_concept (string): ConceptNet concept related to the question
choices (dictionary): Answer options
- label (string): Option label
- text (string): Option text
answerKey (string): Label of the correct option
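
To make the feature layout above concrete, here is a minimal sketch that turns one record into a multiple-choice prompt and looks up the text of the correct option. The record is invented for illustration and is not taken from the dataset. The dict-of-parallel-lists layout for choices is an assumption about how the sample is represented once loaded, and the helper names format_prompt and answer_text are hypothetical.

```python
from typing import List, Optional, TypedDict


class Choices(TypedDict):
    label: List[str]  # option labels, assumed parallel to `text`
    text: List[str]   # option texts


class Record(TypedDict):
    id: str
    question: str
    question_concept: str
    choices: Choices
    answerKey: str


# Invented record for illustration only; it is NOT a real dataset sample.
EXAMPLE: Record = {
    "id": "example-0001",
    "question": "Where would you most likely hang a wet coat?",
    "question_concept": "coat",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["refrigerator", "coat rack", "bookshelf", "mailbox", "oven"],
    },
    "answerKey": "B",
}


def format_prompt(record: Record) -> str:
    """Render the question and its labeled options as plain text."""
    lines = [f"Question: {record['question']}"]
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    for label, text in zip(labels, texts):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_text(record: Record) -> Optional[str]:
    """Return the text of the correct option, or None if no key matches."""
    labels = record["choices"]["label"]
    if record["answerKey"] not in labels:
        return None  # e.g. an empty or missing answer key
    return record["choices"]["text"][labels.index(record["answerKey"])]


if __name__ == "__main__":
    print(format_prompt(EXAMPLE))
    print(answer_text(EXAMPLE))
```

Returning None for an unmatched key keeps the helper usable on samples whose answer key is empty or withheld, rather than raising.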

Splits

train
- Bytes: 2,207,794
- Samples: 9,741
validation
- Bytes: 273,848
- Samples: 1,221
test
- Bytes: 257,842
- Samples: 1,140
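
As a quick sanity check on the figures above, the following sketch computes the total sample count, each split's share of the total, and the average bytes per sample. The numbers are copied from the split listing; the names SPLITS and split_summary are hypothetical.

```python
# Split sizes copied from the listing above: (bytes, samples).
SPLITS = {
    "train": (2_207_794, 9_741),
    "validation": (273_848, 1_221),
    "test": (257_842, 1_140),
}


def split_summary(splits):
    """Return per-split share of samples and mean bytes per sample."""
    total = sum(samples for _, samples in splits.values())
    summary = {}
    for name, (num_bytes, samples) in splits.items():
        summary[name] = {
            "share": samples / total,
            "bytes_per_sample": num_bytes / samples,
        }
    return total, summary


if __name__ == "__main__":
    total, summary = split_summary(SPLITS)
    print(f"total samples: {total}")
    for name, stats in summary.items():
        print(
            f"{name}: {stats['share']:.1%} of samples, "
            f"{stats['bytes_per_sample']:.0f} bytes/sample"
        )
```

The three splits sum to 12,102 samples, with the training set making up roughly four fifths of the total.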

Configurations

default
- Data files:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
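
The configuration above can be used from Python roughly as follows. This is a sketch under assumptions: it relies on the Hugging Face datasets library being installed and on the dataset being reachable under the repository ID commonsense_qa (the Hub may list it under a namespaced ID instead, so the ID is left as a parameter). The names DATA_FILES and load_split are hypothetical.

```python
# File patterns of the "default" configuration, as listed above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def load_split(split: str, repo_id: str = "commonsense_qa"):
    """Load one split via the `datasets` library (requires network access).

    `repo_id` is an assumption; adjust it to the ID the Hub actually uses.
    """
    if split not in DATA_FILES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(DATA_FILES)}"
        )
    # Imported lazily so the module can be inspected without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


if __name__ == "__main__":
    validation = load_split("validation")
    print(validation)
```

Validating the split name before importing the library gives a clear error for typos such as "dev" without triggering a download.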

Dataset Creation

License Information

The dataset is released under the MIT license.

Citation Information

@inproceedings{talmor-etal-2019-commonsenseqa,
    title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
    author = "Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1421",
    doi = "10.18653/v1/N19-1421",
    pages = "4149--4158",
    archivePrefix = "arXiv",
    eprint = "1811.00937",
    primaryClass = "cs",
}
