The BiomixQA dataset is a biomedical question answering collection featuring two question types: multiple‑choice and true/false. It is used to evaluate the performance of knowledge‑graph‑enhanced retrieval‑augmented generation (KG‑RAG) frameworks across various large language models (LLMs). The dataset’s diversity lies in question formats and the covered biomedical concepts, making it especially suitable for assessing KG‑RAG performance. Additionally, the dataset supports research and development in biomedical NLP, knowledge graph reasoning, and QA systems. Sources include multiple biomedical knowledge graphs and databases such as SPOKE, DisGeNET, MONDO, SemMedDB, Monarch Initiative, and ROBOKOP.
This dataset is used for visual question answering and QA tasks, supporting both Chinese and English. It includes multiple configurations, such as `ai2d_train_12k` and `chartqa_train_18k`, each corresponding to a different set of training data files.
This dataset contains single‑turn dialogues with SMILES molecular descriptions, formatted as JSON and including SMILES strings with their corresponding textual descriptions. The dataset is split into training, validation, and test sets containing 264,391, 33,072, and 32,987 samples respectively. Dialogue templates consist of human queries and GPT‑generated molecule descriptions. Additionally, 14 query templates are provided for generating the query portion of the dialogues.
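The single-turn layout described above can be sketched as follows. The field names (`conversations`, `from`, `value`) and the template wording are assumptions for illustration; the card states that 14 query templates exist but does not publish them.

```python
import json

# Hypothetical query templates; the dataset ships 14 of these, but their
# exact wording is not given in the card.
QUERY_TEMPLATES = [
    "Could you describe the molecule {smiles}?",
    "What can you tell me about {smiles}?",
]

def build_dialogue(smiles: str, description: str, template_id: int = 0) -> dict:
    """Assemble one single-turn dialogue record (assumed schema)."""
    return {
        "conversations": [
            {"from": "human",
             "value": QUERY_TEMPLATES[template_id].format(smiles=smiles)},
            {"from": "gpt", "value": description},
        ]
    }

record = build_dialogue("CCO", "Ethanol is a simple primary alcohol.")
print(json.dumps(record, indent=2))
```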
This foundational dataset is a collection of question‑answer pairs focused on the cybersecurity domain, primarily concerning threat hunting, threat intelligence, and malware content. The answers in the foundational dataset are concise, roughly 10% the length of those in the main dataset. The Q‑A pairs are generated from 2023–2024 data and selected semi‑randomly. The (unreleased) main dataset is expected to contain about 75,000–80,000 Q‑A pairs on its launch day, covering data from 2020 to present, with approximately 500 new pairs added weekly, and its answers are more detailed than those in the foundational dataset.
This dataset is primarily used for question answering and sentence similarity tasks in the biomedical domain. It includes two configurations: text‑corpus and question‑answer‑passages, each corresponding to different data file paths. The dataset originates from the training set of BioASQ Task 11b and subsets were generated using the `generate.py` script.
CommonsenseQA is a new multiple‑choice QA dataset that requires using various types of commonsense knowledge to predict the correct answer. The dataset provides two main train/validation/test splits: 'random split' and 'question‑label split' (see the paper for details). It contains a training set (9,741 samples), a validation set (1,221 samples), and a test set (1,140 samples). Each sample includes a unique ID, question text, question concept, options (label and text), and an answer key. The dataset is in English and is released under the MIT license.
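A sample with the fields listed above might look like the following sketch. The concrete ID, question, and options are invented for illustration; only the field layout follows the card's description.

```python
# A hypothetical sample in the layout described above.
sample = {
    "id": "075e483d21c29a511267ef62bedc0461",  # invented ID
    "question": "Where would you put a plate after washing it?",
    "question_concept": "plate",
    "choices": {"label": ["A", "B", "C", "D", "E"],
                "text": ["cupboard", "table", "floor", "oven", "sink"]},
    "answerKey": "A",
}

def answer_text(sample: dict) -> str:
    """Map the answerKey label to its option text."""
    idx = sample["choices"]["label"].index(sample["answerKey"])
    return sample["choices"]["text"][idx]
```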
The dataset is used for the QMSum task and contains two features: text content and answer length. It is split into a training set with 1,257 samples and a test set with 200 samples. The test set originates from the LongBench QMSum task, while the training set comes from the original QMSum repository. No built‑in validation set is provided; it is recommended to partition a portion of the training set for validation.
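The recommended held-out validation split can be carved from the 1,257 training samples along these lines; the 10% fraction and the seed are arbitrary choices, not part of the dataset.

```python
import random

def carve_validation(train_ids, val_fraction=0.1, seed=0):
    """Hold out a fraction of the training set as validation,
    since the dataset ships no built-in validation split."""
    rng = random.Random(seed)
    ids = list(train_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]

train_ids, val_ids = carve_validation(range(1257))
```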
This dataset contains model prediction results generated by AutoTrain, used to evaluate extractive question answering tasks on the SQuAD v2 dataset. The model is MYX4567/distilbert-base-uncased-finetuned-squad, the dataset configuration is squad_v2, and the dataset split is the validation set.
---
pretty_name: WebGPT Comparisons
---

# Dataset Card for WebGPT Comparisons

## Dataset Description

In the [WebGPT paper](https://arxiv.org/abs/2112.09332), the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.

Each example in the dataset contains a pair of model answers for a question, and the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers is better. Overall, an example has the following fields:

* `question`: The text of the question, together with the name of the dataset from which it was taken and a unique ID.
* `quotes_0`: The extracts that the model found while browsing for `answer_0`, together with the title of the page on which the extract was found, constructed from the HTML title and domain name of the page.
* `answer_0`: The final answer that the model composed using `quotes_0`.
* `tokens_0`: The prefix that would have been given to the model in the final step of the episode to create `answer_0`, and the completion given by the model or human. The prefix is made up of the question and the quotes, with some truncation, and the completion is simply the answer. Both are tokenized using the GPT-2 tokenizer. The concatenation of the prefix and completion is the input used for reward modeling.
* `score_0`: The strength of the preference for `answer_0` over `answer_1` as a number from −1 to 1. It sums to 0 with `score_1`, and an answer is preferred if and only if its score is positive. For reward modeling, we treat scores of 0 as soft 50% labels, and all other scores as hard labels (using only their sign).
* `quotes_1`: The counterpart to `quotes_0`.
* `answer_1`: The counterpart to `answer_0`.
* `tokens_1`: The counterpart to `tokens_0`.
* `score_1`: The counterpart to `score_0`.

This information was found in Appendix K of the WebGPT paper.

## Citation Information

[https://arxiv.org/abs/2112.09332](https://arxiv.org/abs/2112.09332)

```
@inproceedings{nakano2021webgpt,
  author = {Reiichiro Nakano and Jacob Hilton and Suchir Balaji and Jeff Wu and Long Ouyang and Christina Kim and Christopher Hesse and Shantanu Jain and Vineet Kosaraju and William Saunders and Xu Jiang and Karl Cobbe and Tyna Eloundou and Gretchen Krueger and Kevin Button and Matthew Knight and Benjamin Chess and John Schulman},
  title = {WebGPT: Browser-assisted question-answering with human feedback},
  booktitle = {arXiv},
  year = 2021,
}
```

Dataset added to the Hugging Face Hub by [@Tristan](https://huggingface.co/Tristan) and [@natolambert](https://huggingface.co/natolambert)
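The labeling rule the card describes for reward modeling (a score of 0 becomes a soft 50% label, any other score a hard label by its sign) can be sketched as a small helper; the function name and return convention are our own.

```python
def preference_labels(score_0: float):
    """Convert a comparison score into reward-model training labels.

    Scores sum to zero across the pair: a score of 0 yields a soft
    50% label, any other score a hard label determined by its sign.
    Returns (p_answer_0_preferred, p_answer_1_preferred).
    """
    if score_0 == 0:
        return (0.5, 0.5)
    return (1.0, 0.0) if score_0 > 0 else (0.0, 1.0)
```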
The PQAref dataset is a reference question‑answering dataset for the biomedical domain, designed for fine‑tuning large language models. It comprises three components: an instruction (question), abstracts (relevant abstracts retrieved from PubMed, including PubMed ID, abstract title, and content), and an answer (expected answer with references in PubMed ID format). The dataset was created semi‑automatically, leveraging questions from the PubMedQA dataset.
A publicly available dataset and method for educational knowledge graph question answering. The dataset will be fully released after the paper is accepted.
The qa_wikipedia dataset is a question‑answering dataset containing multiple documents extracted from Wikipedia along with associated questions. Features include document ID, title, context, question, answer start position, answer text, and the full article. The dataset is split into training, test, and validation subsets for different modeling stages.
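Given the SQuAD-style fields listed above, a basic sanity check is that the answer text actually occurs at the recorded start offset in the context. The field names and the sample record below are assumptions for illustration.

```python
def answer_aligned(example: dict) -> bool:
    """Check that the answer text occurs at answer_start in the context
    (assumed SQuAD-style field names)."""
    start = example["answer_start"]
    span = example["context"][start:start + len(example["answer_text"])]
    return span == example["answer_text"]

example = {  # hypothetical record following the fields listed above
    "doc_id": "doc-001",
    "title": "Python (programming language)",
    "context": "Python was created by Guido van Rossum.",
    "question": "Who created Python?",
    "answer_start": 22,
    "answer_text": "Guido van Rossum",
}
```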
SimpleQA‑Bench combines the SimpleQA and Chinese‑SimpleQA datasets into a multiple‑choice question (MCQ) format. The original datasets contain a large amount of long‑tail and niche knowledge, yielding low direct answer accuracy. To facilitate factuality evaluation, GPT‑4o generated three plausible yet incorrect options for each question, converting the QA pairs into MCQ format. A total of 7,324 samples were transformed, with fields including dataset name, metadata, question, answer, messages, options, and the correct option ID.
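A record in this MCQ format, and the exact-match scoring it enables, might look like the sketch below. The concrete field names and the sample question are assumptions; only the overall structure (one correct option plus three GPT-4o distractors, identified by a correct option ID) follows the description above.

```python
# Hypothetical record shaped like the fields listed above.
record = {
    "dataset": "simpleqa",
    "question": "Which river flows through Paris?",
    "options": ["Loire", "Seine", "Rhone", "Garonne"],  # 1 correct + 3 distractors
    "answer_idx": 1,  # ID of the correct option
}

def score(record: dict, predicted_idx: int) -> bool:
    """Exact-match MCQ scoring: correct iff the prediction selects answer_idx."""
    return predicted_idx == record["answer_idx"]
```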
This dataset is suitable for text generation and question‑answering tasks, primarily in Chinese. It contains two main fields, `conversations` and `tools`; `conversations` is a list of objects with string fields `from` and `value`, and `tools` is a string field. The dataset size ranges from 1K to 10K entries and is released under the Apache 2.0 license. It can be used in LLaMA Factory by specifying `dataset: glaive_toolcall_zh`.
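As a rough sketch of how the `dataset: glaive_toolcall_zh` usage note might look in a LLaMA Factory training config: everything below except the `dataset` key (the base model, template, and output path) is a hypothetical placeholder.

```yaml
# Minimal LLaMA Factory SFT config sketch; only `dataset` comes from the card.
model_name_or_path: Qwen/Qwen2-7B-Instruct  # hypothetical base model
stage: sft
do_train: true
finetuning_type: lora
dataset: glaive_toolcall_zh
template: qwen
output_dir: saves/glaive-toolcall-zh-lora
```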
The dataset is primarily intended for text analysis and processing, containing text content, metadata, and vector information. For each entry, the metadata records the answer, an identifier, a prefix, the question itself, a school ID, a sequence number, the source, the tokenized question, a URL, and the associated vector data. The dataset is suitable for training models for text understanding and related tasks.
---
annotations_creators:
- crowdsourced
language_creators:
- expert-generated
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: wikihop
pretty_name: WikiHop
tags:
- multi-hop
dataset_info:
- config_name: original
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: answer
    dtype: string
  - name: candidates
    sequence: string
  - name: supports
    sequence: string
  - name: annotations
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 325952974
    num_examples: 43738
  - name: validation
    num_bytes: 41246536
    num_examples: 5129
  download_size: 339843061
  dataset_size: 367199510
- config_name: masked
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: candidates
    sequence: string
  - name: supports
    sequence: string
  - name: annotations
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 348249138
    num_examples: 43738
  - name: validation
    num_bytes: 44066862
    num_examples: 5129
  download_size: 339843061
  dataset_size: 392316000
---

# Dataset Card for WikiHop

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [QAngaroo](http://qangaroo.cs.ucl.ac.uk/)
- **Repository:** [If the dataset is hosted on github or has a github homepage, add URL here]()
- **Paper:** [Constructing Datasets for Multi-hop Reading Comprehension Across Documents](https://arxiv.org/abs/1710.06481)
- **Leaderboard:** [leaderboard](http://qangaroo.cs.ucl.ac.uk/leaderboard.html)
- **Point of Contact:** [Johannes Welbl](j.welbl@cs.ucl.ac.uk)

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
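Given the features declared in the WikiHop card's metadata (`id`, `query`, `answer`, `candidates`, `supports`, `annotations`), one useful sanity check is that the gold answer appears among the candidates. The example instance below is invented for illustration; only its shape follows the declared features.

```python
def valid_instance(ex: dict) -> bool:
    """Sanity-check a WikiHop example: the gold answer must be a candidate."""
    return ex["answer"] in ex["candidates"]

example = {  # hypothetical instance shaped like the `original` config's features
    "id": "WH_train_0",
    "query": "country_of_citizenship juan rodolfo wilcock",
    "answer": "argentina",
    "candidates": ["argentina", "italy", "spain"],
    "supports": ["Juan Rodolfo Wilcock was an Argentine writer ..."],
    "annotations": [],
}
```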
This is a medical question‑answering dataset focusing on the Huangdi Neijing, supporting both Chinese and English.
Needle In A Multimodal Haystack (MM‑NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing multimodal large language models (MLLMs) in understanding long multimodal documents. The benchmark requires models to answer specific questions based on key information scattered throughout multimodal documents. MM‑NIAH's evaluation data comprises three tasks: retrieval, counting, and reasoning. Key information (called “needles”) is embedded in the document's text or images; those inserted into text are referred to as text needles, and those inserted into images as image needles. Experimental results indicate that current MLLMs perform poorly when handling image‑based key information.
Based on Korean Wikipedia data, this dataset is processed into a question‑answer format. It is intended to be processed with code rather than a language model, and new processing approaches will be uploaded as new versions.