High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

medical-qa-id-filtered-split

This medical question-answering dataset contains system prompts, question IDs, question texts, original answer texts, answer lengths, and other features. It is split into training, validation, and test sets of 89,101, 4,950, and 4,951 samples respectively (roughly a 90/5/5 split). The download size is 42,351,649 bytes and the total size is 83,382,248 bytes. The source is https://huggingface.co/datasets/lintangbs/medical-qa-id-llama; preprocessing removes empty lines and limits the maximum token count to 1,024.
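The two preprocessing steps named in the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset author's actual pipeline: the field names `question` and `answer` are hypothetical, whitespace splitting stands in for whatever tokenizer was really used, and the sketch assumes over-length records are dropped rather than truncated, which the description does not specify.

```python
MAX_TOKENS = 1024  # limit stated in the dataset description


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens.

    Assumption: a stand-in for the real tokenizer, which is not named.
    """
    return len(text.split())


def preprocess(records: list[dict]) -> list[dict]:
    """Clean each record's text fields, then drop records over the limit.

    Assumption: records use hypothetical "question" and "answer" keys,
    and the limit applies to their combined token count.
    """
    kept = []
    for record in records:
        cleaned = {
            "question": remove_empty_lines(record["question"]),
            "answer": remove_empty_lines(record["answer"]),
        }
        total = count_tokens(cleaned["question"]) + count_tokens(cleaned["answer"])
        if total <= MAX_TOKENS:
            kept.append(cleaned)
    return kept


if __name__ == "__main__":
    # Toy records for illustration only; not taken from the dataset.
    sample = [
        {"question": "What causes a fever?\n\n", "answer": "Often an infection.\n\nSee a doctor."},
        {"question": "word " * 2000, "answer": "too long to keep"},
    ]
    for row in preprocess(sample):
        print(row)
```

Swapping `count_tokens` for a call to the actual model tokenizer would make the limit match the published figure more closely; the whitespace count here only approximates it.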

Hugging Face

bigbio/med_qa

Medical QA

Multilingual Processing

We present MedQA, the first free-form multiple-choice open-domain QA dataset for medicine, derived from professional medical examinations. It covers three languages: English, Simplified Chinese, and Traditional Chinese (Taiwan), with 12,723, 34,251, and 14,123 questions respectively. In addition to the QA pairs, we release a large corpus of medical text extracted from textbooks to support reading-comprehension models.

Hugging Face
