
medical-qa-id-filtered-split

This dataset is a medical question-answering collection. Each record contains a system prompt, a question ID, the question text, the original answer text, the answer length, and two index columns. It is split into training, validation, and test sets of 89,101, 4,950, and 4,951 samples respectively (99,002 in total). The download size is 42,351,649 bytes and the total size is 83,382,248 bytes. The dataset is derived from https://huggingface.co/datasets/lintangbs/medical-qa-id-llama; preprocessing removed empty lines and limited the maximum token count to 1,024.

Source

huggingface

Created

Nov 19, 2024

Updated

Nov 30, 2024


Dataset Overview

Dataset Information

Feature Fields:
- Unnamed: 0: data type int64
- system_prompt: data type string
- qas_id: data type string
- question_text: data type string
- orig_answer_texts: data type string
- answer_lengths: data type float64
- __index_level_0__: data type int64
Dataset Split:
- Training Set:
  - Sample count: 89,101
  - Bytes: 74,957,465
- Validation Set:
  - Sample count: 4,950
  - Bytes: 4,202,516
- Test Set:
  - Sample count: 4,951
  - Bytes: 4,222,267
Dataset Size:
- Download size: 42,351,649 bytes
- Total size: 83,382,248 bytes
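
The size figures above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers from this card; the helper names `to_mib` and `check_sizes` are illustrative and not part of the dataset. The gap between download size and total size presumably reflects compressed files on disk versus the loaded data, but the card does not say so explicitly.

```python
# Figures copied from the "Dataset Split" and "Dataset Size" lists above.
SPLIT_BYTES = {
    "train": 74_957_465,
    "validation": 4_202_516,
    "test": 4_222_267,
}
TOTAL_BYTES = 83_382_248
DOWNLOAD_BYTES = 42_351_649


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1,048,576 bytes)."""
    return num_bytes / (1024 ** 2)


def check_sizes() -> bool:
    """Return True when the per-split byte counts add up to the stated total."""
    return sum(SPLIT_BYTES.values()) == TOTAL_BYTES


if __name__ == "__main__":
    print("splits sum to total:", check_sizes())
    print(f"total size:    {to_mib(TOTAL_BYTES):.1f} MiB")
    print(f"download size: {to_mib(DOWNLOAD_BYTES):.1f} MiB")
```

With the numbers listed here, the three split sizes add up exactly to the stated total.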

Configuration Information

Configuration Name: default
- Data File Paths:
  - Training: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
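
The file patterns above map directly onto the `data_files` argument of the Hugging Face `datasets` library. The following is a minimal sketch, not a confirmed recipe: the card does not give the full repository ID of this filtered dataset, so `repo_id` is left as a parameter, and the example file name in the usage note is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Glob patterns copied from the "default" configuration above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose pattern matches a repository file path, if any."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


def load_splits(repo_id: str):
    """Load all three splits from the Hub.

    Assumption: `repo_id` is the full "<owner>/<name>" ID of this dataset,
    which the card does not state. Requires the `datasets` package and
    network access, so the import is done lazily.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=DATA_FILES)
```

For example, `split_for_path("data/train-00000-of-00001.parquet")` returns `"train"` for that hypothetical shard name, while a path outside `data/` returns `None`.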

Dataset Processing

Original Dataset: lintangbs/medical-qa-id-llama
Processing Details:
- Removed empty lines
- Limited maximum token count to 1,024 to fit smaller models
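
The two processing steps can be sketched as follows. Several details are assumptions, since the card does not specify them: whitespace splitting stands in for the real tokenizer, the limit is applied by dropping over-long records rather than truncating them, and the count covers the question and answer fields only.

```python
MAX_TOKENS = 1024  # limit stated in the processing details


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def count_tokens(text: str) -> int:
    """Assumption: whitespace tokens approximate the unspecified tokenizer."""
    return len(text.split())


def preprocess(records, max_tokens: int = MAX_TOKENS):
    """Clean each record, then keep only those within the token limit."""
    kept = []
    for rec in records:
        cleaned = dict(rec)
        cleaned["question_text"] = remove_empty_lines(rec["question_text"])
        cleaned["orig_answer_texts"] = remove_empty_lines(rec["orig_answer_texts"])
        total = count_tokens(cleaned["question_text"]) + count_tokens(
            cleaned["orig_answer_texts"]
        )
        # Assumption: over-long records are filtered out, not truncated.
        if total <= max_tokens:
            kept.append(cleaned)
    return kept
```

Swapping `count_tokens` for a model-specific tokenizer would change which records pass the limit, so the counts produced by this sketch should not be expected to match the published split sizes.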

Dataset Split Ratios

Training: 90%
Validation: 5%
Test: 5%
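
The stated ratios are rounded. A short check against the sample counts listed under "Dataset Split" shows how closely they hold; the function name `split_ratios` is illustrative.

```python
# Sample counts copied from the "Dataset Split" list.
SPLIT_COUNTS = {"train": 89_101, "validation": 4_950, "test": 4_951}


def split_ratios(counts):
    """Return each split's share of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


ratios = split_ratios(SPLIT_COUNTS)

if __name__ == "__main__":
    for name, share in ratios.items():
        print(f"{name}: {share:.2%}")
```

The counts total 99,002 samples, so the shares round to 90%, 5%, and 5% rather than matching them exactly.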
