Dataset asset | Open Source Community | Natural Language Processing | Medical QA
medical-qa-id-filtered-split
This dataset is a medical question-answering collection whose records contain a system prompt, question ID, question text, original answer text, answer length, and other features. It is split into training, validation, and test sets of 89,101, 4,950, and 4,951 samples respectively. The download size is 42,351,649 bytes and the total size is 83,382,248 bytes. It is derived from https://huggingface.co/datasets/lintangbs/medical-qa-id-llama; preprocessing removed empty lines and capped the maximum token count at 1,024.
Source: huggingface
Created: Nov 19, 2024
Updated: Nov 30, 2024
Dataset Overview
Dataset Information
- Feature Fields:
  - Unnamed: 0 (int64)
  - system_prompt (string)
  - qas_id (string)
  - question_text (string)
  - orig_answer_texts (string)
  - answer_lengths (float64)
  - __index_level_0__ (int64)
- Dataset Split:
  - Training Set:
    - Sample count: 89,101
    - Bytes: 74,957,465
  - Validation Set:
    - Sample count: 4,950
    - Bytes: 4,202,516
  - Test Set:
    - Sample count: 4,951
    - Bytes: 4,222,267
- Dataset Size:
  - Download size: 42,351,649 bytes
  - Total size: 83,382,248 bytes
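The figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the constant names (`FEATURES`, `SPLITS`, `DOWNLOAD_SIZE`, `TOTAL_SIZE`) are made up here, while the values are copied from the listing. It confirms that the per-split byte counts add up to the reported total size.

```python
# Field names and dtypes as listed under "Feature Fields".
FEATURES = {
    "Unnamed: 0": "int64",
    "system_prompt": "string",
    "qas_id": "string",
    "question_text": "string",
    "orig_answer_texts": "string",
    "answer_lengths": "float64",
    "__index_level_0__": "int64",
}

# (sample count, bytes) per split, copied from "Dataset Split".
SPLITS = {
    "train": (89_101, 74_957_465),
    "validation": (4_950, 4_202_516),
    "test": (4_951, 4_222_267),
}

DOWNLOAD_SIZE = 42_351_649  # bytes, from "Dataset Size"
TOTAL_SIZE = 83_382_248     # bytes, from "Dataset Size"

total_samples = sum(count for count, _ in SPLITS.values())
total_bytes = sum(size for _, size in SPLITS.values())

print(f"samples: {total_samples:,}")
print(f"bytes:   {total_bytes:,} (equals total size: {total_bytes == TOTAL_SIZE})")
print(f"download/total ratio: {DOWNLOAD_SIZE / TOTAL_SIZE:.2f}")
```

The download size is roughly half of the total size, which is what one would expect if the files are stored in a compressed format; the listing itself does not say which format is used.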
Configuration Information
- Configuration Name: default
- Data File Paths:
  - Training: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
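The data file paths are glob patterns, so each split is made up of every file whose name matches its pattern. A minimal sketch of that matching, using only the standard library, is shown below. The helper `split_for` and the example file names are hypothetical and serve only to illustrate how the patterns behave; the actual file names in the repository are not given here.

```python
from fnmatch import fnmatchcase

# Split name -> glob pattern, as listed under "Data File Paths".
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}

def split_for(path):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None

# Hypothetical file names, for illustration only.
example_files = [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

for name in example_files:
    print(name, "->", split_for(name))
```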
Dataset Processing
- Original Dataset: lintangbs/medical-qa-id-llama
- Processing Details:
  - Removed empty lines
  - Limited maximum token count to 1,024 to fit smaller models
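The two processing steps can be sketched as follows. This is not the author's actual pipeline. The tokenizer is not named in the description, so `count_tokens` is a whitespace-based stand-in. The sketch also assumes that samples over the limit are dropped rather than truncated, which the "filtered" in the dataset name suggests but does not confirm. All function names are hypothetical.

```python
MAX_TOKENS = 1024  # limit stated in the processing details

def remove_empty_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())

def count_tokens(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())

def preprocess(samples, max_tokens=MAX_TOKENS):
    """Clean each sample, then keep only those within the token limit.

    Assumption: over-length samples are dropped, not truncated.
    """
    cleaned = (remove_empty_lines(s) for s in samples)
    return [s for s in cleaned if count_tokens(s) <= max_tokens]

# Two toy samples: one short text with blank lines, one far over the limit.
samples = ["Question:\n\n\nWhat is anemia?\n", "word " * 2000]
kept = preprocess(samples)
print(len(kept))  # number of samples that survive the filter
print(kept[0])    # the short sample with its blank lines removed
```

With a real model tokenizer the counts would differ from the whitespace approximation, so the stand-in should be swapped out before drawing any conclusions about which samples fit.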
Dataset Split Ratios
- Training: 90%
- Validation: 5%
- Test: 5%
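The ratios above are rounded: with 99,002 samples in total, an exact 90/5/5 split is not possible. The sketch below shows one integer-arithmetic scheme that happens to reproduce the listed sample counts. It is an assumption about how the sizes could have been derived, not a description of the procedure actually used, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n):
    """Split n samples roughly 90/5/5 using integer arithmetic.

    Train takes floor(0.9 * n); the held-out remainder is halved, with any
    odd sample going to the test set.
    """
    train = n * 9 // 10
    held_out = n - train
    validation = held_out // 2
    test = held_out - validation
    return train, validation, test

total = 89_101 + 4_950 + 4_951  # sample counts from the listing
sizes = split_sizes(total)
print(sizes)

# Show each split's share of the whole dataset.
for name, size in zip(("train", "validation", "test"), sizes):
    print(f"{name}: {size / total:.2%}")
```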