medical-qa-id-filtered-split
This dataset is a medical question-answering collection. Each record contains a system prompt, a question ID, the question text, the original answer text, and the answer length, along with two integer columns (Unnamed: 0 and __index_level_0__). It is split into training, validation, and test sets with 89,101, 4,950, and 4,951 samples respectively. The download size is 42,351,649 bytes and the total dataset size is 83,382,248 bytes. The source dataset is https://huggingface.co/datasets/lintangbs/medical-qa-id-llama; preprocessing removed empty lines and limited the maximum token count to 1,024.
Description
Dataset Overview
Dataset Information
- Feature Fields:
  - Unnamed: 0: int64
  - system_prompt: string
  - qas_id: string
  - question_text: string
  - orig_answer_texts: string
  - answer_lengths: float64
  - __index_level_0__: int64
- Dataset Split:
  - Training Set:
    - Sample count: 89,101
    - Bytes: 74,957,465
  - Validation Set:
    - Sample count: 4,950
    - Bytes: 4,202,516
  - Test Set:
    - Sample count: 4,951
    - Bytes: 4,222,267
- Dataset Size:
  - Download size: 42,351,649 bytes
  - Total size: 83,382,248 bytes
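The field list above can be read as a simple record schema. The following is a minimal sketch of a schema check, assuming the int64, string, and float64 columns map to Python `int`, `str`, and `float`. The helper `schema_problems` and the sample record are hypothetical and only illustrate the shape of a row; the values are placeholders, not real dataset content.

```python
# Expected column types, transcribed from the feature list.
# Assumed mapping: int64 -> int, string -> str, float64 -> float.
EXPECTED_FIELDS = {
    "Unnamed: 0": int,
    "system_prompt": str,
    "qas_id": str,
    "question_text": str,
    "orig_answer_texts": str,
    "answer_lengths": float,
    "__index_level_0__": int,
}


def schema_problems(record: dict) -> list:
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


# Hypothetical record with placeholder values.
example = {
    "Unnamed: 0": 0,
    "system_prompt": "placeholder system prompt",
    "qas_id": "placeholder-id",
    "question_text": "placeholder question",
    "orig_answer_texts": "placeholder answer",
    "answer_lengths": 2.0,
    "__index_level_0__": 0,
}

print(schema_problems(example))
```

A check like this can be run over a handful of rows after loading to confirm that the columns arrive with the documented types.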
Configuration Information
- Configuration Name: default
- Data File Paths:
  - Training: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
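The data file paths in the default configuration are glob patterns, one per split. As a rough illustration of how such patterns assign files to splits, the sketch below matches file names against them using the standard-library `fnmatch` module. The shard file names are hypothetical (the actual names and file format are not stated here), and `split_for` is an illustrative helper, not part of any library.

```python
from fnmatch import fnmatch

# Glob patterns from the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path: str):
    """Return the split whose pattern matches the path, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical shard names, used only to exercise the patterns.
for path in [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]:
    print(path, "->", split_for(path))
```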
Dataset Processing
- Original Dataset: lintangbs/medical-qa-id-llama
- Processing Details:
  - Removed empty lines
  - Limited maximum token count to 1,024 to fit smaller models
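The two processing steps above could be sketched as follows. This is not the pipeline that produced the dataset; several details are assumptions: the tokenizer is unspecified, so a whitespace split stands in for it; the limit is assumed to drop over-long records rather than truncate them; and the limit is assumed to apply to the question and answer text combined. The function names are hypothetical.

```python
MAX_TOKENS = 1024  # limit stated in the processing details


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def count_tokens(text: str) -> int:
    # ASSUMPTION: a whitespace split stands in for the real tokenizer,
    # which is not specified.
    return len(text.split())


def preprocess(records, text_fields=("question_text", "orig_answer_texts")):
    """Clean the text fields and keep records within the token limit.

    ASSUMPTION: over-long records are filtered out, not truncated, and the
    limit applies to the listed text fields combined.
    """
    kept = []
    for record in records:
        cleaned = dict(record)
        for field in text_fields:
            cleaned[field] = remove_empty_lines(cleaned[field])
        n_tokens = sum(count_tokens(cleaned[field]) for field in text_fields)
        if n_tokens <= MAX_TOKENS:
            kept.append(cleaned)
    return kept


# Placeholder records: one short, one far over the limit.
records = [
    {"question_text": "line one\n\n   \nline two", "orig_answer_texts": "short answer"},
    {"question_text": "q", "orig_answer_texts": "word " * 2000},
]
result = preprocess(records)
print(len(result))
```

Swapping `count_tokens` for the target model's tokenizer would change which records pass the limit, so the counts here are only indicative.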
Dataset Split Ratios
- Training: 90%
- Validation: 5%
- Test: 5%
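The ratios above follow from the sample counts listed under Dataset Split. A short check of the arithmetic, using only the counts given in this card:

```python
# Sample counts from the Dataset Split section.
split_counts = {"train": 89_101, "validation": 4_950, "test": 4_951}

total = sum(split_counts.values())
split_ratios = {name: count / total for name, count in split_counts.items()}

print(f"total samples: {total:,}")
for name, ratio in split_ratios.items():
    print(f"{name}: {ratio:.2%}")
```

The counts do not divide exactly into 90/5/5, so the stated percentages are rounded values.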
Source
Organization: huggingface
Created: 11/19/2024