Dataset asset: Open Source Community, Natural Language Processing, Medical QA

medical-qa-id-filtered-split

This dataset is a medical question‑answering collection containing system prompts, question IDs, question texts, original answer texts, answer lengths, and other features. It is split into training, validation, and test sets with 89,101, 4,950 and 4,951 samples respectively. The download size is 42,351,649 bytes and the total size is 83,382,248 bytes. The source is https://huggingface.co/datasets/lintangbs/medical-qa-id-llama, and preprocessing steps include removing empty lines and limiting the maximum token count to 1,024.

Source
huggingface
Created
Nov 19, 2024
Updated
Nov 30, 2024

Dataset Overview

Dataset Information

  • Feature Fields:

    • Unnamed: 0: data type int64
    • system_prompt: data type string
    • qas_id: data type string
    • question_text: data type string
    • orig_answer_texts: data type string
    • answer_lengths: data type float64
    • __index_level_0__: data type int64
  • Dataset Split:

    • Training Set:
      • Sample count: 89,101
      • Bytes: 74,957,465
    • Validation Set:
      • Sample count: 4,950
      • Bytes: 4,202,516
    • Test Set:
      • Sample count: 4,951
      • Bytes: 4,222,267
  • Dataset Size:

    • Download size: 42,351,649 bytes
    • Total size: 83,382,248 bytes
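The `Unnamed: 0` and `__index_level_0__` fields look like index columns carried over from a tabular export; that reading is an assumption, since the card does not explain them. A minimal sketch in plain Python that drops them from a record before use follows. The helper name `drop_index_columns` and the sample record are hypothetical and are not taken from the dataset.

```python
# Columns that appear to be leftover index columns (an assumption, not
# something the dataset card states).
INDEX_ARTIFACT_COLUMNS = ("Unnamed: 0", "__index_level_0__")


def drop_index_columns(record: dict) -> dict:
    """Return a copy of `record` without the index-artifact columns."""
    return {k: v for k, v in record.items() if k not in INDEX_ARTIFACT_COLUMNS}


# Made-up record carrying the seven listed fields, for illustration only.
example = {
    "Unnamed: 0": 0,
    "system_prompt": "placeholder system prompt",
    "qas_id": "placeholder-id",
    "question_text": "placeholder question",
    "orig_answer_texts": "placeholder answer",
    "answer_lengths": 18.0,
    "__index_level_0__": 0,
}

cleaned = drop_index_columns(example)
print(sorted(cleaned))
```

The five remaining fields are the ones that carry question-answering content.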

Configuration Information

  • Configuration Name: default
    • Data File Paths:
      • Training: data/train-*
      • Validation: data/validation-*
      • Test: data/test-*
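As a sketch of how the file patterns above map onto splits, the snippet below passes them to `load_dataset` from the Hugging Face `datasets` library. The card does not give the namespace that hosts this dataset, so the repository ID is left as a parameter; `load_splits` is a hypothetical helper name.

```python
# Glob patterns copied from the "default" configuration.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def load_splits(repo_id: str):
    """Load all three splits from the Hugging Face Hub.

    Requires the `datasets` package and network access. `repo_id` must be
    the full "<namespace>/<name>" identifier, which the card does not state.
    """
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=DATA_FILES)


# Example call (the namespace is a placeholder, not a real value):
# splits = load_splits("<namespace>/medical-qa-id-filtered-split")
# print(splits["train"].num_rows)
```

Because the configuration is named `default`, calling `load_dataset(repo_id)` without `data_files` should resolve to the same files; the explicit mapping only makes the split-to-path relationship visible.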

Dataset Processing

  • Original Dataset: lintangbs/medical-qa-id-llama
  • Processing Details:
    • Removed empty lines
    • Limited maximum token count to 1,024 to fit smaller models
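The two processing steps can be sketched as below. The card names neither the tokenizer nor the fields that were counted, and it does not say whether over-length samples were dropped or truncated. This sketch therefore assumes whitespace tokens as a stand-in, counts the three text fields together, and filters rather than truncates. All function names are hypothetical.

```python
MAX_TOKENS = 1024  # limit stated in the processing details

# Fields counted toward the limit (an assumption).
TEXT_FIELDS = ("system_prompt", "question_text", "orig_answer_texts")


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated tokens.

    A real pipeline would use the target model's tokenizer instead.
    """
    return len(text.split())


def within_limit(example: dict, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the combined text fields fit within `max_tokens`."""
    total = sum(count_tokens(example[field]) for field in TEXT_FIELDS)
    return total <= max_tokens


def preprocess(examples: list) -> list:
    """Clean the text fields, then keep only examples within the limit."""
    cleaned = [
        {**ex, **{f: remove_empty_lines(ex[f]) for f in TEXT_FIELDS}}
        for ex in examples
    ]
    return [ex for ex in cleaned if within_limit(ex)]
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`; swapping `count_tokens` for a model tokenizer changes which samples survive.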

Dataset Split Ratios

  • Training: 90%
  • Validation: 5%
  • Test: 5%
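
The ratios above are rounded. A quick check against the sample and byte counts listed under Dataset Split (variable names here are illustrative):

```python
# Sample and byte counts copied from the Dataset Split section.
SPLIT_COUNTS = {"train": 89_101, "validation": 4_950, "test": 4_951}
SPLIT_BYTES = {"train": 74_957_465, "validation": 4_202_516, "test": 4_222_267}

total_samples = sum(SPLIT_COUNTS.values())
total_bytes = sum(SPLIT_BYTES.values())

# Fraction of all samples that falls into each split.
ratios = {name: count / total_samples for name, count in SPLIT_COUNTS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {SPLIT_COUNTS[name]:,} samples ({ratio:.2%})")
print(f"total: {total_samples:,} samples, {total_bytes:,} bytes")
```

The three splits add up to 99,002 samples, and the per-split byte counts sum to the stated total size of 83,382,248 bytes.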