Tags: Open Source Community, Multi‑Task Learning, Chest X‑ray Image Analysis

MedHK23/OmniFM-Dr

The OmniFM‑Dr framework introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation, and report generation. The dataset aggregates several public sources, including MIMIC‑CXR, VinDr‑CXR, and ChestX‑Det10. A single image may serve multiple tasks (e.g., report generation and classification). Due to compliance requirements, the full dataset is not publicly released; five samples per task are provided instead. The dataset description covers the file formats and usage of each sub‑dataset.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 23, 2024

Dataset Description

OmniFM‑Dr introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation, and report generation. The collection combines several publicly available datasets, such as MIMIC‑CXR, VinDr‑CXR, and ChestX‑Det10. Each image can contribute to multiple tasks, for example report generation and classification.

Note: Due to data‑compliance and regulatory constraints, the full dataset is currently unavailable. For each task, five example samples are provided.

Dataset Details

  • MIMIC: 377,110 radiographs from 227,835 radiology studies. Each image is paired with lesion classifications and the corresponding radiology report, supporting multi‑label classification and report‑generation tasks.
  • PadChest: 160,840 images from 67,000 patients covering six view positions, labeled with various radiological findings for classification.
  • CXR‑AL14: Large‑scale chest X‑ray detection dataset with >140,000 images and 253,844 bounding boxes across 14 abnormality categories.
  • VinDr‑CXR: Chest radiographs for classification of 28 common diseases; 15,000 scans for training. Eight diseases with bounding boxes are selected for localization.
  • ChestX‑Det: 3,578 images from NIH ChestXray14 covering 13 diseases; seven diseases with bounding boxes are used for localization.
  • CheXmask: Lung and heart segmentation masks for 676,803 images drawn from six public databases. Of these, 224,316 are used for training, and 10,000 from ChestXray14 serve as downstream evaluation.
  • SIIM: From the SIIM‑ACR pneumothorax segmentation challenge, 12,090 images with ~3,000 positive cases and associated masks.

Dataset Structure

  • MIMIC:
    • MIMIC_classification_report-generation_xxx.tsv: classification & report‑generation (fields: id, report, "label1 && label2", subject_id, study_id, dicom_id).
    • MIMIC_classification-location_xxx.tsv: localization VQA (fields: id, "label1,severity && label2,severity", subject_id, study_id, dicom_id).
    • MIMIC_classification-severity_xxx.tsv: severity VQA (fields: id, "label, location1 & location2", subject_id, study_id, dicom_id).
  • PadChest:
    • Padchest_classification_xxx.tsv: classification (fields: id, "label1 && label2", subject_id, study_id, dicom_id).
  • CXR‑AL14:
    • CXR_AL14_localization_xxx.tsv: localization & classification (fields: id, label, "x1,y1,x2,y2", image_id).
  • VinDr‑CXR:
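    • VinDr_CXR_localization_xxx.tsv: localization & classification (same fields as CXR_AL14_localization_xxx.tsv).
  • ChestX‑Det:
    • ChestX_Det_localization_xxx.tsv: localization & classification (same fields as CXR_AL14_localization_xxx.tsv).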
  • CheXmask:
    • CheXmask_segmentation_xxx.tsv: segmentation (fields: id, label, "x1,y1,x2,y2,…,x30,y30", subject_id, study_id, dicom_id).
  • SIIM:
    • SIIM_segmentation_xxx.tsv: segmentation (same fields as CheXmask).
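
The field layouts above can be unpacked with a few small helpers. The following is a minimal sketch, not code from the OmniFM‑Dr repository. It assumes that the files are tab-separated with no header row, that multi-label fields use " && " as the separator, and that coordinates are plain comma-separated numbers (whether they are pixel values or normalized is not stated in the description). All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_labels(field: str) -> List[str]:
    """Split a multi-label field such as "label1 && label2" into labels."""
    return [part.strip() for part in field.split("&&") if part.strip()]


def parse_box(field: str) -> Tuple[float, float, float, float]:
    """Parse a bounding-box field of the form "x1,y1,x2,y2"."""
    values = [float(v) for v in field.split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(values)}")
    x1, y1, x2, y2 = values
    return (x1, y1, x2, y2)


def parse_polygon(field: str) -> List[Tuple[float, float]]:
    """Parse a segmentation field "x1,y1,x2,y2,..." into (x, y) points."""
    values = [float(v) for v in field.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("polygon field must hold an even number of values")
    return list(zip(values[0::2], values[1::2]))


def parse_localization_row(line: str) -> Dict[str, object]:
    """Parse one row with the fields id, label, "x1,y1,x2,y2", image_id.

    Assumes the row is tab-separated (an assumption, based on the .tsv suffix).
    """
    sample_id, label, box, image_id = line.rstrip("\n").split("\t")
    return {
        "id": sample_id,
        "label": label,
        "box": parse_box(box),
        "image_id": image_id,
    }
```

For example, `split_labels` applies to the third field of the MIMIC and Padchest classification files, `parse_localization_row` to the CXR‑AL14, VinDr‑CXR, and ChestX‑Det files, and `parse_polygon` to the contour field of the CheXmask and SIIM files.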

Dataset Usage

Run data_prepare.py to build training batches for all tasks. Each line of the prepared data should contain five fields: id, instruction, label, image_id, and task_type.
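
To illustrate the five-field line format, here is a minimal sketch of how one such line could be assembled. It is not the logic of data_prepare.py, whose behavior is not documented here. The instruction templates, the task_type values, the tab delimiter, and the function names are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Field order as listed in the usage note.
FIELDS = ["id", "instruction", "label", "image_id", "task_type"]

# Hypothetical instruction templates; the real prompts may differ.
INSTRUCTIONS = {
    "classification": "What diseases are shown in the image?",
    "localization": "Where is the abnormality located in the image?",
    "segmentation": "Outline the requested region in the image.",
    "report_generation": "Describe the findings in the image.",
}


def make_record(sample_id: str, label: str, image_id: str,
                task_type: str) -> Dict[str, str]:
    """Build one record with the five expected fields."""
    if task_type not in INSTRUCTIONS:
        raise KeyError(f"unknown task_type: {task_type}")
    return {
        "id": sample_id,
        "instruction": INSTRUCTIONS[task_type],
        "label": label,
        "image_id": image_id,
        "task_type": task_type,
    }


def write_records(records: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write records as tab-separated lines, one sample per line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    for record in records:
        writer.writerow(record)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_records(
        [make_record("1", "Nodule", "img_001", "classification")], buffer
    )
    print(buffer.getvalue(), end="")
```

Keeping task_type as an explicit column lets a single file mix samples from every task, so a data loader can route each line to the matching loss or decoding path.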
