MedHK23/OmniFM-Dr

The OmniFM‑Dr framework introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation and report generation. The dataset aggregates several public sources such as MIMIC‑CXR, VinDr‑CXR and ChestX‑Det10. Each image may serve multiple tasks (e.g., report generation and classification). Due to compliance requirements, the full dataset is not publicly released, but five samples per task are provided. The dataset description details file formats and usage for each sub‑dataset.

Updated 1/23/2024

Dataset Description

OmniFM‑Dr introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation, and report generation. The collection combines several publicly available datasets, such as MIMIC‑CXR, VinDr‑CXR, and ChestX‑Det10. Each image can contribute to multiple tasks, for example report generation and classification.

Note: Due to data‑compliance and regulatory constraints, the full dataset is currently unavailable. Five samples are provided for each task.

Dataset Details

MIMIC: 377,110 radiographs from 227,835 radiology studies. Each image is paired with lesion classifications and the corresponding radiology report, supporting multi‑label classification and report‑generation tasks.
Padchest: 160,840 images from 67,000 patients covering six view positions. Various radiological findings are labeled for classification.
CXR‑AL14: Large‑scale chest X‑ray detection dataset with >140,000 images and 253,844 bounding boxes across 14 abnormality categories.
VinDr‑CXR: Chest radiographs for classification of 28 common diseases; 15,000 scans for training. Eight diseases with bounding boxes are selected for localization.
ChestX‑Det: 3,578 images from NIH ChestXray14 covering 13 diseases; seven diseases with bounding boxes are used for localization.
CheXmask: Lung and heart segmentation masks for 676,803 images drawn from six public databases. Of these, 224,316 are used for training, and 10,000 from ChestXray14 serve as downstream evaluation.
SIIM: From the SIIM‑ACR pneumothorax segmentation challenge, 12,090 images with ~3,000 positive cases and associated masks.

Dataset Structure

MIMIC:
- MIMIC_classification_report-generation_xxx.tsv: classification & report‑generation (fields: id, report, "label1 && label2", subject_id, study_id, dicom_id).
- MIMIC_classification-location_xxx.tsv: localization VQA (fields: id, "label1,severity && label2,severity", subject_id, study_id, dicom_id).
- MIMIC_classification-severity_xxx.tsv: severity VQA (fields: id, "label, location1 & location2", subject_id, study_id, dicom_id).
Padchest:
- Padchest_classification_xxx.tsv: classification (fields: id, "label1 && label2", subject_id, study_id, dicom_id).
CXR‑AL14:
- CXR_AL14_localization_xxx.tsv: localization & classification (fields: id, label, "x1,y1,x2,y2", image_id).
VinDr‑CXR:
- VinDr_CXR_localization_xxx.tsv: localization & classification (fields: id, label, "x1,y1,x2,y2", image_id).
ChestX‑Det:
- ChestX_Det_localization_xxx.tsv: localization & classification (fields: id, label, "x1,y1,x2,y2", image_id).
CheXmask:
- CheXmask_segmentation_xxx.tsv: segmentation (fields: id, label, "x1,y1,x2,y2,…,x30,y30", subject_id, study_id, dicom_id).
SIIM:
- SIIM_segmentation_xxx.tsv: segmentation (same fields as CheXmask).
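
The field layouts listed above can be read with a few small helpers. The following is a minimal sketch, not the project's own loader. It assumes the files are tab‑separated with fields in the listed order, that "&&" separates multiple labels, and that box and contour coordinates are comma‑separated numbers. All function names are hypothetical.

```python
from typing import List, Tuple


def split_labels(field: str) -> List[str]:
    """Split a multi-label field such as 'label1 && label2' into a list."""
    return [part.strip() for part in field.split("&&") if part.strip()]


def parse_box(field: str) -> Tuple[float, float, float, float]:
    """Parse an 'x1,y1,x2,y2' bounding-box field into four floats."""
    values = [float(v) for v in field.split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 box coordinates, got {len(values)}")
    x1, y1, x2, y2 = values
    return x1, y1, x2, y2


def parse_polygon(field: str) -> List[Tuple[float, float]]:
    """Parse an 'x1,y1,...,xN,yN' contour field into (x, y) points."""
    values = [float(v) for v in field.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("contour field must hold an even number of values")
    return list(zip(values[0::2], values[1::2]))


def parse_localization_line(line: str) -> dict:
    """Parse one line of a *_localization_xxx.tsv file (assumed layout:
    id, label, 'x1,y1,x2,y2', image_id, separated by tabs)."""
    sample_id, label, box, image_id = line.rstrip("\n").split("\t")
    return {
        "id": sample_id,
        "label": label,
        "box": parse_box(box),
        "image_id": image_id,
    }
```

For the segmentation files, parse_polygon would be applied to the third field; the listing above shows 30 points per contour, but the helper does not enforce that count.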

Dataset Usage

Run data_prepare.py to build training batches for all tasks. Each output line should contain five fields: id, instruction, label, image_id, and task_type.
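
To illustrate the five‑field record described above, here is a minimal sketch of how one sample might be turned into a unified training line. It is not the content of data_prepare.py. The instruction templates, the function names, and the choice of tab‑separated output are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical instruction templates; the project's actual prompts may differ.
INSTRUCTIONS = {
    "classification": "What diseases are visible in the image?",
    "localization": "Where is {label} located in the image?",
}

# Field order taken from the usage description.
FIELDS = ["id", "instruction", "label", "image_id", "task_type"]


def make_record(sample_id, label, image_id, task_type, target=None):
    """Build one record holding the five fields in FIELDS."""
    instruction = INSTRUCTIONS[task_type].format(label=label)
    return {
        "id": sample_id,
        "instruction": instruction,
        # Assumption: for localization the target is the box string,
        # otherwise the label text itself.
        "label": target if target is not None else label,
        "image_id": image_id,
        "task_type": task_type,
    }


def write_records(records, handle):
    """Write records as tab-separated lines in FIELDS order."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in records:
        writer.writerow([record[name] for name in FIELDS])


# Usage: one localization sample written to an in-memory buffer.
example = make_record("0", "Nodule", "img_001", "localization",
                      target="10,20,110,220")
buffer = io.StringIO()
write_records([example], buffer)
print(buffer.getvalue(), end="")
```

Keeping task_type in every line lets a single file mix samples from all tasks, which matches the joint‑training setup the dataset is built for.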

Topics

Chest X‑ray Image Analysis

Multi‑Task Learning

Source

Organization: hugging_face

Created: Unknown