MedHK23/OmniFM-Dr
The OmniFM‑Dr framework introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation and report generation. The dataset aggregates several public sources such as MIMIC‑CXR, VinDr‑CXR and ChestX‑Det10. Each image may serve multiple tasks (e.g., report generation and classification). Due to compliance requirements the full dataset is not publicly released, but five sample examples per task are provided. The dataset description details file formats and usage for each sub‑dataset.
Dataset Description
OmniFM‑Dr introduces a multi‑task chest X‑ray dataset for joint training of disease classification, localization, segmentation, and report generation. The collection combines several publicly available datasets, such as MIMIC‑CXR, VinDr‑CXR, and ChestX‑Det10. Each image can contribute to multiple tasks, for example report generation and classification.
Note: Due to data‑compliance and regulatory constraints, the full dataset is currently unavailable. For each task, five example samples are provided.
Dataset Details
- MIMIC: 377,110 radiographs from 227,835 radiology studies. Each image is paired with lesion classifications and the corresponding radiology report, supporting multi‑label classification and report‑generation tasks.
- PadChest: 160,840 images from 67,000 patients covering six view positions. Various radiological findings are labeled for classification.
- CXR‑AL14: Large‑scale chest X‑ray detection dataset with >140,000 images and 253,844 bounding boxes across 14 abnormality categories.
- VinDr‑CXR: Chest radiographs for classification of 28 common diseases; 15,000 scans for training. Eight diseases with bounding boxes are selected for localization.
- ChestX‑Det: 3,578 images from NIH ChestXray14 covering 13 diseases; seven diseases with bounding boxes are used for localization.
- CheXmask: Lung and heart segmentation masks from six public databases, comprising 676,803 images. Of these, 224,316 are used for training, and 10,000 from ChestXray14 serve as downstream evaluation.
- SIIM: From the SIIM‑ACR pneumothorax segmentation challenge, 12,090 images with ~3,000 positive cases and associated masks.
Dataset Structure
- MIMIC:
MIMIC_classification_report-generation_xxx.tsv: classification & report‑generation (fields: id, report, "label1 && label2", subject_id, study_id, dicom_id).
MIMIC_classification-location_xxx.tsv: localization VQA (fields: id, "label1,severity && label2,severity", subject_id, study_id, dicom_id).
MIMIC_classification-severity_xxx.tsv: severity VQA (fields: id, "label, location1 & location2", subject_id, study_id, dicom_id).
- PadChest:
Padchest_classification_xxx.tsv: classification (fields: id, "label1 && label2", subject_id, study_id, dicom_id).
- CXR‑AL14:
CXR_AL14_localization_xxx.tsv: localization & classification (fields: id, label, "x1,y1,x2,y2", image_id).
- VinDr‑CXR:
VinDr_CXR_localization_xxx.tsv: same fields as CXR_AL14_localization_xxx.tsv (id, label, "x1,y1,x2,y2", image_id).
- ChestX‑Det:
ChestX_Det_localization_xxx.tsv: same fields as CXR_AL14_localization_xxx.tsv (id, label, "x1,y1,x2,y2", image_id).
- CheXmask:
CheXmask_segmentation_xxx.tsv: segmentation (fields: id, label, "x1,y1,x2,y2,…,x30,y30", subject_id, study_id, dicom_id).
- SIIM:
SIIM_segmentation_xxx.tsv: segmentation (same fields as CheXmask).
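The field layouts above can be read with a few small helpers. The following is a minimal sketch, not the project's own loader. It assumes the files are tab-separated with no header row, that columns appear in the order listed, and that coordinates are plain numbers. The function names (`split_labels`, `parse_box`, `parse_polygon`, `read_localization_rows`) and the sample values are hypothetical and used only for illustration.

```python
import csv
import io


def split_labels(field):
    """Split a '&&'-joined label field into a list of stripped labels."""
    return [part.strip() for part in field.split("&&") if part.strip()]


def parse_box(field):
    """Parse an 'x1,y1,x2,y2' field into a tuple of four floats."""
    values = [float(v) for v in field.split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(values)}")
    return tuple(values)


def parse_polygon(field):
    """Parse an 'x1,y1,...,xn,yn' field into a list of (x, y) points."""
    values = [float(v) for v in field.split(",")]
    if len(values) % 2:
        raise ValueError("polygon needs an even number of coordinates")
    return list(zip(values[0::2], values[1::2]))


def read_localization_rows(handle):
    """Yield one dict per row, assuming the columns id, label, box, image_id."""
    reader = csv.reader(handle, delimiter="\t")
    for row_id, label, box, image_id in reader:
        yield {
            "id": row_id,
            "label": label,
            "box": parse_box(box),
            "image_id": image_id,
        }


# Made-up sample row in the assumed localization layout.
sample = "0\tNodule\t10,20,110,220\timg_001\n"
rows = list(read_localization_rows(io.StringIO(sample)))
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample. If the released files turn out to have a header row or a different column order, the unpacking in `read_localization_rows` would need to change accordingly.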
Dataset Usage
Run data_prepare.py to build training batches for all tasks. Each line should contain: id, instruction, label, image_id, and task_type.
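As a rough illustration of the five-field line format, the sketch below writes records as tab-separated lines. It is not the contents of data_prepare.py, whose behavior is not described here. It assumes the five fields are stored one record per line, separated by tabs, in the order listed. The helper name `write_training_lines` and the sample record (including its instruction text) are hypothetical.

```python
import csv
import io

# Field order taken from the usage note; the tab-separated layout is an assumption.
FIELDS = ["id", "instruction", "label", "image_id", "task_type"]


def write_training_lines(records, handle):
    """Write each record as one tab-separated line with the five fields."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    for record in records:
        missing = [name for name in FIELDS if name not in record]
        if missing:
            raise KeyError(f"record is missing fields: {missing}")
        # Keep only the expected fields, in the expected order.
        writer.writerow({name: record[name] for name in FIELDS})


# Made-up record for illustration only.
example = {
    "id": "0",
    "instruction": "Which findings are present?",
    "label": "Nodule",
    "image_id": "img_001",
    "task_type": "classification",
}
buffer = io.StringIO()
write_training_lines([example], buffer)
```

Checking for missing fields before writing makes incomplete records fail early instead of producing lines with empty columns.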
Source
Organization: hugging_face