EHRXQA
EHRXQA: A Multi‑Modal Question Answering Dataset for Electronic Health Records with Chest X‑ray Images
Overview
EHRXQA is a multimodal QA dataset that combines structured electronic health records (EHRs) and chest X‑ray images, aiming to advance joint reasoning between image and tabular modalities in EHR QA systems. The dataset is constructed by integrating two single‑modal resources: 1) the MIMIC‑CXR‑VQA dataset, a novel medical visual QA benchmark specifically designed to enrich the image modality in EHR QA; 2) EHRSQL (MIMIC‑IV), a redesigned tabular EHR QA dataset. Merging these two resources yields a multimodal EHR QA dataset that requires both single‑modal and cross‑modal reasoning.
Updates
- July 24, 2024: Released the EHRXQA dataset on PhysioNet.
- December 12, 2023: Presented our work at the NeurIPS 2023 Datasets and Benchmarks Track.
- October 28, 2023: Published our research paper on arXiv.
Features
- Scripts to download source datasets (MIMIC‑CXR‑JPG, Chest ImaGenome, and MIMIC‑IV).
- Scripts to preprocess source datasets.
- Scripts to build the integrated database (MIMIC‑IV and MIMIC‑CXR).
- Scripts to generate the EHRXQA dataset with answer information.
Installation
For Linux:
Ensure Python 3.8.5 or higher is installed. Set up the environment and install required packages:
# Set up environment
conda create --name ehrxqa python=3.8.5
# Activate environment
conda activate ehrxqa
# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
pip install dask==2022.12.1
Setup
Clone the repository and navigate into it:
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa
Usage
Privacy
We take data privacy seriously. All data accessed through this repository have been carefully prepared to prevent any privacy or data leakage. You can use the data with confidence as all necessary safeguards are in place.
Access Requirements
EHRXQA is built from MIMIC‑CXR‑JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC‑IV (v2.2), all of which require credentialed access on PhysioNet. Consequently, only credentialed users may access the EHRXQA files. To obtain the source datasets, you must satisfy all of the following:
- Become a credentialed PhysioNet user:
  - Register a PhysioNet account: https://physionet.org/register/
  - Follow the credentialing instructions: https://physionet.org/credential-application/
  - Complete the “CITI Data or Specimens for Research Only” course: https://physionet.org/about/citi-course/
- Sign the Data Use Agreement for each project
Accessing the EHRXQA Dataset
While a full release is pending on PhysioNet, we provide partial access through this repository for certified users. To obtain the dataset, run the build_dataset.sh script (requires your PhysioNet credentials):
bash build_dataset.sh
During execution, provide your PhysioNet credentials:
- Username: enter your PhysioNet username and press Enter.
- Password: enter your PhysioNet password and press Enter. The password will not be displayed.
The script performs: 1) downloading source datasets from PhysioNet, 2) preprocessing them, and 3) generating the complete EHRXQA dataset with true answer information.
Dataset Structure
ehrxqa
└── dataset
├── _train.json
├── _valid.json
├── _test.json
├── train.json (available after script execution)
├── valid.json (available after script execution)
└── test.json (available after script execution)
- ehrxqa is the root directory, and the dataset folder contains the JSON files of the EHRXQA dataset.
- _train.json, _valid.json, and _test.json are pre‑release versions that are intentionally incomplete to protect privacy; they exclude certain key information, such as answers.
- After running the main script with valid credentials, the full versions (train.json, valid.json, and test.json) are generated and contain complete information, including answers for each entry.
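The split files above can be loaded with a few lines of Python. The helper below is a minimal sketch (not part of the released code) that assumes the directory layout shown: it prefers the full split file and falls back to the pre‑release version when the build script has not been run yet.

```python
import json
from pathlib import Path

# Hypothetical helper, assuming the ehrxqa/dataset layout shown above.
# Prefers the full split file (e.g., train.json) and falls back to the
# pre-release version (_train.json) if the full file is absent.
def load_split(root, split):
    for name in (f"{split}.json", f"_{split}.json"):
        path = Path(root) / "dataset" / name
        if path.exists():
            with open(path) as f:
                return json.load(f)  # each file holds a list of QA dicts
    raise FileNotFoundError(f"no file found for split '{split}'")
```

Note that the pre‑release files parse identically; they simply lack the answer fields until the main script is run.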
Dataset Description
Each QA sample in EHRXQA is stored in a separate .json file. Every file contains a Python list of dictionaries where each key denotes:
- db_id: string representing the corresponding database ID.
- split: dataset split type (e.g., train, test, validation).
- id: unique identifier for the instance.
- question: paraphrased version of the question.
- template: final question template created by injecting real database values into the label; this is the fully specified and contextualized form.
- query: the corresponding NeuralSQL/SQL query for the question.
- value: specific key‑value pairs related to the question, sampled from the database.
- q_tag: initial sampled question template, serving as the base structure.
- t_tag: sampled time template, providing temporal context and specificity.
- o_tag: sampled operation value for the query, usually containing numeric or calculational aspects needed to form the question.
- v_tag: sampled visual values, including objects, categories, attributes, and comparisons, adding detail to the question.
- tag: composite tag merging the enhanced q_tag with additional elements (t_tag, o_tag, v_tag), representing an intermediate, more specific question template before final templating.
- para_type: source of the paraphrase, either a generic machine‑generated tool or GPT‑4.
- is_impossible: boolean indicating whether the question can be answered using the dataset.
- _gold_program: temporary program used to generate the answer.
After verifying PhysioNet credentials, the create_answer.py script generates:
answer: answer string derived from executing the query.
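Once the full files are generated, the documented keys make simple filtering straightforward. The sketch below uses toy records that mirror the schema above (the values are illustrative, not real dataset entries): it separates answerable questions via the is_impossible flag and tallies paraphrase sources via para_type.

```python
from collections import Counter

# Toy records mirroring the documented keys; values are illustrative only.
samples = [
    {"id": 0, "para_type": "machine", "is_impossible": False, "answer": "3"},
    {"id": 1, "para_type": "gpt4",    "is_impossible": True,  "answer": ""},
    {"id": 2, "para_type": "machine", "is_impossible": False, "answer": "yes"},
]

# Keep only questions that can be answered from the database.
answerable = [s for s in samples if not s["is_impossible"]]

# Count how many questions came from each paraphrasing source.
paraphrase_sources = Counter(s["para_type"] for s in samples)

print(len(answerable))                # 2
print(paraphrase_sources["machine"])  # 2
```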
Example
{
"db_id": "mimic_iv_cxr",
"split": "train",
"id": 0,
"question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
"template": "how many days have passed since the last time patient 18679317 had a chest x‑ray study indicating any anatomicalfinding in 2105?",
"query": "select 1 * ( strftime(%J,current_time) - strftime(%J,t1.studydatetime) ) from ( select tb_cxr.study_id, tb_cxr.studydatetime from tb_cxr where tb_cxr.study_id in ( select distinct tb_cxr.study_id from tb_cxr where tb_cxr.subject_id = 18679317 and strftime(%Y,tb_cxr.studydatetime) = 2105 ) ) as t1 where func_vqa(\"is the chest x‑ray depicting any anatomical findings?\", t1.study_id) = true",
"value": {"patient_id": 18679317},
"q_tag": "how many [unit_count] have passed since the [time_filter_exact1] time patient {patient_id} had a chest x‑ray study indicating any ${category} [time_filter_global1]?",
"t_tag": ["abs-year-in", "", "", "exact-last", ""],
"o_tag": {"unit_count": {"nlq": "days", "sql": "1 * ", "type": "days", "sql_pattern": "[unit_count]"}},
"v_tag": {"object": [], "category": ["anatomicalfinding"], "attribute": []},
"tag": "how many [unit_count:days] have passed since the [time_filter_exact1:exact-last] time patient {patient_id} had a chest x‑ray study indicating any anatomicalfinding [time_filter_global1:abs-year-in]?",
"para_type": "machine",
"is_impossible": false,
"answer": "Will be generated by dataset_builder/generate_answer.py"
}
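In the example above, the template field is the q_tag with its {patient_id}-style slot filled from the value dictionary. The snippet below illustrates that relationship with a hypothetical slot-filling helper (fill_slots is not part of the released code):

```python
import re

# Hypothetical helper: substitute {name}-style slots in a question
# template with the matching entries from a sample's "value" dict.
def fill_slots(template, values):
    return re.sub(r"\{(\w+)\}", lambda m: str(values[m.group(1)]), template)

q_tag = ("how many [unit_count] have passed since the [time_filter_exact1] "
         "time patient {patient_id} had a chest x-ray study?")
print(fill_slots(q_tag, {"patient_id": 18679317}))
```

The bracketed [unit_count]-style slots are left untouched here; in the real pipeline they are resolved by the o_tag and t_tag selections before the final template is produced.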
Versioning
We adopt semantic versioning; the current version is v1.0.0. Typically we maintain and provide only the latest version. In cases of major updates or the need to validate prior research, we may retain older versions for up to one year. See the CHANGELOG for a detailed list of changes per version.
Contributions
Contributions to improve the dataset’s usability and functionality are welcome. Feel free to fork the repository, make changes, and submit a pull request. For substantial changes, open an issue first to discuss the proposed modifications.
Contact
For any questions or concerns regarding this dataset, please contact us at (seongsu@kaist.ac.kr or kyungdaeun@kaist.ac.kr). We appreciate your interest and are happy to help.
Citation
When using the EHRXQA dataset, please cite the following:
@article{bae2023ehrxqa,
title={EHRXQA: A Multi‑Modal Question Answering Dataset for Electronic Health Records with Chest X‑ray Images},
author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
journal={arXiv preprint arXiv:2310.18652},
year={2023}
}
License
The code in this repository is released under the MIT license. The final EHRXQA dataset generated from this code is subject to the terms of the original PhysioNet source datasets: MIMIC‑CXR‑JPG License, Chest ImaGenome License, and MIMIC‑IV License.