EHRXQA
EHRXQA: A Multi‑Modal Question Answering Dataset for Electronic Health Records with Chest X‑ray Images
Overview
EHRXQA is a multimodal QA dataset that combines structured electronic health records (EHRs) and chest X‑ray images, aiming to advance joint reasoning between image and tabular modalities in EHR QA systems. The dataset is constructed by integrating two single‑modal resources: 1) the MIMIC‑CXR‑VQA dataset, a novel medical visual QA benchmark specifically designed to enrich the image modality in EHR QA; 2) EHRSQL (MIMIC‑IV), a redesigned tabular EHR QA dataset. Merging these two resources yields a multimodal EHR QA dataset that requires both single‑modal and cross‑modal reasoning.
Updates
- July 24, 2024: Released the EHRXQA dataset on PhysioNet.
- December 12, 2023: Presented our work at the NeurIPS 2023 Datasets and Benchmarks Track.
- October 28, 2023: Published our research paper on arXiv.
Features
- Scripts to download source datasets (MIMIC‑CXR‑JPG, Chest ImaGenome, and MIMIC‑IV).
- Scripts to preprocess source datasets.
- Scripts to build the integrated database (MIMIC‑IV and MIMIC‑CXR).
- Scripts to generate the EHRXQA dataset with answer information.
Installation
For Linux:
Ensure Python 3.8.5 or higher is installed. Set up the environment and install required packages:
# Set up environment
conda create --name ehrxqa python=3.8.5
# Activate environment
conda activate ehrxqa
# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
pip install dask==2022.12.1
Setup
Clone the repository and navigate into it:
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa
Usage
Privacy
We take data privacy seriously. All data accessed through this repository have been carefully prepared to prevent any privacy or data leakage. You can use the data with confidence as all necessary safeguards are in place.
Access Requirements
EHRXQA is built from MIMIC‑CXR‑JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC‑IV (v2.2), all of which require credentialed access on PhysioNet. Consequently, only credentialed users may access the EHRXQA files. To obtain the source datasets, you must satisfy all of the following:
- Become a credentialed PhysioNet user:
  - Register a PhysioNet account: https://physionet.org/register/
  - Follow the credentialing instructions: https://physionet.org/credential-application/
  - Complete the “CITI Data or Specimens for Research Only” course: https://physionet.org/about/citi-course/
- Sign the Data Use Agreement for each project
Accessing the EHRXQA Dataset
While a full release is pending on PhysioNet, we provide partial access through this repository for certified users. To obtain the dataset, run the build_dataset.sh script (requires your PhysioNet credentials):
bash build_dataset.sh
During execution, provide your PhysioNet credentials:
- Username: enter your PhysioNet username and press Enter.
- Password: enter your PhysioNet password and press Enter. The password will not be displayed.
The script performs: 1) downloading source datasets from PhysioNet, 2) preprocessing them, and 3) generating the complete EHRXQA dataset with true answer information.
Dataset Structure
ehrxqa
└── dataset
├── _train.json
├── _valid.json
├── _test.json
├── train.json (available after script execution)
├── valid.json (available after script execution)
└── test.json (available after script execution)
- ehrxqa is the root directory, and the dataset folder contains the JSON files of the EHRXQA dataset.
- _train.json, _valid.json, and _test.json are pre‑release versions that are intentionally incomplete to protect privacy; they exclude certain key information, such as answers.
- After running the main script with valid credentials, the full versions (train.json, valid.json, and test.json) are generated and contain complete information, including answers for each entry.
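The split files above can be loaded with a few lines of Python. The helper below is a minimal sketch (not part of the released code) that assumes the directory layout shown: it prefers the full split file and falls back to the pre‑release version when the build script has not been run yet.

```python
import json
from pathlib import Path

# Hypothetical helper, assuming the ehrxqa/dataset layout shown above.
# Prefers the full split file (e.g., train.json) and falls back to the
# pre-release version (_train.json) if the full file is absent.
def load_split(root, split):
    for name in (f"{split}.json", f"_{split}.json"):
        path = Path(root) / "dataset" / name
        if path.exists():
            with open(path) as f:
                return json.load(f)  # each file holds a list of QA dicts
    raise FileNotFoundError(f"no file found for split '{split}'")
```

Note that the pre‑release files parse identically; they simply lack the answer fields until the main script is run.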
Dataset Description
Each QA sample in EHRXQA is stored in a separate .json file. Every file contains a Python list of dictionaries where each key denotes:
- db_id: string representing the corresponding database ID.
- split: dataset split type (e.g., train, test, validation).
- id: unique identifier for the instance.
- question: paraphrased version of the question.
- template: final question template created by injecting real database values into the label; this is the fully specified and contextualized form.
- query: the corresponding NeuralSQL/SQL query for the question.
- value: specific key‑value pairs related to the question, sampled from the database.
- q_tag: initial sampled question template, serving as the base structure.
- t_tag: sampled time template, providing temporal context and specificity.
- o_tag: sampled operation value for the query, usually containing numeric or calculational aspects needed to form the question.
- v_tag: sampled visual values, including objects, categories, attributes, and comparisons, adding detail to the question.
- tag: composite tag merging the enhanced q_tag with additional elements (t_tag, o_tag, v_tag), representing an intermediate, more specific question template before final templating.
- para_type: source of the paraphrase, either a generic machine‑generated tool or GPT‑4.
- is_impossible: boolean indicating whether the question can be answered using the dataset.
- _gold_program: temporary program used to generate the answer.
After verifying PhysioNet credentials, the create_answer.py script generates:
answer: answer string derived from executing the query.
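Once the full files are generated, the documented keys make simple filtering straightforward. The sketch below uses toy records that mirror the schema above (the values are illustrative, not real dataset entries): it separates answerable questions via the is_impossible flag and tallies paraphrase sources via para_type.

```python
from collections import Counter

# Toy records mirroring the documented keys; values are illustrative only.
samples = [
    {"id": 0, "para_type": "machine", "is_impossible": False, "answer": "3"},
    {"id": 1, "para_type": "gpt4",    "is_impossible": True,  "answer": ""},
    {"id": 2, "para_type": "machine", "is_impossible": False, "answer": "yes"},
]

# Keep only questions that can be answered from the database.
answerable = [s for s in samples if not s["is_impossible"]]

# Count how many questions came from each paraphrasing source.
paraphrase_sources = Counter(s["para_type"] for s in samples)

print(len(answerable))                # 2
print(paraphrase_sources["machine"])  # 2
```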
Example
{
"db_id": "mimic_iv_cxr",
"split": "train",
"id": 0,
"question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
"template": "how many days have passed since the last time patient 18679317 had a chest x‑ray study indicating any anatomicalfinding in 2105?",
"query": "select 1 * ( strftime(%J,current_time) - strftime(%J,t1.studydatetime) ) from ( select tb_cxr.study_id, tb_cxr.studydatetime from tb_cxr where tb_cxr.study_id in ( select distinct tb_cxr.study_id from tb_cxr where tb_cxr.subject_id = 18679317 and strftime(%Y,tb_cxr.studydatetime) = 2105 ) ) as t1 where func_vqa(\"is the chest x‑ray depicting any anatomical findings?\", t1.study_id) = true",
"value": {"patient_id": 18679317},
"q_tag": "how many [unit_count] have passed since the [time_filter_exact1] time patient {patient_id} had a chest x‑ray study indicating any ${category} [time_filter_global1]?",
"t_tag": ["abs-year-in", "", "", "exact-last", ""],
"o_tag": {"unit_count": {"nlq": "days", "sql": "1 * ", "type": "days", "sql_pattern": "[unit_count]"}},
"v_tag": {"object": [], "category": ["anatomicalfinding"], "attribute": []},
"tag": "how many [unit_count:days] have passed since the [time_filter_exact1:exact-last] time patient {patient_id} had a chest x‑ray study indicating any anatomicalfinding [time_filter_global1:abs-year-in]?",
"para_type": "machine",
"is_impossible": false,
"answer": "Will be generated by dataset_builder/generate_answer.py"
}
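In the example above, the template field is the q_tag with its {patient_id}-style slot filled from the value dictionary. The snippet below illustrates that relationship with a hypothetical slot-filling helper (fill_slots is not part of the released code):

```python
import re

# Hypothetical helper: substitute {name}-style slots in a question
# template with the matching entries from a sample's "value" dict.
def fill_slots(template, values):
    return re.sub(r"\{(\w+)\}", lambda m: str(values[m.group(1)]), template)

q_tag = ("how many [unit_count] have passed since the [time_filter_exact1] "
         "time patient {patient_id} had a chest x-ray study?")
print(fill_slots(q_tag, {"patient_id": 18679317}))
```

The bracketed [unit_count]-style slots are left untouched here; in the real pipeline they are resolved by the o_tag and t_tag selections before the final template is produced.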
Versioning
We adopt semantic versioning; the current version is v1.0.0. Typically we maintain and provide only the latest version. In cases of major updates or the need to validate prior research, we may retain older versions for up to one year. See the CHANGELOG for a detailed list of changes per version.
Contributions
Contributions to improve the dataset’s usability and functionality are welcome. Feel free to fork the repository, make changes, and submit a pull request. For substantial changes, open an issue first to discuss the proposed modifications.
Contact
For any questions or concerns regarding this dataset, please contact us at (seongsu@kaist.ac.kr or kyungdaeun@kaist.ac.kr). We appreciate your interest and are happy to help.
Citation
When using the EHRXQA dataset, please cite the following:
@article{bae2023ehrxqa,
title={EHRXQA: A Multi‑Modal Question Answering Dataset for Electronic Health Records with Chest X‑ray Images},
author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
journal={arXiv preprint arXiv:2310.18652},
year={2023}
}
License
The code in this repository is released under the MIT license. The final EHRXQA dataset generated from this code is subject to the terms of the original PhysioNet source datasets: MIMIC‑CXR‑JPG License, Chest ImaGenome License, and MIMIC‑IV License.