RaTE-NER
Dataset Overview
Dataset Introduction
The RaTE-NER dataset is a large-scale radiology named entity recognition (NER) dataset. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, covering nine imaging modalities and 23 anatomical regions. Using GPT-4 together with medical knowledge bases, the dataset is further enriched with 33,605 sentences from 17,432 reports on Radiopaedia, which capture the complexity and subtleties of rare diseases and abnormalities. The data is released in two preprocessing formats to support different NER approaches; the file layout is described below.
File Structure
The dataset file structure is as follows:
├── [MIMIC_IV]
│ ├── dev_IOB.json
│ ├── dev_span.json
│ ├── test_IOB.json
│ ├── test_span.json
│ ├── train_IOB.json
│ └── train_span.json
├── [Radiopaedia]
│ ├── dev_span.json
│ ├── dev_IOB.json
│ ├── test_IOB.json
│ ├── test_span.json
│ ├── train_span.json
│ └── train_IOB.json
└── [all]
    ├── dev_IOB.json
    ├── dev_span.json
    ├── test_IOB.json
    ├── test_span.json
    ├── train_IOB.json
    └── train_span.json
Each subset (MIMIC_IV, Radiopaedia, and all) is provided in two preprocessing formats to support different NER methods: an IOB (Inside, Outside, Beginning) tag-based format and a span-based format.
- The IOB preprocessing format includes three fields: id, tokens, ner_tags.
- The span preprocessing format includes three fields: note_id, sentence, ner.
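To illustrate how the two formats relate, the following is a minimal sketch that groups IOB-tagged tokens into entity spans. It rests on assumptions the card does not confirm: tags are taken to be strings such as "B-X" / "I-X" / "O" (the actual ner_tags values may be integer ids that need mapping first), the label names in the example are placeholders, and the exact layout of the ner field in the span files is not reproduced here. The helper name iob_to_spans is hypothetical.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, label) pairs.

    Assumes string tags of the form "B-<label>", "I-<label>" or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and skip this token.
            flush()

    flush()  # an entity may run to the end of the sentence
    return spans


# Placeholder sentence and labels, not taken from the dataset.
example_tokens = ["Small", "left", "pleural", "effusion", "."]
example_tags = ["O", "B-ANATOMY", "I-ANATOMY", "B-ABNORMALITY", "O"]
example_spans = iob_to_spans(example_tokens, example_tags)
```

A token-level format like this suits sequence-labelling models, while a span-level view is more convenient for span-classification or extraction-style approaches.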
Usage
from datasets import load_dataset
data = load_dataset("Angelakeke/RaTE-NER")
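To work with a single subset and format rather than the whole repository, one option is to pass explicit file paths through the data_files argument of load_dataset. The sketch below assumes that the bracketed names in the file structure (MIMIC_IV, Radiopaedia, all) are directory names inside the repository and that the JSON files can be read by the default JSON loader; neither has been verified here. The helper names build_data_files and load_subset are hypothetical.

```python
from typing import Dict

SUBSETS = ("MIMIC_IV", "Radiopaedia", "all")
FORMATS = ("IOB", "span")
# Map conventional split names to the file-name stems used in the repository.
SPLIT_STEMS = {"train": "train", "validation": "dev", "test": "test"}


def build_data_files(subset: str, fmt: str) -> Dict[str, str]:
    """Return a split -> relative path mapping, e.g. 'MIMIC_IV/train_IOB.json'."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return {
        split: f"{subset}/{stem}_{fmt}.json"
        for split, stem in SPLIT_STEMS.items()
    }


def load_subset(subset: str, fmt: str):
    """Load one subset/format pair from the Hub (requires network access)."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "Angelakeke/RaTE-NER",
        data_files=build_data_files(subset, fmt),
    )
```

For example, load_subset("MIMIC_IV", "IOB") would request only the three IOB files under the MIMIC_IV directory, assuming the layout matches the tree above.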
Author
Author: Weike Zhao
For any questions, please contact zwk0629@sjtu.edu.cn.
Citation
If you find the data/paper useful, please consider citing:
@article{zhao2024ratescore,
  title={RaTEScore: A Metric for Radiology Report Generation},
  author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2406.16845},
  year={2024}
}