Dataset asset | Open Source Community | Named Entity Recognition | Radiology

RaTE-NER

The RaTE-NER dataset is a large-scale named entity recognition (NER) dataset for radiology. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, spanning nine imaging modalities and 23 anatomical regions. Using GPT-4 and medical knowledge bases, it is further enriched with 33,605 sentences from 17,432 reports on Radiopaedia, capturing the complexity and subtleties of rare diseases and abnormalities. The dataset is provided in two preprocessing formats to support different NER approaches, and its file paths and structure are documented below.

Source: huggingface
Created: Jun 20, 2024
Updated: Jun 21, 2024

Dataset Overview

File Structure

The dataset file structure is as follows:

├── [MIMIC_IV]
│   ├── dev_IOB.json
│   ├── dev_span.json
│   ├── test_IOB.json
│   ├── test_span.json
│   ├── train_IOB.json
│   └── train_span.json
├── [Radiopaedia]
│   ├── dev_IOB.json
│   ├── dev_span.json
│   ├── test_IOB.json
│   ├── test_span.json
│   ├── train_IOB.json
│   └── train_span.json
└── [all]
    ├── dev_IOB.json
    ├── dev_span.json
    ├── test_IOB.json
    ├── test_span.json
    ├── train_IOB.json
    └── train_span.json
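
As a minimal sketch of how the layout above can be addressed programmatically, the helper below builds the relative path for a given directory, split, and format. The function name `rate_ner_path` and the constants are hypothetical and not part of the dataset; the sketch also assumes that the bracketed names in the tree are plain directory names without the brackets.

```python
# Hypothetical helper, not shipped with the dataset.
# Assumption: directory names are the bracketed names in the tree, minus brackets.
SUBSETS = ("MIMIC_IV", "Radiopaedia", "all")
SPLITS = ("train", "dev", "test")
FORMATS = ("IOB", "span")


def rate_ner_path(subset: str, split: str, fmt: str) -> str:
    """Return the relative path of one file, e.g. 'MIMIC_IV/train_IOB.json'."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{subset}/{split}_{fmt}.json"


# Enumerate every file shown in the tree (3 directories x 3 splits x 2 formats).
all_paths = [
    rate_ner_path(subset, split, fmt)
    for subset in SUBSETS
    for split in SPLITS
    for fmt in FORMATS
]
```

Validating the three arguments up front keeps a typo such as "IOB" versus "iob" from silently producing a path that does not exist.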

Each directory (MIMIC_IV, Radiopaedia, all) provides the data in two preprocessing formats to support different NER methods: IOB (Inside, Outside, Beginning) tagging and span-based tagging.

  • IOB preprocessing format includes three fields: id, tokens, ner_tags.
  • Span preprocessing format includes three fields: note_id, sentence, ner.
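
To illustrate how the two formats relate, the sketch below converts an IOB tag sequence into entity spans. It assumes string tags of the form `B-<label>`, `I-<label>`, and `O`, and represents each span as token indices with an exclusive end. The actual encoding of `ner_tags` and the layout of the `ner` field may differ, and the sentence and the label `ABNORMALITY` are made-up placeholders rather than data from the dataset.

```python
from typing import List, Optional, Tuple

# (start token index, exclusive end token index, label) -- an assumed layout.
Span = Tuple[int, int, str]


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Group a sequence of IOB string tags into labeled token spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None

    for i, tag in enumerate(tags):
        # The open entity continues only on an I- tag carrying the same label.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the entity that is currently open, if any.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new entity; a stray I- is treated leniently as a start.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]

    # Close an entity that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Made-up example, not taken from the dataset.
tokens = ["No", "acute", "pleural", "effusion", "."]
tags = ["O", "O", "B-ABNORMALITY", "I-ABNORMALITY", "O"]
example_spans = iob_to_spans(tags)
```

Here `example_spans` holds one span covering tokens 2 and 3 ("pleural effusion"). Tag-based models typically train on the IOB form directly, while span-based models consume the grouped form.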

Usage

# Requires the Hugging Face `datasets` package (pip install datasets).
from datasets import load_dataset

data = load_dataset("Angelakeke/RaTE-NER")

Author

Author: Weike Zhao
For any questions, please contact zwk0629@sjtu.edu.cn.

Citation

If you find the data/paper useful, please consider citing:

@article{zhao2024ratescore,
  title={RaTEScore: A Metric for Radiology Report Generation},
  author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2406.16845},
  year={2024}
}