RaTE-NER

Updated 6/21/2024

Description

Dataset Overview

Dataset Introduction

RaTE-NER is a large-scale radiology named entity recognition (NER) dataset. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, covering nine imaging modalities and 23 anatomical regions. Using GPT-4 and medical knowledge bases, the dataset is further enriched with 33,605 sentences from 17,432 reports on Radiopaedia, which capture the complexity and subtleties of rare diseases and abnormalities. The data is provided in two preprocessing formats to support different NER approaches.

File Structure

The dataset file structure is as follows:

├── [MIMIC_IV]
│   ├── dev_IOB.json
│   ├── dev_span.json
│   ├── test_IOB.json
│   ├── test_span.json
│   ├── train_IOB.json
│   └── train_span.json
├── [Radiopaedia]
│   ├── dev_IOB.json
│   ├── dev_span.json
│   ├── test_IOB.json
│   ├── test_span.json
│   ├── train_IOB.json
│   └── train_span.json
└── [all]
    ├── dev_IOB.json
    ├── dev_span.json
    ├── test_IOB.json
    ├── test_span.json
    ├── train_IOB.json
    └── train_span.json
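
The naming pattern in the tree above can be sketched as a small path helper. This is only an illustration: the function name rate_ner_path is hypothetical, and it assumes the bracketed entries are plain directory names (MIMIC_IV, Radiopaedia, all) and that paths are relative to the dataset root.

```python
from itertools import product

# Names read off the directory tree; brackets are assumed to be notation only.
SOURCES = ("MIMIC_IV", "Radiopaedia", "all")
SPLITS = ("train", "dev", "test")
FORMATS = ("IOB", "span")


def rate_ner_path(source: str, split: str, fmt: str) -> str:
    """Return the root-relative path of one file, e.g. 'MIMIC_IV/train_IOB.json'.

    Hypothetical helper, not part of the dataset or of any library.
    """
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{source}/{split}_{fmt}.json"


# Enumerate every file implied by the tree: 3 sources x 3 splits x 2 formats.
all_paths = [rate_ner_path(s, p, f) for s, p, f in product(SOURCES, SPLITS, FORMATS)]
print(len(all_paths))
print(rate_ner_path("Radiopaedia", "dev", "span"))
```

Validating the three components up front keeps a typo such as "val" instead of "dev" from silently producing a path that does not exist.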

Each data source (MIMIC_IV, Radiopaedia, and all) is provided in two preprocessing formats to support different NER methods: an IOB (Inside, Outside, Beginning) tag-based format and a span-based format.

The IOB format includes three fields: id, tokens, and ner_tags.
The span format includes three fields: note_id, sentence, and ner.
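
To make the relationship between the two formats concrete, here is a minimal sketch that turns span annotations into IOB tags. The card only names the fields, so the details are assumptions: each span is taken to be a (start, end, label) triple over token indices with an inclusive end, and the label ABNORMALITY and the sample sentence are invented for illustration. The actual layout of the ner field may differ.

```python
def spans_to_iob(tokens, spans):
    """Convert token-level spans into one IOB tag per token.

    Assumes each span is (start, end, label) with an inclusive end index;
    this layout is an assumption, not taken from the dataset files.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start <= end < len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical record: the sentence, span, and label name are made up.
tokens = ["No", "pleural", "effusion", "is", "seen", "."]
spans = [(1, 2, "ABNORMALITY")]

iob = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, iob):
    print(f"{token}\t{tag}")
```

Span annotations can represent nested or overlapping entities, while a single IOB sequence cannot, which is one reason a dataset might ship both formats; in this sketch a later span simply overwrites an earlier overlapping one.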

Usage

# Requires the Hugging Face `datasets` library.
from datasets import load_dataset

data = load_dataset("Angelakeke/RaTE-NER")

Author

Author: Weike Zhao

For any questions, please contact zwk0629@sjtu.edu.cn.

Citation

If you find the data/paper useful, please consider citing:

@article{zhao2024ratescore,
  title={RaTEScore: A Metric for Radiology Report Generation},
  author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2406.16845},
  year={2024}
}

Topics

Radiology

Named Entity Recognition

Source

Organization: huggingface

Created: 6/20/2024
