RaTE-NER
RaTE-NER is a large-scale radiology named entity recognition (NER) dataset. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, covering nine imaging modalities and 23 anatomical regions. Using GPT-4 and medical knowledge bases, the dataset is further enriched with 33,605 sentences from 17,432 reports in Radiopaedia, which capture the complexity and subtleties of rare diseases and abnormalities. The data is released in two preprocessed formats to support different NER approaches; the file layout is described under File Structure.
Description
Dataset Overview
File Structure
The dataset file structure is as follows:
├── [MIMIC_IV]
│ ├── dev_IOB.json
│ ├── dev_span.json
│ ├── test_IOB.json
│ ├── test_span.json
│ ├── train_IOB.json
│ └── train_span.json
├── [Radiopaedia]
│ ├── dev_span.json
│ ├── dev_IOB.json
│ ├── test_IOB.json
│ ├── test_span.json
│ ├── train_span.json
│ └── train_IOB.json
└── [all]
  ├── dev_IOB.json
  ├── dev_span.json
  ├── test_IOB.json
  ├── test_span.json
  ├── train_IOB.json
  └── train_span.json
Each subset is provided in two preprocessed formats to support different NER methods: IOB (Inside, Outside, Beginning) tagging and span-based tagging.
- The IOB format has three fields: id, tokens, ner_tags.
- The span format has three fields: note_id, sentence, ner.
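To illustrate how the two formats relate, here is a minimal sketch that collapses a sequence of IOB tags into entity spans. It assumes ner_tags holds string tags such as "B-Anatomy", "I-Anatomy", and "O"; the card does not specify the tag encoding or the exact layout of the ner field, so the label names and the (start, end, label) tuple shape below are illustrative, and iob_to_spans is a hypothetical helper rather than part of the dataset tooling.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse IOB tags into (start, end_inclusive, label) token spans.

    Assumes string tags of the form "B-<label>", "I-<label>", or "O".
    """
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label as the open span just extends it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if there is one.
        if label is not None:
            spans.append((start, i - 1, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated leniently as a span start.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tags) - 1, label))
    return spans


# Illustrative tokens and tags, not taken from the dataset.
tokens = ["left", "lower", "lobe", "shows", "pneumonia"]
tags = ["B-Anatomy", "I-Anatomy", "I-Anatomy", "O", "B-Disease"]
for start, end, label in iob_to_spans(tags):
    print(label, " ".join(tokens[start:end + 1]))
```

If ner_tags turns out to be integer-encoded, the integers would first need to be mapped to tag strings before calling the helper.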
Usage
from datasets import load_dataset
data = load_dataset("Angelakeke/RaTE-NER")
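To load a single subset in a single format, one option is to pass explicit file paths through the data_files argument of load_dataset. The sketch below assumes that the bracketed names in the file tree (MIMIC_IV, Radiopaedia, all) are plain directory names in the repository and that the JSON files can be parsed by the default loader; split_files and load_subset are hypothetical helpers, not part of the dataset.

```python
SUBSETS = ("MIMIC_IV", "Radiopaedia", "all")
SPLITS = ("train", "dev", "test")
FORMATS = ("IOB", "span")


def split_files(subset: str, fmt: str) -> dict:
    """Map each split name to its repo-relative JSON path for one subset and format."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    # Assumed layout: <subset>/<split>_<format>.json, as in the file tree.
    return {split: f"{subset}/{split}_{fmt}.json" for split in SPLITS}


def load_subset(subset: str = "all", fmt: str = "IOB"):
    """Load the train/dev/test splits of one subset in one format from the Hub."""
    # Imported here so that split_files can be used without the datasets library.
    from datasets import load_dataset

    return load_dataset("Angelakeke/RaTE-NER", data_files=split_files(subset, fmt))


print(split_files("MIMIC_IV", "span"))
```

Calling load_subset("MIMIC_IV", "span") would then download only the three span files of the MIMIC-IV subset, provided the layout assumption holds.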
Author
Author: Weike Zhao
For any questions, please contact zwk0629@sjtu.edu.cn.
Citation
If you find the data/paper useful, please consider citing:
@article{zhao2024ratescore,
title={RaTEScore: A Metric for Radiology Report Generation},
author={Zhao, Weike and Wu, Chaoyi and Zhang, Xiaoman and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
journal={arXiv preprint arXiv:2406.16845},
year={2024}
}
Source
Organization: huggingface
Created: 6/20/2024