MedNorm corpus
The MedNorm corpus is a dataset and embedding collection for cross‑terminology medical concept normalization, which combines instances from multiple datasets and provides consistent simultaneous mappings to MedDRA and SNOMED‑CT terms.
Description
Dataset Overview
Dataset Name
- MedNorm Corpus
Dataset Purpose
- Combine multiple datasets to provide consistent simultaneous mappings to MedDRA and SNOMED‑CT terminologies.
- Generate a corpus graph and cross‑terminology concept embeddings.
Dataset Content
- Contains instances from several datasets, specifically:
  - CADEC
  - TwADR‑L
  - TwiMed‑PubMed
  - TwiMed‑Twitter
  - SMM4H2017‑train
  - SMM4H2017‑test
  - TAC2017_ADR
Data Processing Steps
Data Set Merging
- Use the dataset.py combine command to merge the sets, producing the mednorm_raw.tsv file.
- Result: 30,246 lines.
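As a rough illustration of what a merge step like this involves, the sketch below concatenates annotated instances from several sources into one tab-separated table and tags each row with its origin. The column names (source, phrase, code) and the sample rows are assumptions made for the example; they are not taken from dataset.py or from mednorm_raw.tsv.

```python
import csv
import io


def combine(sources):
    """Merge per-dataset rows into one TSV string, tagging each row with its source."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "phrase", "code"])  # hypothetical header
    for name, rows in sources.items():
        for phrase, code in rows:
            writer.writerow([name, phrase, code])
    return buf.getvalue()


# Made-up placeholder rows; real instances come from CADEC, TwADR-L, and so on.
sources = {
    "CADEC": [("felt dizzy", "CODE_A")],
    "TwADR-L": [("cant sleep", "CODE_B"), ("felt dizzy", "CODE_A")],
}
merged = combine(sources)
print(len(merged.splitlines()) - 1)  # number of data rows, header excluded
```

Keeping the source name on every row makes it possible to trace a merged instance back to the dataset it came from.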
Build Initial Corpus Graph
- Use dataset.py build_graph to construct the graph representation.
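One simple way to picture a corpus graph is as a bipartite graph that links phrases to the concept codes they were annotated with. The sketch below builds such a graph with plain dictionaries. The node naming and the sample annotations are assumptions, and the representation actually produced by dataset.py build_graph may differ.

```python
from collections import defaultdict


def build_graph(annotations):
    """Return an undirected adjacency map linking phrase nodes and code nodes."""
    graph = defaultdict(set)
    for phrase, code in annotations:
        p, c = ("phrase", phrase), ("code", code)
        graph[p].add(c)
        graph[c].add(p)
    return graph


# Hypothetical annotations with placeholder codes.
annotations = [
    ("felt dizzy", "CODE_A"),
    ("dizziness", "CODE_A"),
    ("felt dizzy", "CODE_B"),
]
graph = build_graph(annotations)
# Phrases that share a code are two hops apart, connected through that code node.
neighbours = sorted(graph[("code", "CODE_A")])
print(neighbours)
```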
Build Concept Embedding Model
- Use dataset.py build_embeddings to generate the embedding model.
Identify Potential Annotation Errors
- Use dataset.py unrelated_annotations and dataset.py ambiguous_tokens to analyze and locate errors.
Correct Annotation Errors
- Use dataset.py human_correct for manual correction.
Build Final Graph Representation
- Use dataset.py build_graph again on the corrected data.
Generate TSV Dataset
- Use dataset.py tsv to produce mednorm_mapped_draft.tsv.
- Result: 27,979 lines.
Resolve Phrase Duplicates
- Use dataset.py resolve_dups to handle duplicate phrases.
- Changes: 6,667 rows modified.
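The page does not describe how duplicates are resolved. As one possible strategy, the sketch below gives every occurrence of a phrase the code most frequently assigned to it and counts how many rows change. Majority voting and the (phrase, code) row layout are assumptions for illustration, not the documented behaviour of dataset.py resolve_dups.

```python
from collections import Counter, defaultdict


def resolve_dups(rows):
    """Relabel each phrase with its most frequent code; return new rows and change count."""
    votes = defaultdict(Counter)
    for phrase, code in rows:
        votes[phrase][code] += 1
    # most_common(1) picks the top code; ties go to the code encountered first.
    winner = {p: c.most_common(1)[0][0] for p, c in votes.items()}
    resolved = [(p, winner[p]) for p, _ in rows]
    changed = sum(1 for old, new in zip(rows, resolved) if old != new)
    return resolved, changed


# Hypothetical rows with placeholder codes.
rows = [
    ("felt dizzy", "CODE_A"),
    ("felt dizzy", "CODE_A"),
    ("felt dizzy", "CODE_B"),
    ("cant sleep", "CODE_C"),
]
resolved, changed = resolve_dups(rows)
print(changed)  # rows whose code was rewritten
```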
Single‑Label Simplification
- Use dataset.py reduce to collapse to single labels.
- Outcome: 2,080 single‑label MedDRA codes, 2,100 single‑label SCT IDs.
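To make the idea of collapsing multi-label instances concrete, the sketch below keeps one code per instance by choosing the candidate that is most frequent across the whole collection, then counts the distinct codes that remain. The selection rule, the data layout, and the placeholder codes are assumptions; the page does not state which rule dataset.py reduce applies.

```python
from collections import Counter


def reduce_to_single(instances):
    """Keep one code per instance: the candidate that is most frequent corpus-wide."""
    freq = Counter(code for codes in instances.values() for code in codes)
    # Break frequency ties alphabetically so the result is deterministic.
    return {
        phrase: min(codes, key=lambda c: (-freq[c], c))
        for phrase, codes in instances.items()
    }


# Hypothetical multi-label instances.
instances = {
    "felt dizzy": ["CODE_A", "CODE_B"],
    "dizziness": ["CODE_A"],
    "cant sleep": ["CODE_C", "CODE_B"],
}
single = reduce_to_single(instances)
print(len(set(single.values())))  # distinct single-label codes
```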
Filtering
- Use dataset.py filter for data filtering.
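The order of the steps above can be captured in a small script. The sketch below only builds the command lines and prints them; it does not run anything. The subcommand names come from the steps listed here, but whether each one needs additional arguments (input paths, output paths, options) is not stated on this page, so treat the bare invocations as an assumption.

```python
import sys

# Subcommand names as listed in the processing steps; build_graph appears twice
# because the graph is rebuilt after manual correction.
PIPELINE = [
    "combine", "build_graph", "build_embeddings",
    "unrelated_annotations", "ambiguous_tokens", "human_correct",
    "build_graph", "tsv", "resolve_dups", "reduce", "filter",
]


def commands(script="dataset.py"):
    """Build one argv list per step; any extra arguments are not documented here."""
    return [[sys.executable, script, sub] for sub in PIPELINE]


for argv in commands():
    print(" ".join(argv[1:]))
```

Each argv list could be passed to subprocess.run once the required arguments for the corresponding subcommand are known.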
Dataset Access
- The corpus and embeddings are available at: https://doi.org/10.17632/b9x7xxb9sz.1
Citation Information
- Citation: Belousov, Maksim, et al. "MedNorm: A Corpus and Embeddings for Cross‑terminology Medical Concept Normalisation." Proceedings of the Fourth Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task, 2019, pp. 31‑39.
Source
Organization: github
Created: 6/3/2019