Dataset asset: Open Source Community, Corpus, Medical Terminology Normalization
MedNorm corpus
The MedNorm corpus is a dataset and embedding collection for cross‑terminology medical concept normalization, which combines instances from multiple datasets and provides consistent simultaneous mappings to MedDRA and SNOMED‑CT terms.
Source: GitHub
Created: Jun 3, 2019
Updated: Aug 27, 2022
Dataset Overview
Dataset Name
- MedNorm Corpus
Dataset Purpose
- Combine multiple datasets to provide consistent simultaneous mappings to MedDRA and SNOMED‑CT terminologies.
- Generate a corpus graph and cross‑terminology concept embeddings.
Dataset Content
- Contains instances from the following datasets:
  - CADEC
  - TwADR‑L
  - TwiMed‑PubMed
  - TwiMed‑Twitter
  - SMM4H2017‑train
  - SMM4H2017‑test
  - TAC2017_ADR
Data Processing Steps
- Data Set Merging
  - Use the `dataset.py combine` command to merge the sets, producing the `mednorm_raw.tsv` file.
  - Result: 30,246 lines.
- Build Initial Corpus Graph
  - Use `dataset.py build_graph` to construct the graph representation.
- Build Concept Embedding Model
  - Use `dataset.py build_embeddings` to generate the embedding model.
- Identify Potential Annotation Errors
  - Use `dataset.py unrelated_annotations` and `dataset.py ambiguous_tokens` to analyze and locate errors.
- Correct Annotation Errors
  - Use `dataset.py human_correct` for manual correction.
- Build Final Graph Representation
  - Use `dataset.py build_graph` again on the corrected data.
- Generate TSV Dataset
  - Use `dataset.py tsv` to produce `mednorm_mapped_draft.tsv`.
  - Result: 27,979 lines.
- Resolve Phrase Duplicates
  - Use `dataset.py resolve_dups` to handle duplicate phrases.
  - Changes: 6,667 rows modified.
- Single‑Label Simplification
  - Use `dataset.py reduce` to collapse to single labels.
  - Outcome: 2,080 single‑label MedDRA codes, 2,100 single‑label SCT IDs.
- Filtering
  - Use `dataset.py filter` for data filtering.
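The order of the processing steps can be captured in a short script. The following is a minimal sketch, not the repository's own tooling. It assumes each subcommand is invoked as `python dataset.py <subcommand>` with no further arguments, which the source does not confirm (the real commands may require input paths or options). The names `PIPELINE_STEPS` and `build_commands` are hypothetical. The script only builds and prints the command lines and does not execute them.

```python
import shlex
import sys

# Subcommand names in the order listed in the processing steps.
# build_graph appears twice: before and after manual correction.
PIPELINE_STEPS = [
    "combine",
    "build_graph",
    "build_embeddings",
    "unrelated_annotations",
    "ambiguous_tokens",
    "human_correct",
    "build_graph",
    "tsv",
    "resolve_dups",
    "reduce",
    "filter",
]


def build_commands(script="dataset.py", python=sys.executable):
    """Return one argv list per pipeline step (assumed invocation form)."""
    return [[python, script, step] for step in PIPELINE_STEPS]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_commands():
        print(shlex.join(argv))
```

Keeping the steps in a plain list makes the repeated `build_graph` call explicit and leaves room to attach whatever arguments the actual commands turn out to need.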
Dataset Access
- The corpus and embeddings are available at: https://doi.org/10.17632/b9x7xxb9sz.1
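A quick sanity check after obtaining the files is to compare a file's line count with the figures reported in the processing steps (30,246 lines for `mednorm_raw.tsv`, 27,979 lines for `mednorm_mapped_draft.tsv`). The sketch below is an illustration only. It assumes a local, UTF-8 encoded copy of one of those TSV files, and it makes no assumption about column names or header rows. Whether the published archive contains these intermediate files is not stated in the source, and the helper names `count_lines` and `relative_change` are hypothetical.

```python
from pathlib import Path

# Line counts reported in the processing steps.
REPORTED_LINES = {
    "mednorm_raw.tsv": 30_246,
    "mednorm_mapped_draft.tsv": 27_979,
}


def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def relative_change(before, after):
    """Fractional change from before to after (negative means a reduction)."""
    return (after - before) / before


if __name__ == "__main__":
    for name, expected in REPORTED_LINES.items():
        path = Path(name)
        if path.exists():
            actual = count_lines(path)
            status = "matches" if actual == expected else "differs from"
            print(f"{name}: {actual} lines ({status} reported {expected})")
        else:
            print(f"{name}: not found locally, skipped")

    raw = REPORTED_LINES["mednorm_raw.tsv"]
    draft = REPORTED_LINES["mednorm_mapped_draft.tsv"]
    change = relative_change(raw, draft)
    print(f"raw to draft: {raw - draft} fewer lines ({change:.1%})")
```

Taken at face value, the reported counts mean the draft TSV has 2,267 fewer lines than the raw merge, a reduction of roughly 7.5%.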
Citation Information
- Citation: Belousov, Maksim, et al. "MedNorm: A Corpus and Embeddings for Cross‑terminology Medical Concept Normalisation." Proceedings of the Fourth Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task, 2019, pp. 31‑39.