
MedNorm corpus

The MedNorm corpus is a dataset and embedding collection for cross‑terminology medical concept normalization. It combines instances from multiple datasets and maps each of them consistently to both MedDRA and SNOMED‑CT terms.

Source: GitHub
Created: Jun 3, 2019
Updated: Aug 27, 2022

Dataset Overview

Dataset Name

  • MedNorm Corpus

Dataset Purpose

  • Combine multiple datasets to provide consistent simultaneous mappings to MedDRA and SNOMED‑CT terminologies.
  • Generate a corpus graph and cross‑terminology concept embeddings.
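
As a rough illustration of what a simultaneous mapping means, each phrase can be thought of as carrying labels from both terminologies at once. The sketch below is an assumption made for illustration only: the class name, field names, and placeholder values are hypothetical and do not reflect the actual MedNorm file layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MappedPhrase:
    """Hypothetical record: one phrase labelled in both terminologies."""

    phrase: str
    source_dataset: str  # one of the merged datasets, e.g. "CADEC"
    meddra_codes: List[str] = field(default_factory=list)
    sct_ids: List[str] = field(default_factory=list)

    def is_single_label(self) -> bool:
        """True when the phrase has exactly one label in each terminology."""
        return len(self.meddra_codes) == 1 and len(self.sct_ids) == 1


# Placeholder values only; these are not real codes or corpus entries.
example = MappedPhrase(
    phrase="placeholder phrase",
    source_dataset="CADEC",
    meddra_codes=["MEDDRA_PLACEHOLDER"],
    sct_ids=["SCT_PLACEHOLDER"],
)
```

Keeping both label lists on one record makes it easy to check whether a phrase is covered in both terminologies.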

Dataset Content

  • Contains instances from several datasets, specifically:
    • CADEC
    • TwADR‑L
    • TwiMed‑PubMed
    • TwiMed‑Twitter
    • SMM4H2017‑train
    • SMM4H2017‑test
    • TAC2017_ADR
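
To see how many instances each source dataset contributes, one could tally rows in a tab-separated file. This is a minimal sketch under stated assumptions: the `source` column name and the header row are hypothetical, so check the real file's columns before using it.

```python
import csv
import io
from collections import Counter


def count_by_source(tsv_file, source_column="source"):
    """Tally rows per source dataset in a TSV file that has a header row."""
    # QUOTE_NONE treats quote characters literally, which is safer for
    # social-media text that may contain unbalanced quotation marks.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[source_column] for row in reader)


# Tiny in-memory stand-in for a real file (placeholder rows, not real data).
sample = io.StringIO(
    "phrase\tsource\n"
    "placeholder phrase 1\tCADEC\n"
    "placeholder phrase 2\tTwADR-L\n"
    "placeholder phrase 3\tCADEC\n"
)
counts = count_by_source(sample)
```

For a file on disk, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.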

Data Processing Steps

  1. Dataset Merging

    • Use the dataset.py combine command to merge the sets, producing the mednorm_raw.tsv file.
    • Result: 30,246 lines.
  2. Build Initial Corpus Graph

    • Use dataset.py build_graph to construct the graph representation.
  3. Build Concept Embedding Model

    • Use dataset.py build_embeddings to generate the embedding model.
  4. Identify Potential Annotation Errors

    • Use dataset.py unrelated_annotations and dataset.py ambiguous_tokens to locate potential annotation errors.
  5. Correct Annotation Errors

    • Use dataset.py human_correct for manual correction.
  6. Build Final Graph Representation

    • Use dataset.py build_graph again on the corrected data.
  7. Generate TSV Dataset

    • Use dataset.py tsv to produce mednorm_mapped_draft.tsv.
    • Result: 27,979 lines.
  8. Resolve Phrase Duplicates

    • Use dataset.py resolve_dups to handle duplicate phrases.
    • Changes: 6,667 rows modified.
  9. Single‑Label Simplification

    • Use dataset.py reduce to collapse to single labels.
    • Outcome: 2,080 single‑label MedDRA codes and 2,100 single‑label SNOMED‑CT (SCT) IDs.
  10. Filtering

    • Use dataset.py filter for data filtering.
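
The steps above can be sketched as a small driver that calls the listed subcommands in order. The subcommand names come from the list; everything else is an assumption. The source does not say which arguments each subcommand takes, so the hypothetical `extra_args` mapping is left empty, and the driver defaults to a dry run that only prints the commands. Note that `human_correct` is a manual step, so a real run would pause there for input.

```python
import subprocess
import sys

# Subcommand names in the order listed above; step 4 uses two subcommands.
PIPELINE_STEPS = [
    "combine",
    "build_graph",
    "build_embeddings",
    "unrelated_annotations",
    "ambiguous_tokens",
    "human_correct",
    "build_graph",  # rebuilt on the corrected data
    "tsv",
    "resolve_dups",
    "reduce",
    "filter",
]


def build_commands(script="dataset.py", extra_args=None):
    """Return one argv list per step.

    extra_args (hypothetical) maps a subcommand name to a list of extra
    arguments; the same list is applied to every occurrence of that name.
    """
    extra_args = extra_args or {}
    return [
        [sys.executable, script, step, *extra_args.get(step, [])]
        for step in PIPELINE_STEPS
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


commands = build_commands()
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Using `check=True` means a failing step aborts the run, so later steps never operate on stale intermediate files.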

Citation Information

  • Citation: Belousov, Maksim, et al. "MedNorm: A Corpus and Embeddings for Cross‑terminology Medical Concept Normalisation." Proceedings of the Fourth Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task, 2019, pp. 31‑39.