
SemCor and Masc documents

Contains SemCor and Masc documents annotated with NOAD word senses for evaluating word‑sense disambiguation systems. Data are in XML format with detailed part‑of‑speech tags and segmentation information.

Source: github
Created: Dec 1, 2016
Updated: Jan 5, 2024

word_sense_disambiguation_corpora

Dataset Overview

Contents

  • SemCor and Masc documents, annotated with NOAD (New Oxford American Dictionary) word senses.

File Format

  • XML format, following the simple-wsd-doc.dtd DTD.

Part‑of‑Speech Tags

  • Punctuation: .
  • Adjective: ADJ
  • Adposition: ADP
  • Adverb: ADV
  • Conjunction: CONJ
  • Determiner: DET
  • Noun: NOUN
  • Numeral: NUM
  • Pronoun: PRON
  • Particle: PRT
  • Verb: VERB
  • Other: X
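As a minimal sketch of how the tag set above might be used, the following Python snippet counts part-of-speech tags in a document and reports any tag outside the listed set. The element name `word` and the attribute names `text` and `pos` are assumptions made for illustration; they are not taken from simple-wsd-doc.dtd, so check the DTD and adjust the parameters before using this on the real files. The sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The twelve part-of-speech tags listed above.
POS_TAGS = {".", "ADJ", "ADP", "ADV", "CONJ", "DET",
            "NOUN", "NUM", "PRON", "PRT", "VERB", "X"}


def count_pos_tags(xml_text, token_element="word", pos_attribute="pos"):
    """Count POS tags in an XML string.

    token_element and pos_attribute are ASSUMED names, not read from the DTD.
    Returns (counts of known tags, set of tags outside POS_TAGS).
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    unknown = set()
    for token in root.iter(token_element):
        tag = token.get(pos_attribute)
        if tag is None:
            continue  # token carries no POS attribute
        if tag in POS_TAGS:
            counts[tag] += 1
        else:
            unknown.add(tag)
    return counts, unknown


# Hypothetical sample; the real corpus files may be structured differently.
SAMPLE = """<document>
  <word text="The" pos="DET"/>
  <word text="bank" pos="NOUN"/>
  <word text="closed" pos="VERB"/>
  <word text="." pos="."/>
</document>"""

counts, unknown = count_pos_tags(SAMPLE)
print(dict(counts), unknown)
```

Collecting unknown tags separately, rather than raising an error, makes it easier to spot a mismatch between the assumed attribute name and the actual files.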

Tokenization Levels

  • No segmentation: NO_BREAK
  • Space segmentation: SPACE_BREAK
  • Line segmentation: LINE_BREAK
  • Sentence segmentation: SENTENCE_BREAK
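The four segmentation values above can be used to rebuild running text and to group tokens into sentences. The sketch below assumes that each value describes the break that precedes its token, and that a sentence break is rendered as a single space; both are assumptions, not statements from the dataset documentation. Tokens are represented as plain `(text, break_level)` tuples, a hypothetical in-memory form rather than the XML itself.

```python
# Separator inserted BEFORE a token, keyed by its break level (assumed semantics).
SEPARATORS = {
    "NO_BREAK": "",
    "SPACE_BREAK": " ",
    "LINE_BREAK": "\n",
    "SENTENCE_BREAK": " ",  # assumption: sentences are joined with one space
}


def join_tokens(tokens):
    """Rebuild text from (text, break_level) pairs."""
    parts = []
    for index, (text, break_level) in enumerate(tokens):
        if break_level not in SEPARATORS:
            raise ValueError(f"unknown break level: {break_level!r}")
        if index > 0:  # nothing precedes the first token
            parts.append(SEPARATORS[break_level])
        parts.append(text)
    return "".join(parts)


def split_sentences(tokens):
    """Start a new sentence at every token marked SENTENCE_BREAK."""
    sentences, current = [], []
    for text, break_level in tokens:
        if break_level == "SENTENCE_BREAK" and current:
            sentences.append(current)
            current = []
        current.append((text, break_level))
    if current:
        sentences.append(current)
    return sentences


# Invented example tokens.
tokens = [
    ("It", "NO_BREAK"), ("rained", "SPACE_BREAK"), (".", "NO_BREAK"),
    ("We", "SENTENCE_BREAK"), ("left", "SPACE_BREAK"), (".", "NO_BREAK"),
]
print(join_tokens(tokens))
print([join_tokens(sentence) for sentence in split_sentences(tokens)])
```

If the corpus turns out to record the break after each token instead, the separator lookup would move to the following iteration, but the mapping table stays the same.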

Sense Mapping

  • manual_map.txt: manually created sense mapping.
  • algorithmic_map.txt: algorithmically generated sense mapping.

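Mapping format: each line holds one NOAD word sense, a tab character, and the corresponding WordNet word senses separated by commas (NOAD_word_sense\tWordNet_word_senses).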
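A mapping file in this format can be read with a few lines of Python. The following is a minimal sketch: it assumes one mapping per line with no header row, skips blank lines, and uses invented sense identifiers in the sample, since the actual identifier scheme is not described here.

```python
def parse_sense_map(lines):
    """Parse lines of 'NOAD_sense<TAB>wn_sense_1,wn_sense_2,...'.

    Returns a dict mapping each NOAD sense to a list of WordNet senses.
    """
    mapping = {}
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        noad_sense, separator, wordnet_field = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: missing tab separator")
        wordnet_senses = [s.strip() for s in wordnet_field.split(",") if s.strip()]
        # Merge repeated NOAD senses instead of overwriting earlier entries.
        mapping.setdefault(noad_sense.strip(), []).extend(wordnet_senses)
    return mapping


# Placeholder identifiers for illustration only.
SAMPLE_LINES = [
    "noad_sense_a\twn_sense_1,wn_sense_2\n",
    "noad_sense_b\twn_sense_3\n",
]
sense_map = parse_sense_map(SAMPLE_LINES)
print(sense_map)
```

To read a real file, pass an open file object, for example `parse_sense_map(open("manual_map.txt", encoding="utf-8"))`; the UTF-8 encoding is an assumption.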

Data Accuracy

  • The annotations were collected through a crowdsourcing platform and are not guaranteed to be 100% accurate.
