SemCor and MASC documents

Contains SemCor and MASC documents annotated with NOAD (New Oxford American Dictionary) word senses, intended for evaluating word-sense disambiguation systems. The data are in XML format and include detailed part-of-speech tags and segmentation information.

Source

github

Created

Dec 1, 2016

Updated

Jan 5, 2024

word_sense_disambiguation_corpora

Dataset Overview

SemCor and MASC documents, annotated with NOAD (New Oxford American Dictionary) word senses.

File Format

XML format, following the simple-wsd-doc.dtd DTD.

Tokenization Levels

No segmentation: NO_BREAK
Space segmentation: SPACE_BREAK
Line segmentation: LINE_BREAK
Sentence segmentation: SENTENCE_BREAK
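
The four break levels above can be used to rebuild readable text from the token stream. The following is a minimal sketch, not the dataset's official reader. It assumes that each token is a `word` element carrying `text` and `break_level` attributes, and that `break_level` describes the break between a token and the one before it. None of these names are confirmed by this page, so check them against simple-wsd-doc.dtd before relying on them. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical sample; element and attribute names are assumptions.
SAMPLE = """
<document>
  <word text="Dogs" break_level="NO_BREAK"/>
  <word text="bark" break_level="SPACE_BREAK"/>
  <word text="." break_level="NO_BREAK"/>
  <word text="Cats" break_level="SENTENCE_BREAK"/>
  <word text="purr" break_level="SPACE_BREAK"/>
  <word text="." break_level="NO_BREAK"/>
</document>
"""


def split_sentences(xml_text: str) -> List[str]:
    """Group tokens into sentences, treating break_level as the break
    that precedes each token (an assumption about the format)."""
    root = ET.fromstring(xml_text)
    sentences: List[str] = []
    current = ""
    for word in root.iter("word"):
        level = word.get("break_level", "SPACE_BREAK")
        text = word.get("text", "")
        if level == "SENTENCE_BREAK" and current:
            # Close the running sentence and start a new one.
            sentences.append(current)
            current = text
        elif level == "NO_BREAK" or not current:
            # Attach directly, e.g. punctuation or the very first token.
            current += text
        elif level == "LINE_BREAK":
            current += "\n" + text
        else:
            # SPACE_BREAK, or any unrecognized level, becomes one space.
            current += " " + text
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(SAMPLE):
        print(sentence)
```

Treating an unrecognized level as a single space is a design choice made for this sketch; a stricter reader might raise an error instead.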

Sense Mapping

manual_map.txt: manually created sense mapping.
algorithmic_map.txt: algorithmically generated sense mapping.

Mapping format: each line is NOAD_word_sense\tWordNet_word_senses, that is, one NOAD sense, a tab, then one or more WordNet senses separated by commas.
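
The tab-separated mapping format described above can be loaded into a dictionary as follows. This is a minimal sketch: the function name `parse_sense_map` and the sense identifiers in the sample are hypothetical placeholders, not real entries from manual_map.txt or algorithmic_map.txt.

```python
from typing import Dict, Iterable, List


def parse_sense_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse lines of the form 'NOAD_sense<TAB>wn_sense_1,wn_sense_2,...'."""
    mapping: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only; everything after it is the WordNet list.
        noad_sense, _, wordnet_part = line.partition("\t")
        mapping[noad_sense.strip()] = [
            sense.strip() for sense in wordnet_part.split(",") if sense.strip()
        ]
    return mapping


if __name__ == "__main__":
    # Hypothetical sample lines; real identifiers will look different.
    sample = ["noad_sense_a\twn_sense_1,wn_sense_2\n", "noad_sense_b\twn_sense_3\n"]
    print(parse_sense_map(sample))
```

To read a real file, pass an open file handle, for example `parse_sense_map(open("manual_map.txt", encoding="utf-8"))`. The UTF-8 encoding is an assumption.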

Data Accuracy

The data were annotated through a crowdsourcing platform and are not guaranteed to be 100% accurate.

Contact

Person: Dayu Yuan
Email: dayuyuan@google.com
