Dataset asset | Open Source Community | Natural Language Processing | Word Sense Disambiguation
SemCor and Masc documents
Contains SemCor and Masc documents annotated with NOAD word senses for evaluating word‑sense disambiguation systems. Data are in XML format with detailed part‑of‑speech tags and segmentation information.
Source
github
Created
Dec 1, 2016
Updated
Jan 5, 2024
word_sense_disambiguation_corpora
Dataset Overview
Contents
- SemCor and Masc documents, annotated with NOAD (New Oxford American Dictionary) word senses.
File Format
- XML format, following the simple-wsd-doc.dtd DTD.
Part‑of‑Speech Tags
- Punctuation: .
- Adjective: ADJ
- Adposition: ADP
- Adverb: ADV
- Conjunction: CONJ
- Determiner: DET
- Noun: NOUN
- Numeral: NUM
- Pronoun: PRON
- Particle: PRT
- Verb: VERB
- Other: X
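As a rough illustration, the tag inventory above can be kept in a small lookup table so that unexpected tags are caught early when reading the corpus. The names `POS_TAGS` and `describe_pos` are hypothetical helpers, not part of the dataset.

```python
# Tag inventory copied from the list above.
# POS_TAGS and describe_pos are illustrative names, not part of the dataset.
POS_TAGS = {
    ".": "Punctuation",
    "ADJ": "Adjective",
    "ADP": "Adposition",
    "ADV": "Adverb",
    "CONJ": "Conjunction",
    "DET": "Determiner",
    "NOUN": "Noun",
    "NUM": "Numeral",
    "PRON": "Pronoun",
    "PRT": "Particle",
    "VERB": "Verb",
    "X": "Other",
}


def describe_pos(tag: str) -> str:
    """Return the category name for a tag, raising on tags outside the list."""
    try:
        return POS_TAGS[tag]
    except KeyError:
        raise ValueError(f"unknown POS tag: {tag!r}") from None
```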
Tokenization Levels
- No segmentation: NO_BREAK
- Space segmentation: SPACE_BREAK
- Line segmentation: LINE_BREAK
- Sentence segmentation: SENTENCE_BREAK
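A minimal sketch of how these segmentation values might be used to rebuild sentences from tokens. It assumes each token has already been extracted as a `(text, break_level)` pair and that the break level describes the boundary before the token; both are assumptions to check against simple-wsd-doc.dtd, and `split_sentences` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple

# Separator inserted before a token, keyed by its break level.
# Assumption: the break level describes the boundary *preceding* the token.
_SEPARATORS = {"NO_BREAK": "", "SPACE_BREAK": " ", "LINE_BREAK": "\n"}


def split_sentences(tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Group (text, break_level) pairs into sentence strings."""
    sentences: List[str] = []
    parts: List[str] = []
    for text, break_level in tokens:
        if break_level == "SENTENCE_BREAK":
            # Close the sentence in progress and start a new one.
            if parts:
                sentences.append("".join(parts))
            parts = [text]
        elif break_level in _SEPARATORS:
            # No leading separator at the very start of a sentence.
            if parts:
                parts.append(_SEPARATORS[break_level])
            parts.append(text)
        else:
            raise ValueError(f"unknown break level: {break_level!r}")
    if parts:
        sentences.append("".join(parts))
    return sentences
```

For example, the tokens `("Hello", "SENTENCE_BREAK")`, `(",", "NO_BREAK")`, `("world", "SPACE_BREAK")`, `(".", "NO_BREAK")` would be joined into the single sentence `Hello, world.`.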
Sense Mapping
- manual_map.txt: manually created sense mapping.
- algorithmic_map.txt: algorithmically generated sense mapping.
Mapping format: each line contains one NOAD word sense, a tab character, and one or more comma‑separated WordNet word senses (NOAD_word_sense\tWordNet_word_senses).
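The mapping format can be read with a few lines of Python. This is a sketch under the assumption that the files contain no header or comment lines; `parse_sense_map` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List


def parse_sense_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each NOAD sense to its list of WordNet senses.

    Assumes every non-blank line is: NOAD sense, a tab, then
    comma-separated WordNet senses. Header or comment lines are not handled.
    """
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        noad, _, wordnet = line.partition("\t")
        senses = [s.strip() for s in wordnet.split(",") if s.strip()]
        # Merge entries if the same NOAD sense appears on several lines.
        mapping.setdefault(noad.strip(), []).extend(senses)
    return mapping
```

An open file object can be passed directly, for instance `parse_sense_map(open("manual_map.txt", encoding="utf-8"))`, since iterating over a text file yields its lines.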
Data Accuracy
- Data were annotated via a crowdsourcing platform and are not guaranteed to be 100% accurate.
Contact
- Person: Dayu Yuan
- Email: dayuyuan@google.com