SemCor and MASC documents
Contains SemCor and MASC documents annotated with NOAD (New Oxford American Dictionary) word senses, intended for evaluating word-sense disambiguation systems. The data are in XML format and include part-of-speech tags and segmentation information for each token.
Updated 1/5/2024
Description
word_sense_disambiguation_corpora
Dataset Overview
Contents
- SemCor and MASC documents, annotated with NOAD (New Oxford American Dictionary) word senses.
File Format
- XML format, following the simple-wsd-doc.dtd DTD.
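Since the documents are plain XML, they can be read with a standard XML parser. The sketch below is only illustrative: the element name `word` and the attribute names `text`, `pos`, `lemma`, `sense`, and `break_level` are assumptions made for this example and are not taken from simple-wsd-doc.dtd, which remains the authority on the actual schema. The sense identifier in the sample is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element and attribute names are
# assumptions for illustration, not copied from simple-wsd-doc.dtd.
SAMPLE_XML = """<document>
  <word text="The" pos="DET" break_level="NO_BREAK"/>
  <word text="bank" pos="NOUN" lemma="bank" sense="bank_sense_1" break_level="SPACE_BREAK"/>
  <word text="closed" pos="VERB" break_level="SPACE_BREAK"/>
  <word text="." pos="." break_level="NO_BREAK"/>
</document>"""


def iter_words(xml_text, word_tag="word"):
    """Yield the attribute dict of every element named `word_tag`."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(word_tag):
        yield dict(elem.attrib)


words = list(iter_words(SAMPLE_XML))

# Keep only the tokens that carry a sense annotation.
annotated = [w for w in words if "sense" in w]
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(xml_text)`; the tag and attribute names should be adjusted to whatever the DTD actually declares.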
Part‑of‑Speech Tags
- Punctuation: .
- Adjective: ADJ
- Adposition: ADP
- Adverb: ADV
- Conjunction: CONJ
- Determiner: DET
- Noun: NOUN
- Numeral: NUM
- Pronoun: PRON
- Particle: PRT
- Verb: VERB
- Other: X
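The tag set above is small enough to keep as a lookup table, which also makes it easy to catch unexpected tags while reading the data. This is a minimal sketch; the names `POS_TAGS` and `describe_pos` are hypothetical helpers introduced here, not part of the dataset.

```python
# The twelve part-of-speech tags listed above, mapped to their descriptions.
POS_TAGS = {
    ".": "punctuation",
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "CONJ": "conjunction",
    "DET": "determiner",
    "NOUN": "noun",
    "NUM": "numeral",
    "PRON": "pronoun",
    "PRT": "particle",
    "VERB": "verb",
    "X": "other",
}


def describe_pos(tag):
    """Return the description of a tag, or raise ValueError if it is unknown."""
    try:
        return POS_TAGS[tag]
    except KeyError:
        raise ValueError(f"unknown part-of-speech tag: {tag!r}") from None
```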
Tokenization Levels
- No segmentation: NO_BREAK
- Space segmentation: SPACE_BREAK
- Line segmentation: LINE_BREAK
- Sentence segmentation: SENTENCE_BREAK
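The break levels can be used to rebuild readable text from a token sequence. The sketch below assumes that each token's break level describes the boundary immediately before that token; the source does not state this, so it should be checked against the data. The function name `rebuild_sentences` and the sample tokens are hypothetical.

```python
BREAK_LEVELS = {"NO_BREAK", "SPACE_BREAK", "LINE_BREAK", "SENTENCE_BREAK"}


def rebuild_sentences(tokens):
    """Group (text, break_level) pairs into sentence strings.

    Assumption: the break level applies to the boundary BEFORE the token.
    """
    sentences = []
    current = ""
    for text, level in tokens:
        if level not in BREAK_LEVELS:
            raise ValueError(f"unknown break level: {level!r}")
        if level == "SENTENCE_BREAK" and current:
            # Close the running sentence and start a new one.
            sentences.append(current)
            current = text
        elif level == "NO_BREAK" or not current:
            # No separator, or this is the first token of a sentence.
            current += text
        elif level == "LINE_BREAK":
            current += "\n" + text
        else:
            # SPACE_BREAK
            current += " " + text
    if current:
        sentences.append(current)
    return sentences


sample = [
    ("The", "SENTENCE_BREAK"),
    ("bank", "SPACE_BREAK"),
    ("closed", "SPACE_BREAK"),
    (".", "NO_BREAK"),
    ("It", "SENTENCE_BREAK"),
    ("reopened", "SPACE_BREAK"),
    (".", "NO_BREAK"),
]
result = rebuild_sentences(sample)
```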
Sense Mapping
- manual_map.txt: manually created sense mapping.
- algorithmic_map.txt: algorithmically generated sense mapping.
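Mapping format: each line holds one NOAD word sense, a tab character, and the corresponding WordNet word senses as a comma-separated list (NOAD_word_sense\tWordNet_word_senses).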
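A mapping file in this tab-separated layout can be loaded into a dictionary with a few lines of code. The following is a minimal sketch: `parse_sense_map` is a hypothetical helper, and the sense identifiers in the sample are made up for illustration and do not come from manual_map.txt or algorithmic_map.txt.

```python
def parse_sense_map(lines):
    """Parse 'NOAD_sense<TAB>wn_sense_1,wn_sense_2' lines into a dict.

    Returns {noad_sense: [wordnet_sense, ...]}. Blank lines are skipped.
    """
    mapping = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        noad, _, wordnet = line.partition("\t")
        senses = [s.strip() for s in wordnet.split(",") if s.strip()]
        # Merge repeated keys rather than overwriting earlier entries.
        mapping.setdefault(noad.strip(), []).extend(senses)
    return mapping


# Made-up identifiers, used only to show the layout.
sample_lines = [
    "noad_sense_a\twn_sense_1,wn_sense_2\n",
    "\n",
    "noad_sense_b\twn_sense_3\n",
]
sense_map = parse_sense_map(sample_lines)
```

To read a real file, the same function can be given an open file object, for example `with open("manual_map.txt", encoding="utf-8") as f: sense_map = parse_sense_map(f)`; the encoding is an assumption.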
Data Accuracy
- The data were annotated on a crowdsourcing platform and are not guaranteed to be 100% accurate.
Contact
- Person: Dayu Yuan
- Email: dayuyuan@google.com
Topics
Word Sense Disambiguation
Natural Language Processing
Source
Organization: github
Created: 12/1/2016