leondz/wnut_17
The WNUT 17 dataset is a named entity recognition (NER) dataset focusing on identifying novel and rare entities in noisy text. It includes training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, token list, and IOB2‑formatted NER labels covering entities such as companies, creative works, groups, locations, persons, and products. The dataset was created to provide definitions for emerging and rare entities and to support detection of such entities.
Dataset description and usage context
Dataset Overview
Dataset Name
- Name: WNUT 17
- Alias: wnut_17
Dataset Description
- Task: Emerging and rare entity recognition
- Language: English (en)
- License: CC‑BY‑4.0
- Source: Original data
- Data Type: Monolingual
- Scale: 1K < n < 10K
- Task Category: Token Classification
- Task ID: Named Entity Recognition
Dataset Structure
- Features:
  - id: string, example identifier
  - tokens: list of strings, example text tokens
  - ner_tags: list of labels, IOB2‑formatted NER tags
- Splits:
  - train: 3,394 examples
  - validation: 1,009 examples
  - test: 1,287 examples
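The IOB2 scheme used for ner_tags marks the first token of an entity with a `B-` prefix and each continuation token with an `I-` prefix, while `O` marks tokens outside any entity. The sketch below shows one way to group such tags back into entity spans. It is illustrative only: the function name `iob2_to_spans`, the sample sentence, and the label strings `creative-work` and `location` are assumptions made for the example, and it assumes the tags are already strings rather than integer class IDs.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the currently open entity, if there is one.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (invalid in strict IOB2, treated here as outside).
            flush()
    flush()
    return spans


# Hypothetical sample shaped like one dataset example (id, tokens, ner_tags).
sample = {
    "id": "0",
    "tokens": ["Watching", "Stranger", "Things", "in", "New", "York", "tonight"],
    "ner_tags": ["O", "B-creative-work", "I-creative-work", "O",
                 "B-location", "I-location", "O"],
}

print(iob2_to_spans(sample["tokens"], sample["ner_tags"]))
# [('creative-work', 'Stranger Things'), ('location', 'New York')]
```

If the labels are stored as integer IDs, they would first need to be mapped to their string names before calling a helper like this one.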
Annotation
- Annotators: Crowd‑sourced
- Language Creators: Found (text collected from existing sources)
Usage Notes
- Citation:
@inproceedings{derczynski-etal-2017-results,
    title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition",
    author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut",
    booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4418",
    doi = "10.18653/v1/W17-4418",
    pages = "140--147",
    abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. ..."
}