
leondz/wnut_17

WNUT 17 is a named entity recognition (NER) dataset focused on identifying novel and rare entities in noisy, user-generated text. It is split into training (3,394 examples), validation (1,009 examples), and test (1,287 examples) sets. Each example contains an ID, a list of tokens, and IOB2-formatted NER labels covering six entity types: corporations, creative works, groups, locations, persons, and products. The dataset was created for the WNUT 2017 shared task to provide a definition of emerging and rare entities and, based on that definition, data for detecting them.

Source

hugging_face

Created

Nov 28, 2025

Updated

Jan 18, 2024

Dataset Overview

Dataset Name

Name: WNUT 17
Alias: wnut_17

Dataset Description

Task: Emerging and rare entity recognition
Language: English (en)
License: CC‑BY‑4.0
Source: Original data
Data Type: Monolingual
Scale: 1K < n < 10K
Task Category: Token Classification
Task ID: Named Entity Recognition

Dataset Structure

Features:
- id: string, example identifier
- tokens: list of strings, example text tokens
- ner_tags: list of labels, IOB2‑formatted NER tags
Splits:
- train: 3,394 examples
- validation: 1,009 examples
- test: 1,287 examples
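
The `tokens` and `ner_tags` features can be combined to recover entity mentions. The following is a minimal sketch of IOB2 decoding, not part of the dataset's own tooling. The function name `iob2_to_spans` and the sample record are hypothetical, and the tag strings (`B-location`, `I-location`, `B-person`) are assumed label names based on the entity types described above.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the currently open entity, if there is one.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such stray I- tokens are dropped in this sketch).
            flush()
    flush()
    return spans


# Hypothetical record shaped like a dataset example (id, tokens, ner_tags),
# with tags written as strings rather than integer class IDs.
sample = {
    "id": "0",
    "tokens": ["Heading", "to", "Empire", "State", "Building", "with", "Alice"],
    "ner_tags": ["O", "O", "B-location", "I-location", "I-location", "O", "B-person"],
}

entities = iob2_to_spans(sample["tokens"], sample["ner_tags"])
print(entities)  # [('location', 'Empire State Building'), ('person', 'Alice')]
```

Rows loaded from the Hugging Face hub typically store `ner_tags` as integer class IDs, so they would likely need to be mapped to label strings before being passed to a function like this one.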

Annotation

Annotators: Crowd‑sourced
Language Creators: Found (text collected from existing sources)

Usage Notes

Citation:

@inproceedings{derczynski-etal-2017-results,
    title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition",
    author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut",
    booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4418",
    doi = "10.18653/v1/W17-4418",
    pages = "140--147",
    abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. ..."
}
