Dataset asset · Open Source Community · Text Classification · Named Entity Recognition

leondz/wnut_17

WNUT 17 is a named entity recognition (NER) dataset focused on identifying novel and rare entities in noisy text. It is split into training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, a list of tokens, and IOB2‑formatted NER labels covering entities such as companies, creative works, groups, locations, persons, and products. The dataset was created to provide definitions of emerging and rare entities and to support their detection.

Source
hugging_face
Created
Nov 28, 2025
Updated
Jan 18, 2024
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • Name: WNUT 17
  • Alias: wnut_17

Dataset Description

  • Task: Emerging and rare entity recognition
  • Language: English (en)
  • License: CC‑BY‑4.0
  • Source: Original data
  • Data Type: Monolingual
  • Scale: 1K < n < 10K
  • Task Category: Token Classification
  • Task ID: Named Entity Recognition

Dataset Structure

  • Features:
    • id: string, example identifier
    • tokens: list of strings, example text tokens
    • ner_tags: list of labels, IOB2‑formatted NER tags
  • Splits:
    • train: 3,394 examples
    • validation: 1,009 examples
    • test: 1,287 examples
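
To illustrate how the tokens and ner_tags features fit together, here is a minimal sketch that groups IOB2 tags into entity spans. The example record, the helper name iob2_to_spans, and the tag strings (such as "B-location") are assumptions made for illustration; they are not taken from the dataset itself, and the published data may store tags as integer class IDs rather than strings.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper: a "B-" tag opens a span, a matching "I-" tag extends it,
    and anything else (an "O" tag or a stray "I-" tag) closes the open span.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any span that was still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" or an "I-" tag without a matching open span: close and skip.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up record in the id / tokens / ner_tags layout described above.
example = {
    "id": "0",
    "tokens": ["Empire", "State", "Building", "is", "in", "New", "York"],
    "ner_tags": ["B-location", "I-location", "I-location", "O", "O",
                 "B-location", "I-location"],
}

spans = iob2_to_spans(example["tokens"], example["ner_tags"])
print(spans)
```

If the dataset is loaded through the Hugging Face datasets library and ner_tags arrive as integers, they would first need to be mapped to tag strings (for instance with the label feature's int2str method) before being passed to a helper like this one.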

Annotation

  • Annotators: Crowd‑sourced
  • Language Creators: Discovery

Usage Notes

  • Citation:

    @inproceedings{derczynski-etal-2017-results,
      title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition",
      author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut",
      booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text",
      month = sep,
      year = "2017",
      address = "Copenhagen, Denmark",
      publisher = "Association for Computational Linguistics",
      url = "https://www.aclweb.org/anthology/W17-4418",
      doi = "10.18653/v1/W17-4418",
      pages = "140--147",
      abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. ..."
    }
