leondz/wnut_17

WNUT 17 is a named entity recognition (NER) dataset for identifying novel and rare entities in noisy, user-generated text. It is split into training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, a list of tokens, and IOB2-formatted NER labels covering six entity types: corporations, creative works, groups, locations, persons, and products. The dataset comes from the WNUT 2017 shared task, which set out to define emerging and rare entities and to provide data for detecting them.
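As an illustration of the IOB2 scheme mentioned above, the following minimal sketch groups tagged tokens into entity spans. The function name `iob2_to_spans` and the sample sentence are made up for this example and are not taken from the dataset; the tag strings assume the `B-<type>` / `I-<type>` / `O` convention.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) pairs.

    Hypothetical helper for illustration; not part of the dataset or any library.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span that is currently open, if there is one.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new entity, even right after another one.
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span (invalid in
            # strict IOB2); this sketch simply drops such tokens.
            flush()
    flush()
    return spans


# Invented example sentence, not a real WNUT 17 sample.
example_tokens = ["Empire", "State", "Building", "is", "in", "New", "York"]
example_tags = ["B-location", "I-location", "I-location", "O", "O",
                "B-location", "I-location"]
print(iob2_to_spans(example_tokens, example_tags))
```

Dropping stray `I-` tags is only one possible policy; an evaluation script may instead treat them as the start of a new span.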

Updated 1/18/2024
hugging_face

Description

Dataset Overview

Dataset Name

  • Name: WNUT 17
  • Alias: wnut_17

Dataset Description

  • Task: Emerging and rare entity recognition
  • Language: English (en)
  • License: CC‑BY‑4.0
  • Source: Original data
  • Data Type: Monolingual
  • Scale: 1K < n < 10K
  • Task Category: Token Classification
  • Task ID: Named Entity Recognition

Dataset Structure

  • Features:
    • id: string, example identifier
    • tokens: list of strings, example text tokens
    • ner_tags: list of labels, IOB2-formatted NER tags, one per token
  • Splits:
    • train: 3,394 examples
    • validation: 1,009 examples
    • test: 1,287 examples
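The structure above can be sanity-checked with a short script. This is a sketch under assumptions: it assumes the Hugging Face `datasets` library is installed and that the dataset still loads under the id `leondz/wnut_17` in the installed version, which may not hold for every release. The helper names `load_wnut17`, `check_split_sizes`, and `is_well_formed` are hypothetical.

```python
# Split sizes as listed in the dataset description above.
EXPECTED_SPLIT_SIZES = {"train": 3394, "validation": 1009, "test": 1287}


def load_wnut17():
    """Load the dataset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed and the id
    "leondz/wnut_17" is loadable with the installed version.
    """
    from datasets import load_dataset  # imported lazily so the helpers below work offline
    return load_dataset("leondz/wnut_17")


def check_split_sizes(splits):
    """Return {split_name: (expected, actual)} for every split whose size differs.

    `splits` is any mapping from split name to a sized collection;
    a missing split is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected in EXPECTED_SPLIT_SIZES.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


def is_well_formed(example):
    """Check that an example has a string id and exactly one tag per token."""
    return (
        isinstance(example["id"], str)
        and len(example["tokens"]) == len(example["ner_tags"])
    )
```

With network access, `check_split_sizes(load_wnut17())` would be expected to return an empty dict if the hosted data matches the counts listed above.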

Annotation

  • Annotators: Crowd‑sourced
  • Language Creators: Found (text collected from existing sources)

Usage Notes

  • Citation:

    @inproceedings{derczynski-etal-2017-results,
      title = "Results of the {WNUT}2017 Shared Task on Novel and Emerging Entity Recognition",
      author = "Derczynski, Leon and Nichols, Eric and van Erp, Marieke and Limsopatham, Nut",
      booktitle = "Proceedings of the 3rd Workshop on Noisy User-generated Text",
      month = sep,
      year = "2017",
      address = "Copenhagen, Denmark",
      publisher = "Association for Computational Linguistics",
      url = "https://www.aclweb.org/anthology/W17-4418",
      doi = "10.18653/v1/W17-4418",
      pages = "140--147",
      abstract = "This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. ..."
    }

Topics

Named Entity Recognition
Text Classification

Source

Organization: hugging_face

Created: Unknown
