DFKI-SLT/DWIE

The DWIE (Deutsche Welle corpus for Information Extraction) dataset is designed for document‑level multi‑task information extraction. It combines four main IE subtasks: named entity recognition, coreference resolution, relation extraction, and entity linking. Entities and relations are annotated at the document level, and entities are linked to Wikipedia. The text is English, and the dataset card lists feature extraction and text classification as its task categories.

Updated 5/15/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Name: DWIE (Deutsche Welle corpus for Information Extraction)
  • Language: English
  • License: Other
  • Multilinguality: Monolingual
  • Size: 10M<n<100M
  • Source Data: Raw data
  • Task Categories: Feature extraction, Text classification
  • Task ID: entity-linking-classification
  • Papers with Code ID: acronym-identification
  • Labels: Named Entity Recognition, Coreference Resolution, Relation Extraction, Entity Linking

Dataset Structure

Data Fields

  • id: Unique identifier of the article.
  • content: Text of the article, downloaded via src/dwie_download.py.
  • tags: Tags indicating whether the document belongs to the train or test split.
  • mentions: List of entity mentions in the article, each containing:
    • begin: Offset of the first character of the mention in the content field.
    • end: Offset of the last character of the mention in the content field.
    • text: Text representation of the entity mention.
    • concept: Entity ID that the mention refers to (multiple mentions may point to the same concept).
    • candidates: Candidate Wikipedia links.
    • scores: Prior probabilities of candidate entity links computed from Wikipedia corpora.
  • concepts: Aggregated list of entities, each annotation includes:
    • concept: Unique ID of the document‑level entity.
    • text: Text of the longest mention for the entity.
    • keyword: Flag indicating whether the entity is a keyword.
    • count: Number of mentions of the entity in the document.
    • link: Wikipedia link of the entity.
    • tags: Multi‑label classification tags associated with the entity.
  • relations: Document‑level list of relations between entities (concepts), each annotation includes:
    • s: Subject entity ID.
    • p: Predicate defining the relation name (e.g., "citizen_of", "member_of").
    • o: Object entity ID.
  • iptc: Multi‑label IPTC classification codes for the article.
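To make the relationship between mentions, concepts, and relations concrete, the following is a minimal Python sketch that works on a single document shaped like the fields listed above. The sample document is invented for illustration and is not taken from DWIE; the integer concept IDs and the helper names (mention_span, coreference_clusters, readable_relations) are assumptions, not part of any official loader. Because the card describes end as the offset of the last character, the sketch checks both an exclusive and an inclusive reading of that offset against the text field rather than assuming one.

```python
from collections import defaultdict

# Hypothetical document following the field layout described above.
# All values are invented for illustration; they are not real DWIE data.
doc = {
    "id": "example-0",
    "content": "Angela Merkel leads Germany. Merkel spoke today.",
    "tags": ["train"],
    "mentions": [
        {"begin": 0, "end": 13, "text": "Angela Merkel", "concept": 0},
        {"begin": 20, "end": 27, "text": "Germany", "concept": 1},
        {"begin": 29, "end": 35, "text": "Merkel", "concept": 0},
    ],
    "concepts": [
        {"concept": 0, "text": "Angela Merkel", "count": 2},
        {"concept": 1, "text": "Germany", "count": 1},
    ],
    "relations": [{"s": 0, "p": "citizen_of", "o": 1}],
}


def mention_span(content, mention):
    """Slice a mention out of the content, accepting either end-offset convention."""
    begin, end = mention["begin"], mention["end"]
    # Try an exclusive end first, then an inclusive end (assumption: one of them matches).
    for stop in (end, end + 1):
        if content[begin:stop] == mention["text"]:
            return content[begin:stop]
    raise ValueError(f"offsets {begin}-{end} do not match {mention['text']!r}")


def coreference_clusters(doc):
    """Group mention texts by the concept they refer to."""
    clusters = defaultdict(list)
    for mention in doc["mentions"]:
        clusters[mention["concept"]].append(mention["text"])
    return dict(clusters)


def readable_relations(doc):
    """Replace subject and object concept IDs with the concept's text."""
    names = {c["concept"]: c["text"] for c in doc["concepts"]}
    return [(names[r["s"]], r["p"], names[r["o"]]) for r in doc["relations"]]


spans = [mention_span(doc["content"], m) for m in doc["mentions"]]
clusters = coreference_clusters(doc)
triples = readable_relations(doc)

print(spans)
print(clusters)
print(triples)
```

Grouping mentions by concept recovers the coreference clusters, and the relations only become readable once the s and o IDs are looked up in the concepts list, since relations are defined between document‑level entities rather than between individual mentions.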

Dataset Creation

Data Sources

  • Initial data collection and normalization: Not provided.
  • Source language producers: Not provided.

Annotation

  • Annotation process: Not provided.
  • Annotators: Not provided.

Personal and Sensitive Information

  • Handling of personal and sensitive information: Not provided.

Considerations for Using the Dataset

  • Social impact: Not provided.
  • Bias discussion: Not provided.
  • Other known limitations: Not provided.

Topics

Information Extraction
Entity Recognition

Source

Organization: hugging_face

Created: Unknown
