Dataset Overview

Basic Information

Name: DWIE (Deutsche Welle corpus for Information Extraction)
Language: English
License: Other
Multilinguality: Monolingual
Size: 10M<n<100M
Source Data: Raw data
Task Categories: Feature extraction, Text classification
Task ID: entity-linking-classification
Paper Code ID: acronym-identification
Labels: Named Entity Recognition, Coreference Resolution, Relation Extraction, Entity Linking

Data Fields

id: Unique identifier of the article.
content: Text of the article, downloaded via src/dwie_download.py.
tags: Tags indicating whether the document belongs to the train or test split.
mentions: List of entity mentions in the article, each containing:
- begin: Offset of the first character of the mention in the content field.
- end: Offset of the last character of the mention in the content field.
- text: Text representation of the entity mention.
- concept: Entity ID that the mention refers to (multiple mentions may point to the same concept).
- candidates: Candidate Wikipedia links.
- scores: Prior probabilities of candidate entity links computed from Wikipedia corpora.
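
The begin and end offsets can be used to recover a mention's surface form from the content field. The following is a minimal sketch, not code from the DWIE repository. The description above calls end the offset of the last character, but many annotation formats store an exclusive end offset, so the hypothetical helper mention_span tries both conventions and keeps the slice that equals the text field. The toy document is invented for illustration.

```python
def mention_span(content, mention):
    """Return the substring of `content` covered by a mention.

    Tries an exclusive end offset first, then an inclusive one, and
    returns whichever slice equals the mention's `text` field.
    """
    begin, end = mention["begin"], mention["end"]
    for stop in (end, end + 1):
        if content[begin:stop] == mention["text"]:
            return content[begin:stop]
    raise ValueError(
        f"offsets {begin}-{end} do not reproduce {mention['text']!r}"
    )


# Hypothetical toy document, not taken from the corpus.
content = "Angela Merkel visited Paris."
mention = {"begin": 0, "end": 13, "text": "Angela Merkel", "concept": 0}

span = mention_span(content, mention)
```

Running the same check over a handful of real articles is a quick way to find out which offset convention the downloaded data actually follows.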
concepts: Aggregated list of entities; each annotation includes:
- concept: Unique ID of the document‑level entity.
- text: Text of the longest mention for the entity.
- keyword: Flag indicating whether the entity is a keyword.
- count: Number of mentions of the entity in the document.
- link: Wikipedia link of the entity.
- tags: Multi‑label classification tags associated with the entity.
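
Because several mentions may point to the same concept, grouping mentions by their concept ID yields the document's coreference clusters. The sketch below assumes a document is a plain dict with the mentions and concepts fields described above. The function names and the toy data are hypothetical. It also checks whether each concept's count equals the number of mentions that refer to it.

```python
from collections import defaultdict


def mentions_by_concept(doc):
    """Group mention texts by the concept ID they refer to."""
    clusters = defaultdict(list)
    for mention in doc["mentions"]:
        clusters[mention["concept"]].append(mention["text"])
    return dict(clusters)


def count_mismatches(doc):
    """Return IDs of concepts whose `count` differs from their cluster size."""
    clusters = mentions_by_concept(doc)
    return [
        concept["concept"]
        for concept in doc["concepts"]
        if concept["count"] != len(clusters.get(concept["concept"], []))
    ]


# Hypothetical toy document, not taken from the corpus.
doc = {
    "mentions": [
        {"text": "Angela Merkel", "concept": 0},
        {"text": "Merkel", "concept": 0},
        {"text": "Paris", "concept": 1},
    ],
    "concepts": [
        {"concept": 0, "text": "Angela Merkel", "count": 2},
        {"concept": 1, "text": "Paris", "count": 1},
    ],
}

clusters = mentions_by_concept(doc)
mismatches = count_mismatches(doc)
```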
relations: Document‑level list of relations between entities (concepts); each annotation includes:
- s: Subject entity ID.
- p: Predicate defining the relation name (e.g., "citizen_of", "member_of").
- o: Object entity ID.
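
Relations refer to entities by ID only, so reading them usually means joining s and o against the concepts list. A minimal sketch follows, assuming the s and o values match the concept field of entries in concepts. The helper name readable_relations and the toy data are hypothetical.

```python
def readable_relations(doc):
    """Replace subject and object IDs with the concept's text."""
    names = {concept["concept"]: concept["text"] for concept in doc["concepts"]}
    return [
        (names[rel["s"]], rel["p"], names[rel["o"]])
        for rel in doc["relations"]
    ]


# Hypothetical toy document, not taken from the corpus.
doc = {
    "concepts": [
        {"concept": 0, "text": "Angela Merkel"},
        {"concept": 1, "text": "Germany"},
    ],
    "relations": [
        {"s": 0, "p": "citizen_of", "o": 1},
    ],
}

triples = readable_relations(doc)
```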
iptc: Multi‑label IPTC classification codes for the article.
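
The document-level tags and iptc fields can be used to partition the corpus and to inspect its topic distribution. The sketch below assumes that tags is a list of strings containing either "train" or "test" and that iptc is a list of code strings. The exact values stored in the corpus may differ, and the IDs and codes shown are placeholders.

```python
from collections import Counter


def split_documents(docs):
    """Map each split name to the IDs of the documents tagged with it."""
    splits = {"train": [], "test": []}
    for doc in docs:
        for name in splits:
            if name in doc["tags"]:
                splits[name].append(doc["id"])
    return splits


def iptc_counts(docs):
    """Count how many documents carry each IPTC code."""
    return Counter(code for doc in docs for code in doc["iptc"])


# Hypothetical toy documents; the IDs, tags, and codes are placeholders.
docs = [
    {"id": "doc_1", "tags": ["train"], "iptc": ["code_a", "code_b"]},
    {"id": "doc_2", "tags": ["test"], "iptc": ["code_a"]},
    {"id": "doc_3", "tags": ["train"], "iptc": []},
]

splits = split_documents(docs)
codes = iptc_counts(docs)
```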