DFKI-SLT/DWIE
DWIE (Deutsche Welle corpus for Information Extraction) is a dataset designed for document-level multi-task information extraction. It combines four main IE subtasks: named entity recognition, coreference resolution, relation extraction, and entity linking. Entities and relations are annotated at the document level and linked to Wikipedia, and the dataset is listed as suitable for feature extraction and text classification tasks on English text.
Description
Dataset Overview
Basic Information
- Name: DWIE (Deutsche Welle corpus for Information Extraction)
- Language: English
- License: Other
- Multilinguality: Monolingual
- Size: 10M<n<100M
- Source Data: Raw data
- Task Categories: Feature extraction, Text classification
- Task ID: entity-linking-classification
- Papers with Code ID: acronym-identification
- Labels: Named Entity Recognition, Coreference Resolution, Relation Extraction, Entity Linking
Dataset Structure
Data Fields
- id: Unique identifier of the article.
- content: Text of the article, downloaded via src/dwie_download.py.
- tags: Tags indicating whether the document belongs to the train or test split.
- mentions: List of entity mentions in the article, each containing:
  - begin: Offset of the first character of the mention in the content field.
  - end: Offset of the last character of the mention in the content field.
  - text: Text representation of the entity mention.
  - concept: Entity ID that the mention refers to (multiple mentions may point to the same concept).
  - candidates: Candidate Wikipedia links.
  - scores: Prior probabilities of candidate entity links computed from Wikipedia corpora.
- concepts: Aggregated list of entities, each annotation includes:
  - concept: Unique ID of the document-level entity.
  - text: Text of the longest mention for the entity.
  - keyword: Flag indicating whether the entity is a keyword.
  - count: Number of mentions of the entity in the document.
  - link: Wikipedia link of the entity.
  - tags: Multi-label classification tags associated with the entity.
- relations: Document‑level list of relations between entities (concepts), each annotation includes:
  - s: Subject entity ID.
  - p: Predicate defining the relation name (e.g., "citizen_of", "member_of").
  - o: Object entity ID.
- iptc: Multi‑label IPTC classification codes for the article.
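The fields above can be navigated with a few dictionary lookups. The following is a minimal sketch, assuming a document has already been loaded into a Python dict with the field names listed above. The toy document, its values, and the helper names (mentions_by_concept, relation_triples, split_of) are hypothetical and are not part of the DWIE repository. The begin and end offsets are omitted because the sketch does not slice the content.

```python
from collections import defaultdict

# Hypothetical toy document that mirrors the field layout described above.
# All values are invented for illustration and are not taken from DWIE.
doc = {
    "id": "toy_0001",
    "content": "Angela Merkel spoke in Germany. Merkel later left.",
    "tags": ["train"],
    "mentions": [
        {"text": "Angela Merkel", "concept": 0},
        {"text": "Germany", "concept": 1},
        {"text": "Merkel", "concept": 0},
    ],
    "concepts": [
        {"concept": 0, "text": "Angela Merkel", "count": 2},
        {"concept": 1, "text": "Germany", "count": 1},
    ],
    "relations": [{"s": 0, "p": "citizen_of", "o": 1}],
    "iptc": [],
}


def mentions_by_concept(doc):
    """Group mention texts by the concept ID they refer to (coreference clusters)."""
    clusters = defaultdict(list)
    for mention in doc["mentions"]:
        clusters[mention["concept"]].append(mention["text"])
    return dict(clusters)


def relation_triples(doc):
    """Replace subject and object IDs in each relation with the concept text."""
    names = {c["concept"]: c["text"] for c in doc["concepts"]}
    return [(names[r["s"]], r["p"], names[r["o"]]) for r in doc["relations"]]


def split_of(doc):
    """Return 'train' or 'test' if the document tags contain it, else None."""
    for split in ("train", "test"):
        if split in doc["tags"]:
            return split
    return None


print(mentions_by_concept(doc))
print(relation_triples(doc))
print(split_of(doc))
```

Keying everything on the concept ID reflects the entity-centric layout described above: mentions point to concepts, and relations are stated between concepts rather than between individual mentions.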
Dataset Creation
Data Sources
- Initial data collection and normalization: Not provided.
- Source language producers: Not provided.
Annotation
- Annotation process: Not provided.
- Annotators: Not provided.
Personal and Sensitive Information
- Handling of personal and sensitive information: Not provided.
Considerations for Using the Dataset
- Social impact: Not provided.
- Bias discussion: Not provided.
- Other known limitations: Not provided.
Source
Organization: hugging_face
Created: Unknown