Tags: Dataset asset, Open Source Community, Text Generation, Biography Data
WikiBio (Wikipedia biography dataset)
The dataset collects 728,321 Wikipedia biographies, intended to evaluate text generation algorithms. Each article provides the opening paragraph and infobox (both tokenized). It is used to assess algorithms that generate text from structured data, particularly in the biography domain.
Source
github
Created
Oct 25, 2016
Updated
Aug 4, 2023
Dataset Overview
Dataset Name
WikiBio (Wikipedia biography dataset)
Dataset Content
- Contains 728,321 Wikipedia biography articles.
- Each article provides the opening paragraph text and infobox (both tokenized).
Intended Use
For evaluating text generation algorithms, especially in the biography domain.
Data Processing
- Preprocessed using Stanford CoreNLP, including sentence segmentation and tokenization.
- Randomly split into three subsets: training (80%), validation (10%), test (10%).
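A random 80/10/10 split like the one described above can be sketched as follows. This is only an illustration: the function name `split_ids`, the seed, and the use of integer article indices are assumptions, and it will not reproduce the published split.

```python
import random

def split_ids(ids, seed=0):
    """Shuffle ids and cut them into 80/10/10 train/valid/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to test
    }

# Hypothetical example with 1,000 article indices.
splits = split_ids(range(1000))
```

Because the remainder goes to the test subset, the three lists always cover every ID exactly once, even when the total is not divisible by ten.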
Dataset Structure
The dataset is divided into three subdirectories: train, valid, test. Each subdirectory contains seven files, where SET stands for the subset name (train, valid, or test):
- SET.id: list of Wikipedia article IDs.
- SET.url: list of Wikipedia article URLs.
- SET.box: infobox data.
- SET.nb: number of sentences per article.
- SET.sent: list of sentences.
- SET.title: article title.
- SET.contributors: list of article contributors.
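Since SET.nb gives the number of sentences per article and SET.sent lists the sentences, the two files can be read together to recover each article's paragraph. The sketch below assumes that SET.sent holds one sentence per line, in article order, and that SET.nb holds one integer per line. The helper name `group_sentences` and the sample data are made up for illustration.

```python
from itertools import islice

def group_sentences(nb_lines, sent_lines):
    """Yield one list of sentences per article.

    nb_lines: iterable of sentence counts, one per article (as in SET.nb).
    sent_lines: iterable of sentences, one per line (as in SET.sent).
    """
    sentences = iter(sent_lines)
    for count in nb_lines:
        # Take the next `count` sentences for this article.
        yield [s.rstrip("\n") for s in islice(sentences, int(count))]

# Tiny in-memory stand-ins for SET.nb and SET.sent (invented content).
nb = ["2\n", "1\n"]
sent = ["a b c .\n", "d e .\n", "f g .\n"]
articles = list(group_sentences(nb, sent))
```

With the real files, the same function could be given two open file handles (for example for train.nb and train.sent inside the train subdirectory, assuming that naming), since it only iterates over lines.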
Infobox Data Format
- Each line represents an infobox.
- Each infobox is encoded as a list of tab-separated tokens.
- Each token has the format fieldname_position:wordtype, where position is the 1-based index of the word within the field's value.
- Empty fields or unreadable tokens are represented as: fieldname:<none>
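The token format above can be parsed back into a field-to-words mapping. The following is a minimal sketch, assuming that empty values appear as the literal placeholder `<none>` and that tokens of a field occur in position order. The function name `parse_infobox` and the sample line are hypothetical.

```python
def parse_infobox(line, empty="<none>"):
    """Turn one tab-separated infobox line into {fieldname: [words...]}."""
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        # Split on the first colon only, so words containing ':' stay intact.
        key, _, word = token.partition(":")
        # Field names may contain underscores, so split on the last one
        # and accept it only when a numeric position follows.
        name, sep, pos = key.rpartition("_")
        if not (sep and pos.isdigit()):
            name = key  # no position suffix, e.g. an empty field
        words = fields.setdefault(name, [])
        if word and word != empty:
            words.append(word)
    return fields

# Invented example line in the documented format.
line = "name_1:john\tname_2:doe\tbirth_date_1:1950\timage:<none>\n"
box = parse_infobox(line)
```

Empty fields are kept as keys with an empty list, so a caller can still tell that the field existed in the infobox.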
Citation
When using this dataset, cite the following paper:
- Rémi Lebret, David Grangier and Michael Auli. Neural Text Generation from Structured Data with Application to the Biography Domain. EMNLP 2016.
- Paper link: http://arxiv.org/abs/1603.07771