WikiBio (Wikipedia biography dataset)

The dataset collects 728,321 biographies from Wikipedia and is intended for evaluating text generation algorithms. For each article, it provides the opening paragraph and the infobox, both tokenized. It is used to assess algorithms that generate text from structured data, particularly in the biography domain.

Source

github

Created

Oct 25, 2016

Updated

Aug 4, 2023

Dataset Overview

Dataset Content

Contains 728,321 Wikipedia biography articles.
Each article provides the opening paragraph text and infobox (both tokenized).

Intended Use

For evaluating text generation algorithms, especially in the biography domain.

Data Processing

Preprocessed using Stanford CoreNLP, including sentence segmentation and tokenization.
Randomly split into three subsets: training (80%), validation (10%), test (10%).
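A random 80/10/10 split of this kind can be sketched as follows. This is only an illustration of the idea, not the procedure the dataset authors used; the function name split_ids, the seed, and the use of integer article IDs are assumptions made for the example.

```python
import random


def split_ids(article_ids, seed=0):
    """Shuffle the ids and cut them into 80% / 10% / 10% partitions."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = len(ids) * 8 // 10      # integer arithmetic avoids float rounding
    n_valid = len(ids) // 10
    return {
        "train": ids[:n_train],
        "valid": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],  # remainder goes to the test set
    }


splits = split_ids(range(1000))
print({name: len(part) for name, part in splits.items()})
# {'train': 800, 'valid': 100, 'test': 100}
```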

Dataset Structure

The dataset is divided into three subdirectories: train, valid, test. Each subdirectory contains seven files, where SET stands for the name of the subset (train, valid, or test):

SET.id: list of Wikipedia article IDs.
SET.url: list of Wikipedia article URLs.
SET.box: infobox data.
SET.nb: number of sentences per article.
SET.sent: list of sentences.
SET.title: article title.
SET.contributors: list of article contributors.
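Because SET.sent holds sentences and SET.nb holds the number of sentences per article, the two files can be combined to recover each article's paragraph. The sketch below assumes one count per line in SET.nb, one sentence per line in SET.sent, and a path layout of the form root/train/train.nb; the helper names group_sentences and load_sentences are hypothetical.

```python
from itertools import islice
from pathlib import Path


def group_sentences(counts, sentences):
    """Regroup a flat list of sentences into per-article lists."""
    it = iter(sentences)
    articles = [list(islice(it, n)) for n in counts]
    too_few = any(len(a) != n for a, n in zip(articles, counts))
    too_many = next(it, None) is not None
    if too_few or too_many:
        raise ValueError("sentence counts do not match the sentence list")
    return articles


def read_lines(path):
    """Read a text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_sentences(root, split):
    """Load one subset, assuming files named <split>.nb and <split>.sent."""
    base = Path(root) / split
    counts = [int(line) for line in read_lines(base / f"{split}.nb")]
    return group_sentences(counts, read_lines(base / f"{split}.sent"))


# In-memory example: the first article has two sentences, the second has one.
print(group_sentences([2, 1], ["a b .", "c d .", "e f ."]))
# [['a b .', 'c d .'], ['e f .']]
```

With the files on disk, a call such as load_sentences("wikipedia-biography-dataset", "valid") would return one list of sentences per article, in the same order as the lines of the other files; the directory name here is only a placeholder.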

Infobox Data Format

Each line of SET.box represents one infobox.
An infobox is encoded as a list of tab‑separated tokens.
Each token has the form fieldname_position:wordtype, where position is the index of the word within the field.
A field that is empty or contains no readable tokens is represented as fieldname:<none>.
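A line in this format can be turned into a field-to-text mapping with a few string operations. The following is a minimal sketch, not an official loader: the function name parse_infobox and the sample line are made up for illustration, and any token without a numeric position suffix is treated as an empty field.

```python
def parse_infobox(line):
    """Parse one tab-separated infobox line into {field name: joined words}."""
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        key, colon, word = token.partition(":")  # split at the first colon only
        if not colon:
            continue  # not a field token, skip it
        name, underscore, position = key.rpartition("_")
        has_position = bool(underscore) and position.isdigit()
        field = name if has_position else key
        words = fields.setdefault(field, [])
        if has_position:
            words.append(word)  # tokens are assumed to appear in position order
        # a token with no position suffix marks an empty field: keep it, no words
    return {field: " ".join(words) for field, words in fields.items()}


sample = "name_1:john\tname_2:doe\tbirth_date_1:1950\timage:<none>"
print(parse_infobox(sample))
# {'name': 'john doe', 'birth_date': '1950', 'image': ''}
```

Splitting the key on its last underscore keeps field names that themselves contain underscores, such as birth_date, intact.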

Citation

When using this dataset, cite the following paper:

Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016
Paper link: http://arxiv.org/abs/1603.07771
