GEM/xsum
XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is primarily used for abstractive summarization. Each record contains document, summary, and ID fields, and the corpus is randomly split into training, validation, and test sets. The dataset was created at the University of Edinburgh and is released under CC BY‑SA 4.0.
Dataset description and usage context
Dataset Overview
Basic Information
- Name: XSum
- Language: English
- License: cc‑by‑sa‑4.0
- Task Category: Summarization
- Source: Original data
Dataset Details
Dataset Summary
XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article.
Dataset Structure
- Data fields:
  - Document: input news article.
  - Summary: one‑sentence summary of the article.
  - Id: BBC article ID.
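The three fields above map naturally onto a plain dictionary. A minimal sketch of one record is shown below; the field names follow this card, but the values are invented for illustration and are not taken from the corpus:

```python
# Hypothetical XSum-style record. Field names ("document", "summary", "id")
# follow the dataset card; the values are invented for illustration.
record = {
    "document": "The new bridge opened to traffic on Monday after two years of construction.",
    "summary": "A long-awaited bridge has finally opened.",
    "id": "12345678",  # BBC article ID (hypothetical value)
}

# All three fields are plain strings.
assert set(record) == {"document", "summary", "id"}
assert all(isinstance(v, str) for v in record.values())
```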
Dataset Uses
The dataset is used for extreme abstractive summarization, aiming to answer “what is the article about” with a single sentence.
Dataset Creators
- Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata
- Affiliation: University of Edinburgh
Dataset Download and Documentation
- Download link: GitHub
- Related paper: ACL Anthology
Dataset Maintenance
- Maintenance plan: None
Language and Usage
Language Coverage
- Language: English (British English)
- Producers: Professional journalists
License
- License details: Creative Commons Attribution Share Alike 4.0 International
Primary Task
- Task: Summarization
Communication Goal
- Goal: Given a news article, generate a single‑sentence summary of its content.
Dataset Creation and Funding
Creation Organization
- Organization type: Academic
- Organization: University of Edinburgh
Funding Sources
- Funding: European Research Council, EU Horizon 2020 SUMMA project, Huawei Technologies
Dataset Structure and Labels
Data Splits
- Splits: Train (204,045 articles), Validation (11,332 articles), Test (11,334 articles)
Split Criteria
- Criteria: Random split using identifiers in the URL
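One common way to realize a random split over URL identifiers is to hash each identifier into a bucket. The helper below is an illustrative sketch of that idea, not the authors' actual procedure; the 90/5/5 proportions roughly match the published split sizes:

```python
import hashlib

def assign_split(article_id: str) -> str:
    """Deterministically map an article ID to a split bucket.

    Illustrative only: the real XSum split was fixed by the dataset
    authors. This sketch hashes the ID into one of 100 buckets and
    assigns roughly 90% train, 5% validation, 5% test.
    """
    bucket = int(hashlib.md5(article_id.encode()).hexdigest(), 16) % 100
    if bucket < 90:
        return "train"
    elif bucket < 95:
        return "validation"
    return "test"
```

Because the assignment depends only on the identifier, the same article always lands in the same split, which keeps the partition stable across re-crawls.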
Label Selection
- Labels: The first sentence of the source article is used as the summary label; the remainder serves as the input document.
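The label construction above can be sketched in a few lines. The naive split on `". "` below is a simplification of the real preprocessing and is shown only to illustrate the idea; the function name is hypothetical:

```python
def make_pair(article: str) -> tuple[str, str]:
    """Split an article into (summary, document).

    XSum uses the article's first sentence as the reference summary
    and the rest as the input document. The '. ' split here is a
    deliberate simplification of the actual preprocessing.
    """
    first, _, rest = article.partition(". ")
    return first + ".", rest

summary, document = make_pair(
    "A bridge has opened. It took two years to build. Crowds gathered."
)
# summary  -> "A bridge has opened."
# document -> "It took two years to build. Crowds gathered."
```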
Data Collection
Original Collection Rationale
- Rationale: To evaluate truly abstractive models, as most existing datasets are extractive.
Language Data Acquisition
- Acquisition method: Single‑website crawl
Data Pre‑processing
- Pre‑processing: Extraction of text from HTML, no further processing
Data Filtering
- Filtering: None
Social Impact
Social Bias
- Bias: Unclear whether documented social bias exists
Producer Representativeness
- Representativeness: Content focuses on UK news and does not represent global English usage