Dataset asset · Tags: Open Source Community, Sentiment Analysis, Text Summarization

GEM/xsum

XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is used primarily for abstractive summarization. Each record contains document, summary, and ID fields, and the data is randomly split into training, validation, and test sets. The dataset was created at the University of Edinburgh and is released under CC BY-SA 4.0.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 24, 2022
Signals
173 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Basic Information

  • Name: XSum
  • Language: English
  • License: cc‑by‑sa‑4.0
  • Task Category: Summarization
  • Source: Original data

Dataset Details

Dataset Summary

XSum is an English news summarization dataset in which the task is to predict the first sentence of an article based on the rest of the article.

Dataset Structure

  • Data fields:
    • Document: input news article.
    • Summary: one‑sentence summary of the article.
    • Id: BBC article ID.
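As a sketch of this record layout, the fields can be modeled as a plain dictionary. The field names come from the card; the example values below are invented for illustration, not taken from the dataset:

```python
# Hypothetical example record illustrating the XSum field layout
# (document, summary, id). Values are invented placeholders.
example = {
    "document": "The full text of a BBC news article, minus its first sentence.",
    "summary": "A one-sentence summary of the article.",
    "id": "12345678",  # BBC article ID
}

def has_xsum_fields(record):
    # A record should carry exactly the three documented fields.
    return set(record) == {"document", "summary", "id"}

print(has_xsum_fields(example))  # prints True
```

With the Hugging Face `datasets` library, records with this shape would typically be obtained via `load_dataset("GEM/xsum")`.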

Dataset Uses

The dataset is used for extreme abstractive summarization, aiming to answer “what is the article about” with a single sentence.

Dataset Creators

  • Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata
  • Affiliation: University of Edinburgh

Dataset Download and Documentation

Dataset Maintenance

  • Maintenance plan: None

Language and Usage

Language Coverage

  • Language: English (British English)
  • Producers: Professional journalists

License

  • License details: Creative Commons Attribution Share Alike 4.0 International

Primary Task

  • Task: Summarization

Communication Goal

  • Goal: Given a news article, generate a single‑sentence summary of its content.

Dataset Creation and Funding

Creation Organization

  • Organization type: Academic
  • Organization: University of Edinburgh

Funding Sources

  • Funding: European Research Council, EU Horizon 2020 SUMMA project, Huawei Technologies

Dataset Structure and Labels

Data Splits

  • Splits: Train (204,045 articles), Validation (11,332 articles), Test (11,334 articles)
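The split sizes above amount to roughly a 90/5/5 partition, which can be verified directly:

```python
# Split sizes as listed on the card.
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(splits.values())  # 226,711 articles in total

for name, n in splits.items():
    print(f"{name}: {n / total:.1%}")
# train: 90.0%
# validation: 5.0%
# test: 5.0%
```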

Split Criteria

  • Criteria: Random split using identifiers in the URL

Label Selection

  • Labels: The first sentence of the source article used as the label
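A minimal sketch of that label construction is shown below. The sentence-boundary rule here (splitting on the first ". ") is a simplifying assumption for illustration; the card does not describe the actual sentence segmentation used:

```python
# Sketch: split off the first sentence of an article to serve as the
# summary (label), keeping the remainder as the input document.
# Naive boundary rule (first ". ") -- an assumption, not XSum's real pipeline.
def split_first_sentence(article: str):
    first, sep, rest = article.partition(". ")
    summary = first + "." if sep else first
    return rest, summary

doc, summ = split_first_sentence(
    "A storm hit the coast overnight. Residents were evacuated. Damage is being assessed."
)
print(summ)  # "A storm hit the coast overnight."
```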

Data Collection

Original Collection Rationale

  • Rationale: To evaluate truly abstractive models, as most existing datasets are extractive.

Language Data Acquisition

  • Acquisition method: Single‑website crawl

Data Pre‑processing

  • Pre‑processing: Extraction of text from HTML, no further processing
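A minimal sketch of that extraction step, using Python's standard-library HTML parser (the crawler and extraction code actually used are not part of the card):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

p = TextExtractor()
p.feed("<html><body><script>var x=1;</script><p>Storm hits coast.</p></body></html>")
print(p.text())  # "Storm hits coast."
```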

Data Filtering

  • Filtering: None

Social Impact

Social Bias

  • Bias: No documented analysis; it is unclear whether the dataset contains social bias

Producer Representativeness

  • Representativeness: Content focuses on UK news and does not represent global English usage