GEM/xsum

XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is primarily used for abstractive summarization. Each record has document, summary, and ID fields, and the data is randomly split into training, validation, and test sets. The creators are from the University of Edinburgh, and the license is CC BY‑SA 4.0.

Updated 10/24/2022

Description

Dataset Overview

Basic Information

  • Name: XSum
  • Language: English
  • License: cc‑by‑sa‑4.0
  • Task Category: Summarization
  • Source: Original data

Dataset Details

Dataset Summary

XSum is an English news summarization dataset. The task is to predict the first sentence of an article from the rest of the article.

Dataset Structure

  • Data fields:
    • Document: input news article.
    • Summary: one‑sentence summary of the article.
    • Id: BBC article ID.
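A record with these three fields could be modelled as below. This is a minimal sketch: the class name `XSumRecord` is hypothetical, the lowercase field names are an assumption based on the list above (column names in a published release may differ), and the example values are placeholders rather than real dataset content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XSumRecord:
    """One example, using the three fields named on this card."""

    document: str  # input news article
    summary: str   # one-sentence summary of the article
    id: str        # BBC article ID


# Placeholder values for illustration only; not taken from the dataset.
example = XSumRecord(
    document="Placeholder article body with several sentences of news text.",
    summary="Placeholder one-sentence summary.",
    id="00000000",
)

print(example.id, "->", example.summary)
```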

Dataset Uses

The dataset is used for extreme abstractive summarization, aiming to answer “what is the article about” with a single sentence.

Dataset Creators

  • Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata
  • Affiliation: University of Edinburgh

Dataset Download and Documentation

Dataset Maintenance

  • Maintenance plan: None

Language and Usage

Language Coverage

  • Language: English (British English)
  • Producers: Professional journalists

License

  • License details: Creative Commons Attribution Share Alike 4.0 International

Primary Task

  • Task: Summarization

Communication Goal

  • Goal: Given a news article, generate a single‑sentence summary of its content.

Dataset Creation and Funding

Creation Organization

  • Organization type: Academic
  • Organization: University of Edinburgh

Funding Sources

  • Funding: European Research Council, EU Horizon 2020 SUMMA project, Huawei Technologies

Dataset Structure and Labels

Data Splits

  • Splits: Train (204,045 articles), Validation (11,332 articles), Test (11,334 articles)
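The split sizes above add up to 226,711 articles, which works out to roughly a 90/5/5 division. The short calculation below shows the arithmetic; the dictionary name `SPLIT_SIZES` is just an illustrative choice.

```python
# Split sizes as listed on this card.
SPLIT_SIZES = {"train": 204_045, "validation": 11_332, "test": 11_334}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total: {total}")
for name, share in shares.items():
    # Format each share as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```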

Split Criteria

  • Criteria: Random split using identifiers in the URL
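The card does not describe the exact procedure, so the following is only a sketch of one common way to derive a reproducible split from an identifier: hash the ID and map it to a bucket. The function name `assign_split`, the use of SHA-256, and the 90/5/5 thresholds are all assumptions made for illustration, not the creators' method.

```python
import hashlib


def assign_split(article_id: str, train_pct: int = 90, validation_pct: int = 5) -> str:
    """Map an article ID to a split name deterministically (hypothetical scheme)."""
    # Hash the ID so that the bucket does not depend on ID ordering.
    digest = hashlib.sha256(article_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in the range 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + validation_pct:
        return "validation"
    return "test"


# Made-up IDs for illustration; the same ID always lands in the same split.
for made_up_id in ["10000001", "10000002", "10000003"]:
    print(made_up_id, assign_split(made_up_id))
```

Because the assignment depends only on the ID, rerunning the script never moves an article between splits, which is the main reason to prefer this over shuffling with an unseeded random generator.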

Label Selection

  • Labels: The first sentence of the source article is used as the label.
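To make the labelling rule concrete, the sketch below separates the first sentence of a text from the remainder. It assumes a naive rule (a sentence ends at ".", "!" or "?" followed by whitespace), which will mis-split on abbreviations; the function name `split_first_sentence` and the sample text are invented for illustration and do not reflect how the creators identified the first sentence.

```python
import re


def split_first_sentence(article: str) -> tuple[str, str]:
    """Return (first sentence, remaining text) using a naive sentence boundary."""
    # Split once at whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", article.strip(), maxsplit=1)
    summary = parts[0]
    document = parts[1] if len(parts) > 1 else ""
    return summary, document


# Invented sample text, not from the dataset.
text = "The council approved the plan. Work starts in May. Residents were consulted."
summary, document = split_first_sentence(text)
print("summary: ", summary)
print("document:", document)
```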

Data Collection

Original Collection Rationale

  • Rationale: To evaluate truly abstractive models, as most existing summarization datasets favor extractive methods.

Language Data Acquisition

  • Acquisition method: Single‑website crawl

Data Pre‑processing

  • Pre‑processing: Extraction of text from HTML, no further processing
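Text extraction from HTML can be approximated with the Python standard library alone. The sketch below is not the creators' pipeline; the class `TextExtractor`, the helper `html_to_text`, and the decision to skip `script` and `style` content are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented page fragment for illustration.
page = "<body><script>var x = 1;</script><p>First paragraph.</p><p>Second paragraph.</p></body>"
print(html_to_text(page))
```

Joining chunks with a single space is a simplification: inline markup in the middle of a sentence would introduce extra spaces, so a production extractor would need more careful whitespace handling.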

Data Filtering

  • Filtering: None

Social Impact

Social Bias

  • Bias: Unclear whether documented social bias exists

Producer Representativeness

  • Representativeness: Content focuses on UK news and does not represent global English usage

Topics

Sentiment Analysis
Text Summarization

Source

Organization: hugging_face

Created: Unknown
