GEM/xsum
XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is primarily used for abstractive summarization. Each record contains document, summary, and ID fields, and the corpus is randomly split into training, validation, and test sets. The dataset was created at the University of Edinburgh and is released under CC BY‑SA 4.0.
Dataset description and usage context
Dataset Overview
Basic Information
- Name: XSum
- Language: English
- License: cc‑by‑sa‑4.0
- Task Category: Summarization
- Source: Original data
Dataset Details
Dataset Summary
XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article.
Dataset Structure
- Data fields:
  - Document: input news article.
  - Summary: one‑sentence summary of the article.
  - Id: BBC article ID.
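The three fields above map naturally onto a plain dictionary. A minimal sketch of one record is shown below; the field names follow this card, but the values are invented for illustration and are not taken from the corpus:

```python
# Hypothetical XSum-style record. Field names ("document", "summary", "id")
# follow the dataset card; the values are invented for illustration.
record = {
    "document": "The new bridge opened to traffic on Monday after two years of construction.",
    "summary": "A long-awaited bridge has finally opened.",
    "id": "12345678",  # BBC article ID (hypothetical value)
}

# All three fields are plain strings.
assert set(record) == {"document", "summary", "id"}
assert all(isinstance(v, str) for v in record.values())
```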
Dataset Uses
The dataset is used for extreme abstractive summarization, aiming to answer “what is the article about” with a single sentence.
Dataset Creators
- Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata
- Affiliation: University of Edinburgh
Dataset Download and Documentation
- Download link: GitHub
- Related paper: ACL Anthology
Dataset Maintenance
- Maintenance plan: None
Language and Usage
Language Coverage
- Language: English (British English)
- Producers: Professional journalists
License
- License details: Creative Commons Attribution Share Alike 4.0 International
Primary Task
- Task: Summarization
Communication Goal
- Goal: Given a news article, generate a single‑sentence summary of its content.
Dataset Creation and Funding
Creation Organization
- Organization type: Academic
- Organization: University of Edinburgh
Funding Sources
- Funding: European Research Council, EU Horizon 2020 SUMMA project, Huawei Technologies
Dataset Structure and Labels
Data Splits
- Splits: Train (204,045 articles), Validation (11,332 articles), Test (11,334 articles)
Split Criteria
- Criteria: Random split using identifiers in the URL
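One common way to realize a random split over URL identifiers is to hash each identifier into a bucket. The helper below is an illustrative sketch of that idea, not the authors' actual procedure; the 90/5/5 proportions roughly match the published split sizes:

```python
import hashlib

def assign_split(article_id: str) -> str:
    """Deterministically map an article ID to a split bucket.

    Illustrative only: the real XSum split was fixed by the dataset
    authors. This sketch hashes the ID into one of 100 buckets and
    assigns roughly 90% train, 5% validation, 5% test.
    """
    bucket = int(hashlib.md5(article_id.encode()).hexdigest(), 16) % 100
    if bucket < 90:
        return "train"
    elif bucket < 95:
        return "validation"
    return "test"
```

Because the assignment depends only on the identifier, the same article always lands in the same split, which keeps the partition stable across re-crawls.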
Label Selection
- Labels: The first sentence of the source article is used as the summary label; the remainder serves as the input document.
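The label construction above can be sketched in a few lines. The naive split on `". "` below is a simplification of the real preprocessing and is shown only to illustrate the idea; the function name is hypothetical:

```python
def make_pair(article: str) -> tuple[str, str]:
    """Split an article into (summary, document).

    XSum uses the article's first sentence as the reference summary
    and the rest as the input document. The '. ' split here is a
    deliberate simplification of the actual preprocessing.
    """
    first, _, rest = article.partition(". ")
    return first + ".", rest

summary, document = make_pair(
    "A bridge has opened. It took two years to build. Crowds gathered."
)
# summary  -> "A bridge has opened."
# document -> "It took two years to build. Crowds gathered."
```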
Data Collection
Original Collection Rationale
- Rationale: To evaluate truly abstractive models, as most existing datasets are extractive.
Language Data Acquisition
- Acquisition method: Single‑website crawl
Data Pre‑processing
- Pre‑processing: Extraction of text from HTML, no further processing
Data Filtering
- Filtering: None
Social Impact
Social Bias
- Bias: Unclear whether documented social bias exists
Producer Representativeness
- Representativeness: Content focuses on UK news and does not represent global English usage