GEM/xsum
XSum is an English news summarization dataset. The task is to predict the first sentence of an article based on the rest of the article. The data comes from BBC articles written in British English and is primarily used for abstractive summarization. Each record has document, summary, and ID fields, and the data is randomly split into training, validation, and test sets. The creators are from the University of Edinburgh, and the license is CC BY‑SA 4.0.
Description
Dataset Overview
Basic Information
- Name: XSum
- Language: English
- License: cc‑by‑sa‑4.0
- Task Category: Summarization
- Source: Original data
Dataset Details
Dataset Summary
XSum is an English news summarization dataset. The task is to predict the first sentence of an article based on the rest of the article.
Dataset Structure
- Data fields:
  - Document: input news article.
  - Summary: one‑sentence summary of the article.
  - Id: BBC article ID.
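The three fields listed above can be modelled as a small record type. This is a minimal sketch: the lowercase key names (`document`, `summary`, `id`) and the helper `from_dict` are assumptions made for illustration, and the actual column names in a given distribution of the dataset may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XSumRecord:
    """One example; field names are assumed, lowercased from the list above."""
    document: str  # input news article
    summary: str   # one-sentence summary of the article
    id: str        # BBC article ID


def from_dict(row: dict) -> XSumRecord:
    """Build a record from a mapping, failing loudly if a field is missing."""
    missing = {"document", "summary", "id"} - set(row)
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    # IDs are normalised to strings so numeric and textual IDs compare equal.
    return XSumRecord(
        document=row["document"],
        summary=row["summary"],
        id=str(row["id"]),
    )
```

Freezing the dataclass keeps records hashable and prevents accidental edits to an example after it has been loaded.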
Dataset Uses
The dataset is used for extreme abstractive summarization, aiming to answer “what is the article about” with a single sentence.
Dataset Creators
- Creators: Shashi Narayan, Shay B. Cohen, Mirella Lapata
- Affiliation: University of Edinburgh
Dataset Download and Documentation
- Download link: GitHub
- Related paper: ACL Anthology
Dataset Maintenance
- Maintenance plan: None
Language and Usage
Language Coverage
- Language: English (British English)
- Producers: Professional journalists
License
- License details: Creative Commons Attribution Share Alike 4.0 International
Primary Task
- Task: Summarization
Communication Goal
- Goal: Given a news article, generate a single‑sentence summary of its content.
Dataset Creation and Funding
Creation Organization
- Organization type: Academic
- Organization: University of Edinburgh
Funding Sources
- Funding: European Research Council, EU Horizon 2020 SUMMA project, Huawei Technologies
Dataset Structure and Labels
Data Splits
- Splits: Train (204,045 articles), Validation (11,332 articles), Test (11,334 articles)
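The split sizes above add up to 226,711 articles, which works out to roughly a 90/5/5 division. The short calculation below reproduces that arithmetic; only the three counts come from the card, and the variable names are illustrative.

```python
# Split sizes as listed on the card.
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

# Print each split with its article count and its share of the whole dataset.
for name, share in shares.items():
    print(f"{name:<10} {splits[name]:>7,} {share:6.2%}")
print(f"{'total':<10} {total:>7,}")
```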
Split Criteria
- Criteria: Random split using identifiers in the URL
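The card does not say how the URL identifiers were mapped to splits. As an illustration only, one common way to get a reproducible random split from an identifier is hash bucketing, sketched below. The function `assign_split` and the 90/5/5 thresholds are assumptions, not the creators' procedure, and this sketch will not reproduce the official split.

```python
import hashlib


def assign_split(article_id: str, train: float = 0.90, validation: float = 0.05) -> str:
    """Map an identifier to a split name deterministically (hypothetical scheme)."""
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"
```

Because the assignment depends only on the identifier, the same article always lands in the same split, regardless of processing order.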
Label Selection
- Labels: The first sentence of the source article is used as the label
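The label construction described above can be sketched as splitting an article into its first sentence (the summary) and the remainder (the document). The regex-based splitter here is a deliberately naive assumption: it will mis-split on abbreviations such as "Mr." and is not the creators' tooling.

```python
import re
from typing import Tuple

# Naive boundary: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_first_sentence(article: str) -> Tuple[str, str]:
    """Return (first_sentence, rest_of_article) using the naive boundary above."""
    parts = _SENTENCE_END.split(article.strip(), maxsplit=1)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return first, rest


summary, document = split_first_sentence(
    "A storm hit the coast. Roads were closed. Trains stopped."
)
# summary  -> "A storm hit the coast."
# document -> "Roads were closed. Trains stopped."
```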
Data Collection
Original Collection Rationale
- Rationale: To evaluate truly abstractive models, as most existing datasets favor extractive approaches.
Language Data Acquisition
- Acquisition method: Single‑website crawl
Data Pre‑processing
- Pre‑processing: Extraction of text from HTML, no further processing
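As a rough illustration of extracting text from HTML with no further processing, the sketch below uses Python's standard-library `html.parser`. The class `TextExtractor` and the function `html_to_text` are hypothetical names; the card does not describe the tool the creators actually used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def get_text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Joining chunks with a single space is a simplification: text broken up by inline tags can gain stray spaces before punctuation, which a production extractor would need to handle.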
Data Filtering
- Filtering: None
Social Impact
Social Bias
- Bias: Unclear whether documented social bias exists
Producer Representativeness
- Representativeness: Content focuses on UK news and does not represent global English usage
Source
Organization: hugging_face
Created: Unknown