Dataset asset · Open Source Community · Sports News · Database
X-SUM database
The X‑SUM database is a collection of BBC online news articles focused on the sports category, comprising approximately 50,000 sports-only articles covering 60 different sports.
Source
GitHub
Created
Mar 21, 2023
Updated
Dec 15, 2023
Availability
Linked source ready
Dataset Description
- Name: X‑SUM Database
- Source: Online news articles from the BBC (UK)
- Category: Sports articles
- Quantity: Approximately 50,000 pure‑sports articles, covering 60 different sports
Summarization Techniques
- Two main approaches:
- Extractive Summarization: Selects the most relevant sentences or phrases directly from the source document and concatenates them to form a summary
- Abstractive Summarization: Uses deep‑learning methods to generate a summary the way a human might, allowing the model to produce words that do not appear in the original document
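The extractive approach can be illustrated with a minimal frequency-based sketch (a common textbook illustration, not necessarily the method used with this dataset): each sentence is scored by how frequent its words are across the whole document, and the top-scoring sentences are kept in their original order.

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Frequency-based extractive summarization sketch: score each
    sentence by the document-wide frequency of its words, then keep
    the top-scoring sentences in their original order."""
    # naive sentence split on whitespace following ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = [sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
              for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))

article = ("United won the match 3-1. The striker scored twice in the second half. "
           "Fans celebrated outside the stadium. The manager praised the striker.")
print(extractive_summary(article, 2))
```

Note that every word in the output is copied verbatim from the source; an abstractive model, by contrast, is free to paraphrase.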
Model Comparison
- Baseline Model: Serves only as a reference, selecting the first three sentences of each article
- T5: End‑to‑end text‑to‑text transformer model suitable for various NLP tasks, including summarization
- BART: Denoising auto‑encoder for sequence‑to‑sequence modeling, pre‑trained by corrupting text with an arbitrary noising function and learning to reconstruct the original
- PEGASUS: Model specifically designed for abstractive summarization, pre‑trained with a self‑supervised gap‑sentence generation objective
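The lead-3 baseline above is easy to state precisely: it simply returns the first three sentences of each article. A minimal sketch (the naive punctuation-based sentence split is an assumption; a production pipeline would use a proper sentence tokenizer):

```python
import re

def lead3_baseline(article: str) -> str:
    """Baseline summary: the first three sentences of the article,
    a simple but often strong reference point for news summarization."""
    # naive sentence split on whitespace following ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    return " ".join(sentences[:3])

text = ("City kept a clean sheet. The keeper made five saves. "
        "The defence held firm. A late goal sealed it.")
print(lead3_baseline(text))
```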
Evaluation Results
- Metrics: ROUGE scores (ROUGE‑1, ROUGE‑2, ROUGE‑L, and ROUGE‑Lsum)
- Model Performance:

| Model    | ROUGE‑1 | ROUGE‑2 | ROUGE‑L | ROUGE‑Lsum |
|----------|---------|---------|---------|------------|
| Baseline | 0.168   | 0.020   | 0.107   | 0.107      |
| T5       | 0.171   | 0.023   | 0.117   | 0.166      |
| BART     | 0.203   | 0.041   | 0.135   | 0.166      |
| PEGASUS  | 0.472   | 0.269   | 0.412   | 0.414      |
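The two core metrics can be sketched from scratch: ROUGE‑1 is an F1 score over unigram overlap, and ROUGE‑L is an F1 score based on the longest common subsequence of tokens (ROUGE‑2 is the bigram analogue of ROUGE‑1, and ROUGE‑Lsum applies the LCS computation per sentence). This is a simplified sketch; published results typically use a reference implementation such as Google's `rouge-score` package.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1: F1 over unigram overlap between candidate and reference."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L: F1 based on the longest common subsequence of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    # dynamic-programming LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)
```

A perfect match scores 1.0 on both metrics; ROUGE‑L rewards matching word order, not just shared vocabulary.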
Model Fine‑tuning
- Goal: Improve PEGASUS performance on sports documents
- Method: Fine‑tune PEGASUS, adapting the weights learned during large‑scale pre‑training to the specific task of sports‑article summarization
- Results:

| PEGASUS            | ROUGE‑1 | ROUGE‑2 | ROUGE‑L | ROUGE‑Lsum |
|--------------------|---------|---------|---------|------------|
| Before fine‑tuning | 0.472   | 0.269   | 0.412   | 0.414      |
| After fine‑tuning  | 0.497   | 0.275   | 0.418   | 0.418      |
- Improvement: Mainly in the ROUGE‑1 metric, indicating better identification of relevant unigrams in the source documents
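The claim that the gain is concentrated in ROUGE‑1 can be sanity-checked by computing per-metric deltas from the numbers reported above:

```python
# ROUGE scores before/after fine-tuning, as reported above
before = {"ROUGE-1": 0.472, "ROUGE-2": 0.269, "ROUGE-L": 0.412, "ROUGE-Lsum": 0.414}
after  = {"ROUGE-1": 0.497, "ROUGE-2": 0.275, "ROUGE-L": 0.418, "ROUGE-Lsum": 0.418}

# improvement per metric from fine-tuning
deltas = {m: round(after[m] - before[m], 3) for m in before}
print(deltas)  # ROUGE-1 shows the largest gain (+0.025)
```

The other three metrics improve by only 0.004–0.006, consistent with the interpretation above.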