X-SUM database
The X‑SUM database is a collection of BBC (UK) online news articles focused on the sports category, comprising approximately 50,000 sports-only articles covering 60 different sports.
Updated 12/15/2023
Dataset Description
- Name: X‑SUM Database
- Source: Online articles from the UK BBC
- Category: Sports articles
- Quantity: Approximately 50,000 pure‑sports articles, covering 60 different sports
Text Techniques
- Summarization Techniques: Two main approaches
- Extractive Summarization: Selects the most salient sentences or phrases directly from the source document and concatenates them to form a summary
- Abstractive Summarization: Uses deep‑learning methods to paraphrase the way a human writer would, allowing the model to generate words that do not appear in the original document
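As an illustration of the extractive approach, sentence selection can be sketched with a simple word-frequency scorer (a minimal pure-Python sketch; the function name, regex-based sentence splitter, and scoring scheme are illustrative assumptions, not part of the dataset's tooling):

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Score sentences by the average frequency of the words they
    contain and return the top-scoring ones in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = []
    for i, sent in enumerate(sentences):
        words = re.findall(r"[a-z']+", sent.lower())
        # Length-normalized score so long sentences are not favored unfairly.
        scores.append((sum(freq[w] for w in words) / max(len(words), 1), i))
    top = sorted(sorted(scores, reverse=True)[:num_sentences], key=lambda t: t[1])
    return " ".join(sentences[i] for _, i in top)
```

Note that every word in the output already exists in the source text, which is exactly what separates extractive from abstractive summarization.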
Model Comparison
- Baseline Model: A reference point only; its summary is simply the first three sentences of each article (the standard lead‑3 baseline)
- T5: End‑to‑end text‑to‑text transformer model suitable for various NLP tasks, including summarization
- BART: Denoising auto‑encoder for sequence‑to‑sequence modeling that corrupts text with arbitrary noise and reconstructs the original
- PEGASUS: Model specifically designed for abstractive summarization, pre‑trained with a self‑supervised gap‑sentence generation objective
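The baseline in this comparison can be sketched in a few lines (a pure-Python sketch; the naive regex sentence splitter is an assumption, and real preprocessing of BBC articles may differ):

```python
import re

def lead3_baseline(article: str) -> str:
    """Lead-3 baseline: the summary is the first three sentences
    of the article, with no modeling at all."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:3])
```

Despite its simplicity, lead-3 is a meaningful reference because news articles tend to front-load their key information.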
Evaluation Results
- Metrics: ROUGE scores (ROUGE‑1, ROUGE‑2, ROUGE‑L and ROUGE‑L SUM)
- Model Performance:

  Model     ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-L SUM
  Baseline  0.168    0.020    0.107    0.107
  T5        0.171    0.023    0.117    0.166
  BART      0.203    0.041    0.135    0.166
  PEGASUS   0.472    0.269    0.412    0.414
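For reference, ROUGE‑1 measures clipped unigram overlap between a candidate summary and a reference summary; it can be sketched as an F1 score in pure Python (a minimal sketch for intuition only; the reported results would have been computed with a full ROUGE implementation, which also handles stemming and tokenization details):

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 as an F1 score over unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE‑2 follows the same idea over bigrams, while ROUGE‑L scores the longest common subsequence instead of fixed n-grams.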
Model Fine‑tuning
- Goal: Improve PEGASUS performance on sports documents
- Method: Fine‑tune PEGASUS, adjusting weights learned on a large dataset to the specific task
- Results:
- Before fine‑tuning: ROUGE‑1: 0.472, ROUGE‑2: 0.269, ROUGE‑L: 0.412, ROUGE‑L SUM: 0.414
- After fine‑tuning: ROUGE‑1: 0.497, ROUGE‑2: 0.275, ROUGE‑L: 0.418, ROUGE‑L SUM: 0.418
- Improvement: Mainly in the ROUGE‑1 metric, indicating better identification of relevant unigrams in the source documents
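The size of that gain can be worked out directly from the ROUGE‑1 figures above:

```python
# ROUGE-1 before/after fine-tuning, taken from the results above
before, after = 0.472, 0.497
absolute_gain = after - before          # ~0.025 absolute
relative_gain = absolute_gain / before  # ~5.3% relative
print(f"absolute: +{absolute_gain:.3f}, relative: {relative_gain:.1%}")
```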
Topics: Sports News, Database
Source
Organization: github
Created: 3/21/2023