JUHE API Marketplace

X-SUM database

The X‑SUM database is a collection of online articles from the BBC (UK), focused on the sports category: approximately 50,000 sports‑only articles covering 60 different sports.

Updated 12/15/2023

Description

Dataset Overview

  • Name: X‑SUM Database
  • Source: Online articles from the BBC (UK)
  • Category: Sports articles
  • Quantity: Approximately 50,000 sports‑only articles, covering 60 different sports

Text Techniques

  • Summarization Techniques: Two main approaches
    • Extractive Summarization: Selects the most relevant sentences or phrases directly from the source document and combines them to form a summary
    • Abstractive Summarization: Uses deep‑learning methods to paraphrase the way a human writer would, allowing the model to generate words that do not appear in the original document
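The extractive approach can be illustrated with a minimal frequency‑based sketch (a generic illustration, not the method used by any specific model in this comparison): score each sentence by how frequent its words are across the whole document, then keep the top‑scoring sentences in their original order.

```python
from collections import Counter
import re

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: keep the sentences whose words are
    most frequent across the whole document."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score a sentence by the total document-level frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Re-emit the selected sentences in their original document order.
    return " ".join(s for s in sentences if s in top)
```

Real extractive systems use richer sentence features, but the core idea (select and concatenate existing text) is the same.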

Model Comparison

  • Baseline Model: Serves only as a reference, taking the first three sentences of each article as the summary
  • T5: End‑to‑end text‑to‑text transformer model suitable for various NLP tasks, including summarization
  • BART: Denoising auto‑encoder for sequence‑to‑sequence modeling that corrupts text with arbitrary noise and reconstructs the original
  • PEGASUS: Model specifically designed for abstractive summarization, pre‑trained with a self‑supervised gap‑sentence generation objective
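PEGASUS's gap‑sentence generation objective removes "important" sentences from the input and asks the model to generate them. A simplified sketch of how such a (source, target) pre‑training pair could be built (here importance is approximated by word overlap with the rest of the document; the actual PEGASUS objective selects sentences by ROUGE):

```python
import re
from collections import Counter

def gap_sentence_pair(text, mask_token="<mask>"):
    """Simplified gap-sentence generation: mask the sentence that shares
    the most words with the rest of the document, and use that sentence
    as the generation target."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def tokens(s):
        return re.findall(r"[a-z']+", s.lower())

    def overlap(i):
        # Word overlap between sentence i and all remaining sentences.
        rest = Counter(t for j, s in enumerate(sentences) if j != i
                       for t in tokens(s))
        return sum(min(c, rest[t]) for t, c in Counter(tokens(sentences[i])).items())

    target_idx = max(range(len(sentences)), key=overlap)
    source = " ".join(mask_token if i == target_idx else s
                      for i, s in enumerate(sentences))
    return source, sentences[target_idx]
```

Because the target must be regenerated from context rather than copied, this objective closely mirrors abstractive summarization, which is why PEGASUS transfers so well to the task.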

Evaluation Results

  • Metrics: ROUGE scores (ROUGE‑1, ROUGE‑2, ROUGE‑L, and ROUGE‑L SUM)
  • Model Performance:
    • Baseline: ROUGE‑1: 0.168, ROUGE‑2: 0.020, ROUGE‑L: 0.107, ROUGE‑L SUM: 0.107
    • T5: ROUGE‑1: 0.171, ROUGE‑2: 0.023, ROUGE‑L: 0.117, ROUGE‑L SUM: 0.166
    • BART: ROUGE‑1: 0.203, ROUGE‑2: 0.041, ROUGE‑L: 0.135, ROUGE‑L SUM: 0.166
    • PEGASUS: ROUGE‑1: 0.472, ROUGE‑2: 0.269, ROUGE‑L: 0.412, ROUGE‑L SUM: 0.414
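ROUGE‑N measures n‑gram overlap between a candidate summary and a reference. A minimal sketch of the F1 variant (published evaluations typically use a standard implementation such as Google's rouge-score package, which adds stemming and other normalization):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1: harmonic mean of n-gram precision and recall."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n("the cat sat", "the cat sat down", n=1)` yields an F1 of about 0.857 (precision 1.0, recall 0.75).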

Model Fine‑tuning

  • Goal: Improve PEGASUS performance on sports documents
  • Method: Fine‑tune PEGASUS, adapting the weights learned during large‑scale pre‑training to the sports‑summarization task
  • Results:
    • Before fine‑tuning: ROUGE‑1: 0.472, ROUGE‑2: 0.269, ROUGE‑L: 0.412, ROUGE‑L SUM: 0.414
    • After fine‑tuning: ROUGE‑1: 0.497, ROUGE‑2: 0.275, ROUGE‑L: 0.418, ROUGE‑L SUM: 0.418
    • Improvement: Mainly in the ROUGE‑1 metric, indicating better identification of relevant unigrams in the source documents



Topics

Sports News
Database

Source

Organization: github

Created: 3/21/2023
