BBC-Dataset-News-Classification

The collection comprises 2,225 news articles published on the BBC News website in 2004 and 2005, covering five thematic domains: business, entertainment, politics, sports, and technology.

Source: GitHub
Created: Apr 8, 2019
Updated: Dec 5, 2019

Dataset Overview

Dataset Name

BBC-Dataset-News-Classification

Dataset Content

  • Number of Documents: 2,225
  • Source: BBC News website
  • Time Span: 2004–2005
  • Thematic Domains: 5 (business, entertainment, politics, sports, technology)

Dataset Structure

  • File Description:
    • dataset/data_files: Folder containing the news articles as individual .txt files.
    • dataset/dataset.csv: CSV file with two columns, “News” and “Category”; the “News” column holds the article text and the “Category” column holds the label.
    • model/get_data.py: Script that consolidates all txt files into a single two‑column (“News”, “Category”) CSV file.
    • model/model.py: Contains preprocessing, TF‑IDF feature extraction, model construction, and evaluation code.
    • model/test.ipynb: Jupyter notebook.
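
The consolidation step attributed to model/get_data.py can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: it assumes a hypothetical layout in which each category has its own subfolder under dataset/data_files, so the folder name can serve as the label. The function name `consolidate` and the demo files are likewise invented for the example.

```python
import csv
import tempfile
from pathlib import Path


def consolidate(data_dir, csv_path):
    """Write one (News, Category) row per .txt file found under data_dir.

    Assumption: files live at data_dir/<category>/<name>.txt, so the parent
    folder name is used as the label. The real layout may differ.
    """
    rows = []
    for txt_file in sorted(Path(data_dir).glob("*/*.txt")):
        # errors="replace" keeps going if a file is not valid UTF-8.
        text = txt_file.read_text(encoding="utf-8", errors="replace").strip()
        rows.append({"News": text, "Category": txt_file.parent.name})

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["News", "Category"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on a throwaway directory so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, "data_files")
    samples = [("business", "Shares rise\n"), ("tech", "New chip, launched")]
    for category, text in samples:
        (root / category).mkdir(parents=True)
        (root / category / "001.txt").write_text(text, encoding="utf-8")

    out_csv = Path(tmp, "dataset.csv")
    n_rows = consolidate(root, out_csv)
    with open(out_csv, newline="", encoding="utf-8") as handle:
        demo_rows = list(csv.DictReader(handle))
```

Against the real files, the call would look like `consolidate("dataset/data_files", "dataset/dataset.csv")`, provided the folder layout matches the assumption above. The csv module handles quoting, so commas and line breaks inside article text do not break the two-column structure.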

Dataset Split

  • Training Set: 1,780 samples (80% of the 2,225 documents)
  • Test Set: 445 samples (20% of the 2,225 documents)
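
The counts above match an 80/20 partition of the 2,225 documents. The sketch below reproduces those sizes with a seeded shuffle. Whether the original split was random, stratified by category, or made some other way is not stated, so the helper `split_indices`, the seed, and the shuffling strategy are all assumptions made for illustration.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_samples).

    Assumption: a plain random split. The source does not say how its
    1,780 / 445 partition was produced.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2225)
print(len(train_idx), len(test_idx))  # prints "1780 445"
```

With five categories, a stratified split that preserves the category proportions in both subsets would be a reasonable alternative to the plain shuffle shown here.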

Evaluation Results

  • Accuracy: 0.9573
  • Kappa Coefficient: 0.9461
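
For readers who want to recompute both figures from a list of predictions, here is a minimal pure-Python sketch. The page does not say which kappa variant was used; the code assumes Cohen's kappa, defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The function names and toy labels are invented for the example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa (assumed variant): chance-corrected agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    for example when both lists contain a single identical label.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over labels of P(true = label) * P(pred = label).
    p_expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels, not taken from the dataset.
y_true = ["business", "business", "sport", "sport", "tech", "tech"]
y_pred = ["business", "sport", "sport", "sport", "tech", "tech"]

toy_accuracy = accuracy(y_true, y_pred)  # 5 of 6 correct
toy_kappa = cohen_kappa(y_true, y_pred)  # (5/6 - 1/3) / (1 - 1/3) = 0.75
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by guessing according to the label frequencies alone.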