BBC-Dataset-News-Classification

The collection comprises 2,225 news articles published on the BBC News website between 2004 and 2005, covering five thematic domains: business, entertainment, politics, sports, and technology.

Source: GitHub
Created: Apr 8, 2019
Updated: Dec 5, 2019

Dataset Overview

Dataset Name

BBC-Dataset-News-Classification

Dataset Content

Number of Documents: 2,225
Source: BBC News website
Time Span: 2004–2005
Thematic Domains: 5 (business, entertainment, politics, sports, technology)

Dataset Structure

File Description:
- dataset/data_files: Folder containing the news articles as individual .txt files.
- dataset/dataset.csv: CSV file with two columns, “News” and “Category”; the “News” column holds the article text and the “Category” column holds the label.
- model/get_data.py: Script that consolidates all txt files into a single two‑column (“News”, “Category”) CSV file.
- model/model.py: Contains preprocessing, TF‑IDF feature extraction, model construction, and evaluation code.
- model/test.ipynb: Jupyter notebook.
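
The consolidation step described for model/get_data.py can be sketched roughly as follows. This is not the repository's actual script: the function name `build_csv` is hypothetical, and the sketch assumes each category has its own subfolder under dataset/data_files (for example dataset/data_files/business/001.txt), which the description above does not confirm.

```python
import csv
from pathlib import Path


def build_csv(data_dir, csv_path):
    """Write one (News, Category) row per .txt file found under data_dir.

    Assumption: files are grouped in one subfolder per category, and the
    subfolder name is used as the label. The real layout may differ.
    """
    rows = []
    for txt_file in sorted(Path(data_dir).glob("*/*.txt")):
        # errors="replace" keeps going if a file is not valid UTF-8.
        text = txt_file.read_text(encoding="utf-8", errors="replace")
        # Collapse newlines and repeated spaces so each article fits one cell.
        news = " ".join(text.split())
        rows.append({"News": news, "Category": txt_file.parent.name})

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["News", "Category"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `build_csv("dataset/data_files", "dataset/dataset.csv")` would return the number of articles written. On the full collection that should be 2,225 if the assumed layout holds.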

Dataset Split

Training Set: 1,780 samples
Test Set: 445 samples
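
The sizes above correspond to an 80/20 split of the 2,225 documents (445 / 2,225 = 0.2). The source does not say how the split was drawn, so the sketch below simply assumes a seeded random shuffle. The function name `split_indices`, the seed, and the lack of stratification are all illustrative choices.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into (train, test) lists."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2225)
print(len(train_idx), len(test_idx))  # 1780 445
```

The returned index lists can then be used to select rows from dataset/dataset.csv. A stratified split, which keeps the five categories in the same proportions in both sets, would be a reasonable alternative.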

Evaluation Results

Accuracy: 0.9573
Kappa Coefficient: 0.9461
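
As a reading aid, an accuracy of 0.9573 on 445 test samples is consistent with 426 correct predictions (426 / 445 ≈ 0.9573). The sketch below shows how accuracy and Cohen's kappa are conventionally computed from true and predicted labels. It is a plain-Python illustration, not the evaluation code from model/model.py, and the toy labels are made up for the example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance.

    Undefined (division by zero) when expected agreement is 1, i.e. when
    both label lists contain a single identical class.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels, invented purely for illustration.
y_true = ["business", "business", "sport", "sport", "tech", "tech"]
y_pred = ["business", "sport", "sport", "sport", "tech", "tech"]
print(round(accuracy(y_true, y_pred), 3), round(cohen_kappa(y_true, y_pred), 3))
```

On the toy labels this gives an accuracy of about 0.833 and a kappa of 0.75. Kappa falls below accuracy because it discounts the agreement a random labeler with the same class frequencies would reach, which matches the pattern in the reported figures (0.9461 versus 0.9573).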
