BBC-Dataset-News-Classification

The collection comprises 2,225 news articles published on the BBC News website in 2004 and 2005, covering five thematic domains: business, entertainment, politics, sports, and technology.

Updated 12/5/2019
github

Description

Dataset Overview

Dataset Name

BBC-Dataset-News-Classification

Dataset Content

  • Number of Documents: 2,225
  • Source: BBC News website
  • Time Span: 2004–2005
  • Thematic Domains: 5 (business, entertainment, politics, sports, technology)

Dataset Structure

  • File Description:
    • dataset/data_files: Folder containing multiple news txt files.
    • dataset/dataset.csv: CSV file with two columns, “News” and “Category”; the “News” column holds the article text and the “Category” column holds the label.
    • model/get_data.py: Script that consolidates all txt files into a single two‑column (“News”, “Category”) CSV file.
    • model/model.py: Contains preprocessing, TF‑IDF feature extraction, model construction, and evaluation code.
    • model/test.ipynb: Jupyter notebook.
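The consolidation step described for model/get_data.py could look roughly like the sketch below. It is not the repository's actual script. It assumes that dataset/data_files holds one subfolder per category, so the folder name can serve as the label; the listing does not confirm that layout. The function name `consolidate` is hypothetical.

```python
import csv
from pathlib import Path


def consolidate(data_dir, csv_path):
    """Write one (News, Category) row per .txt file found under data_dir.

    Assumption: each category has its own subfolder, e.g.
    data_dir/business/001.txt, so the folder name is used as the label.
    """
    rows = []
    for txt_file in sorted(Path(data_dir).glob("*/*.txt")):
        # errors="replace" keeps the run going if a file contains stray bytes
        text = txt_file.read_text(encoding="utf-8", errors="replace")
        # Collapse line breaks and repeated spaces so each article fits one cell
        rows.append((" ".join(text.split()), txt_file.parent.name))

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["News", "Category"])  # column names from the listing
        writer.writerows(rows)
    return len(rows)
```

Calling `consolidate("dataset/data_files", "dataset/dataset.csv")` would then produce the two-column CSV and return the number of articles written.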

Dataset Split

  • Training Set: 1,780 samples
  • Test Set: 445 samples
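The sizes above correspond to an 80/20 split of the 2,225 documents. The listing does not say whether the split was random or stratified by category, so the following is only a minimal sketch of a seeded random split; the function name `split_indices` and the seed value are illustrative assumptions.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx): two disjoint lists covering range(n_samples)."""
    n_test = round(n_samples * test_fraction)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2225)
print(len(train_idx), len(test_idx))  # 1780 445
```

A stratified split would additionally keep the five categories in the same proportions in both sets, which matters more when classes are imbalanced.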

Evaluation Results

  • Accuracy: 0.9573
  • Kappa Coefficient: 0.9461
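For reference, both metrics can be computed from the true and predicted labels alone. The sketch below uses the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The labels in the example are toy data, not the dataset's predictions. As a consistency check, the reported accuracy matches 426 correct predictions out of 445 test samples (426 / 445 ≈ 0.9573).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when p_e is 1, i.e. when both
    sequences contain a single identical label throughout.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with two classes and one mistake
y_true = ["business", "business", "tech", "tech"]
y_pred = ["business", "business", "tech", "business"]
print(accuracy(y_true, y_pred))     # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Because kappa discounts chance agreement, it sits slightly below accuracy when the classifier is strong and the classes are roughly balanced, which fits the pattern of the two figures reported above.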

Topics

News Classification
Text Analysis

Source

Organization: github

Created: 4/8/2019
