DATASET
Open Source Community
BBC-Dataset-News-Classification
The collection comprises 2,225 news articles published on the BBC News website in 2004 and 2005, covering five thematic domains: business, entertainment, politics, sports, and technology.
Updated 12/5/2019
github
Description
Dataset Overview
Dataset Name
BBC-Dataset-News-Classification
Dataset Content
- Number of Documents: 2,225
- Source: BBC News website
- Time Span: 2004–2005
- Thematic Domains: 5 (business, entertainment, politics, sports, technology)
Dataset Structure
- File Description:
  - dataset/data_files: Folder containing multiple news txt files.
  - dataset/dataset.csv: CSV file with two columns, “News” and “Category”; the “News” column holds the article text and the “Category” column holds the label.
  - model/get_data.py: Script that consolidates all txt files into a single two‑column (“News”, “Category”) CSV file.
  - model/model.py: Contains preprocessing, TF‑IDF feature extraction, model construction, and evaluation code.
  - model/test.ipynb: Jupyter notebook.
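The consolidation step attributed to model/get_data.py can be sketched as follows. This is a minimal sketch, not the repository's actual script: it assumes each category has its own subfolder under dataset/data_files (for example dataset/data_files/business/001.txt), which the file description does not confirm. The function name `build_csv` and the choice to collapse whitespace are illustrative.

```python
import csv
from pathlib import Path


def build_csv(data_dir, csv_path):
    """Write one (News, Category) row per .txt file found under data_dir.

    Assumption: data_dir holds one subfolder per category, and the
    subfolder name is used as the label.
    """
    rows = []
    for category_dir in sorted(Path(data_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for txt_file in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps the loop running if a file has stray bytes
            text = txt_file.read_text(encoding="utf-8", errors="replace")
            # Collapse line breaks and repeated spaces into single spaces
            rows.append((" ".join(text.split()), category_dir.name))

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["News", "Category"])
        writer.writerows(rows)
    return len(rows)
```

Calling `build_csv("dataset/data_files", "dataset/dataset.csv")` would then return the number of articles written, which should be 2,225 if the folder layout matches the assumption.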
Dataset Split
- Training Set: 1,780 samples
- Test Set: 445 samples
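These sizes correspond to an 80/20 partition of the 2,225 documents (445 / 2,225 = 0.2). The listing does not say how the split was drawn; the sketch below assumes a simple seeded random shuffle, and `split_dataset`, `test_fraction`, and `seed` are illustrative names rather than anything taken from the repository.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 2,225 (News, Category) rows
train, test = split_dataset(range(2225))
print(len(train), len(test))  # 1780 445
```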
Evaluation Results
- Accuracy: 0.9573
- Kappa Coefficient: 0.9461
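An accuracy of 0.9573 on 445 test samples is consistent with 426 correct predictions (426 / 445 ≈ 0.9573). Assuming "Kappa Coefficient" refers to Cohen's kappa, it corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes both metrics from label lists; the function name and the toy labels are illustrative and do not reproduce the reported results.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(y_true)
    # Observed agreement: fraction of exact matches (this is the accuracy)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)


y_true = ["business", "business", "politics", "politics"]
y_pred = ["business", "politics", "politics", "politics"]
print(accuracy_and_kappa(y_true, y_pred))  # (0.75, 0.5)
```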
Topics
News Classification
Text Analysis
Source
Organization: github
Created: 4/8/2019