Dataset asset · Open Source Community · Natural Language Processing · Text Classification
Yahoo_Answers_10_categories_for_NLP
The Yahoo Answers topic classification dataset is constructed using the 10 largest primary categories. Each category contains 140,000 training samples and 6,000 test samples, totaling 1,400,000 training samples and 60,000 test samples. The dataset files include classes.txt, train.csv, and test.csv, where each sample has four columns: category index, question title, question content, and best answer.
Source
huggingface
Created
Jul 27, 2024
Updated
Jul 27, 2024
Overview
Dataset description and usage context
Dataset Card
Dataset Overview
- Dataset Name: Yahoo Answers 10 categories for NLP
- Task Type: Text Classification
- Tags: categories, text data, nlp, yelp, fine-grained, 10 classes, yahoo, answers
- Language: English
- Data Scale: 1M<n<10M
- License: Apache 2.0
Dataset Description
- Dataset Construction: Built using the 10 largest primary categories of Yahoo! Answers.
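- Data Content: Of all the answers and other metadata available for each question, only the best answer content and the primary category are used, alongside the question title and question content.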
- File Description:
  - classes.txt: Lists the category name for each label.
  - train.csv and test.csv: Contain all training and test samples in CSV format. Each row has 4 columns: category index (1 to 10), question title, question content, and best answer. Text fields are enclosed in double quotes ("); an internal double quote is escaped as two double quotes (""); a newline character is escaped as a backslash followed by "n" (that is, "\n").
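The CSV conventions described above can be read with Python's standard `csv` module, whose default dialect already treats two double quotes inside a quoted field as one literal quote. The following is a minimal sketch, not an official loader: the sample rows are invented for illustration, and the helper names `unescape` and `read_rows` are hypothetical. It also assumes that every backslash-n sequence in a field stands for a newline, which the description does not guarantee for text that originally contained a literal backslash.

```python
import csv
import io

# Hypothetical two-row sample in the documented format (not real dataset rows).
SAMPLE = (
    '"5","why is the sky blue?","I always wondered.\\nAny ideas?",'
    '"Because of ""Rayleigh"" scattering."\n'
    '"2","what is 2+2?","","4"\n'
)

def unescape(text):
    """Turn the literal backslash-n sequence back into a real newline (assumption)."""
    return text.replace("\\n", "\n")

def read_rows(handle):
    """Yield (category_index, title, content, best_answer) from an open CSV handle."""
    for label, title, content, answer in csv.reader(handle):
        yield int(label), unescape(title), unescape(content), unescape(answer)

rows = list(read_rows(io.StringIO(SAMPLE)))
```

To read the real files, the same function could be given a file handle opened with `open("train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the card does not state one.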
Dataset Source
- Kaggle Link: https://www.kaggle.com/datasets/yacharki/yahoo-answers-10-categories-for-nlp-csv
- DOI: 10.34740/KAGGLE/DSV/5339321
- Authors: Xiang Zhang and Acharki Yassir
- Year: 2023
Dataset Structure
- File List:
  - Readme.md
  - test.csv
  - train.csv
  - classes.txt
Dataset Usage
- Direct Use: Fine-grained text classification
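For classification work, the 1-based category index usually has to be mapped to a class name and often to a 0-based label as well. The sketch below shows one way to do that, assuming classes.txt holds one category name per line and that line k corresponds to category index k; the card only says the file lists the categories for each label, so that ordering is an assumption. The placeholder names and the helpers `load_classes`, `label_name`, and `to_zero_based` are hypothetical.

```python
import io

# Placeholder stand-in for classes.txt; the real file holds the actual category names.
SAMPLE_CLASSES = "".join(f"placeholder_category_{i}\n" for i in range(1, 11))

def load_classes(handle):
    """Return class names in file order, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]

def label_name(classes, index):
    """Map a 1-based category index (as stored in the CSV files) to its name."""
    if not 1 <= index <= len(classes):
        raise ValueError(f"category index {index} is outside 1..{len(classes)}")
    return classes[index - 1]

def to_zero_based(index):
    """Convert the 1-based category index to the 0-based label most libraries expect."""
    return index - 1

classes = load_classes(io.StringIO(SAMPLE_CLASSES))
```

With the real file, `load_classes(open("classes.txt", encoding="utf-8"))` would be expected to return 10 names, one per category.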