Tags: Open Source Community · Text Classification · Multi-class Classification
CogComp/trec
The Text Retrieval Conference (TREC) question classification dataset contains 5,452 training questions and 500 test questions, each annotated with one of six coarse‑grained categories and one of 50 fine‑grained categories. The questions come from four sources: roughly 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for rare classes, 894 questions from TREC‑8 and TREC‑9, and 500 TREC‑10 questions that form the test set. All questions are manually labeled. The task is text classification, specifically multi‑class classification.
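For orientation, loading the dataset with the Hugging Face `datasets` library might look like the sketch below; the hub ID `CogComp/trec` is taken from the title of this card, and `pip install datasets` is assumed.

```python
from datasets import load_dataset

# Download both splits from the Hub; returns a DatasetDict
# with "train" and "test" keys.
trec = load_dataset("CogComp/trec")

print(trec)              # split names and sizes
print(trec["train"][0])  # one example: question text plus coarse/fine labels
```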
Source: Hugging Face
Created: Nov 28, 2025
Updated: Jan 18, 2024
Dataset Overview
Basic Information
- Dataset Name: Text Retrieval Conference Question Answering (TRECQA)
- Language: English (en)
- License: Unknown
- Multilinguality: Monolingual
- Size: 1K < n < 10K
- Source: Raw data
- Task Type: Text Classification
- Task ID: Multi‑class Classification
- Paper ID: trecqa
- Pretty Name: Text Retrieval Conference Question Answering
Structure
Features
- text (string): question text.
- coarse_label (categorical): coarse categories, possible values:
- ABBR (0): abbreviation.
- ENTY (1): entity.
- DESC (2): description/abstract concept.
- HUM (3): human.
- LOC (4): location.
- NUM (5): numeric.
- fine_label (categorical): fine categories, grouped under ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC (see original for full list).
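Because both label columns are stored as integers, a common pattern is to decode them through the split's ClassLabel feature. A minimal sketch, assuming the label order given in the feature list above:

```python
from datasets import load_dataset

trec = load_dataset("CogComp/trec")

# ClassLabel features expose the string names and an int -> str helper.
coarse = trec["train"].features["coarse_label"]
print(coarse.names)  # ['ABBR', 'ENTY', 'DESC', 'HUM', 'LOC', 'NUM']

example = trec["train"][0]
print(example["text"], "->", coarse.int2str(example["coarse_label"]))
```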
Splits
- train: 5,452 samples
- test: 500 samples
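As a rough illustration of the multi‑class setup on these splits, a baseline might pair TF‑IDF features with logistic regression. This is a sketch, not a reference result; scikit‑learn is an assumption here and is not part of the dataset.

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

trec = load_dataset("CogComp/trec")
train, test = trec["train"], trec["test"]

# Bag-of-words features over the raw question text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

# Six-way classification over the coarse labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["coarse_label"])

print("coarse accuracy:", accuracy_score(test["coarse_label"], clf.predict(X_test)))
```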
Dataset Creation
Summary
- Training set: 5,452 labeled questions
- Test set: 500 labeled questions
- Coarse categories: 6
- Fine categories: 50
- Average question length: 10 tokens
- Vocabulary size: 8,700
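The length and vocabulary figures above can be sanity‑checked with a naive whitespace tokenizer. Whitespace tokenization and lowercasing are assumptions here; the original paper's tokenizer may differ, so exact counts can vary.

```python
from datasets import load_dataset

trec = load_dataset("CogComp/trec")
texts = trec["train"]["text"]

# Naive whitespace tokenization; lowercase for the vocabulary count.
lengths = [len(t.split()) for t in texts]
vocab = {tok.lower() for t in texts for tok in t.split()}

print("avg tokens per question:", sum(lengths) / len(lengths))
print("vocabulary size:", len(vocab))
```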
Sources
- 4,500 questions from USC (Hovy et al., 2001)
- ~500 manually created questions for rare classes
- 894 questions from TREC‑8 and TREC‑9
- 500 TREC‑10 questions as test set
Citation
@inproceedings{li-roth-2002-learning,
    title = "Learning Question Classifiers",
    author = "Li, Xin and Roth, Dan",
    booktitle = "COLING 2002: The 19th International Conference on Computational Linguistics",
    year = "2002",
    url = "https://www.aclweb.org/anthology/C02-1150",
}

@inproceedings{hovy-etal-2001-toward,
    title = "Toward Semantics-Based Answer Pinpointing",
    author = "Hovy, Eduard and Gerber, Laurie and others",
    booktitle = "Proceedings of the First International Conference on Human Language Technology Research",
    year = "2001",
    url = "https://www.aclweb.org/anthology/H01-1069",
}