CogComp/trec

The Text REtrieval Conference (TREC) Question Classification dataset contains roughly 5,500 labeled training questions (5,452 in the released train split) and 500 test questions. Each question carries one of six coarse‑grained labels and one of 50 fine‑grained labels. The questions come from four sources: 4,500 English questions published by USC, about 500 manually constructed questions, 894 questions from TREC‑8 and TREC‑9, and 500 questions from TREC‑10 that serve as the test set. All questions are manually labeled. The task is multi‑class text classification.

Updated 1/18/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Dataset Name: Text Retrieval Conference Question Answering (TRECQA)
  • Language: English (en)
  • License: Unknown
  • Multilinguality: Monolingual
  • Size: 1K < n < 10K
  • Source Datasets: Original
  • Task Type: Text Classification
  • Task ID: Multi‑class Classification
  • Papers with Code ID: trecqa
  • Pretty Name: Text Retrieval Conference Question Answering

Structure

Features

  • text (string): question text.
  • coarse_label (categorical): coarse categories, possible values:
    • ABBR (0): abbreviation.
    • ENTY (1): entity.
    • DESC (2): description/abstract concept.
    • HUM (3): human.
    • LOC (4): location.
    • NUM (5): numeric.
  • fine_label (categorical): fine categories, grouped under ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC (see original for full list).
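
The coarse label ids listed above can be turned into a small lookup table. The following is a minimal sketch, not part of the dataset's own tooling. The helper names (`coarse_name`, `coarse_from_fine`) are hypothetical, and the "<COARSE>:<subtype>" naming of fine labels is an assumption that should be checked against the actual label list.

```python
from typing import Dict

# Coarse label ids and names exactly as listed in the feature description.
COARSE_LABELS: Dict[int, str] = {
    0: "ABBR",  # abbreviation
    1: "ENTY",  # entity
    2: "DESC",  # description / abstract concept
    3: "HUM",   # human
    4: "LOC",   # location
    5: "NUM",   # numeric
}

# Reverse lookup: category name -> integer id.
COARSE_IDS: Dict[str, int] = {name: idx for idx, name in COARSE_LABELS.items()}


def coarse_name(label_id: int) -> str:
    """Return the coarse category name for an integer label id."""
    try:
        return COARSE_LABELS[label_id]
    except KeyError:
        raise ValueError(f"unknown coarse label id: {label_id}") from None


def coarse_from_fine(fine_name: str) -> int:
    """Map a fine label name to the id of its coarse category.

    ASSUMPTION: fine label names are written as "<COARSE>:<subtype>",
    for example "LOC:city". Verify this against the real label names.
    """
    prefix, sep, _ = fine_name.partition(":")
    if not sep or prefix not in COARSE_IDS:
        raise ValueError(f"cannot derive coarse label from {fine_name!r}")
    return COARSE_IDS[prefix]
```

Keeping the mapping in one dictionary and deriving the reverse lookup from it avoids the two tables drifting apart.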

Splits

  • train: 5,452 samples
  • test: 500 samples
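
A quick way to confirm that a downloaded copy matches these split sizes is sketched below. It assumes the Hugging Face `datasets` package and that the repository id is `CogComp/trec`, as in the title of this card. The function names are hypothetical, and the download step is kept inside a function so the size check itself runs offline.

```python
from typing import Dict, List

# Split sizes as stated in this card.
EXPECTED_SPLITS: Dict[str, int] = {"train": 5452, "test": 500}


def split_mismatches(actual: Dict[str, int]) -> List[str]:
    """Return one message per split whose size differs from the card."""
    problems = []
    for name, expected in EXPECTED_SPLITS.items():
        found = actual.get(name)
        if found != expected:
            problems.append(f"{name}: expected {expected}, found {found}")
    return problems


def load_split_sizes(repo_id: str = "CogComp/trec") -> Dict[str, int]:
    """Download the dataset and return the number of rows per split.

    Needs network access and the `datasets` package, so the import is
    kept local to this function. The repo id is an assumption taken
    from the card title.
    """
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


# Offline check using the numbers from the card itself:
print(split_mismatches({"train": 5452, "test": 500}))  # []
```

With network access, `split_mismatches(load_split_sizes())` should return an empty list if the hosted copy still matches the card.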

Dataset Creation

Summary

  • Training set: 5,452 labeled questions
  • Test set: 500 labeled questions
  • Coarse categories: 6
  • Fine categories: 50
  • Average sentence length: 10 tokens
  • Vocabulary size: 8,700

Sources

  • 4,500 questions from USC (Hovy et al., 2001)
  • ~500 manually created questions for rare classes
  • 894 questions from TREC‑8 and TREC‑9
  • 500 TREC‑10 questions as test set

Citation

@inproceedings{li-roth-2002-learning,
    title = "Learning Question Classifiers",
    author = "Li, Xin and Roth, Dan",
    booktitle = "COLING 2002",
    year = "2002",
    url = "https://www.aclweb.org/anthology/C02-1150",
}
@inproceedings{hovy-etal-2001-toward,
    title = "Toward Semantics-Based Answer Pinpointing",
    author = "Hovy, Eduard and Gerber, Laurie and ...",
    booktitle = "First International Conference on Human Language Technology Research",
    year = "2001",
    url = "https://www.aclweb.org/anthology/H01-1069",
}


Source

Organization: hugging_face

Created: Unknown
