Dataset asset, Classic Dataset, Text Classification

Reuters-21578 Text Categorization Collection

A collection of Reuters newswire documents used for text classification research, released in 1999.

Source: GitHub
Created: May 18, 2019
Updated: May 24, 2019
Overview

NLP_Dataset Overview

1. Text Classification

  • Reuters‑21578 Text Categorization Collection (1999)
  • Large Movie Review Dataset v1.0 (2011)
  • Datasets for single‑label text categorization (2007)

2. Question Answering Systems

  • Stanford Question Answering Dataset (SQuAD)
  • DeepMind Question Answering Corpus
  • Amazon question/answer data

3. Speech Recognition

  • TIMIT Acoustic‑Phonetic Continuous Speech Corpus
  • VoxForge
  • LibriSpeech ASR corpus

4. Machine Translation

  • Aligned Hansards of the 36th Parliament of Canada Release 2001‑1a
  • European Parliament Proceedings Parallel Corpus 1996‑2011

5. Document Summarization

  • The AQUAINT Corpus of English News Text
  • Legal Case Reports Data Set

6. More Datasets

Biomedical Domain

  • Mutation extraction
    • MutationFinder (MF)
    • Extractor of Mutations (EMU)
    • tmVar

All data sources: http://infos.korea.ac.kr/bronco/PublicCorpus.zip
