Dataset asset · Classic Dataset · Text Classification
Reuters-21578 Text Categorization Collection
The Reuters‑21578 collection, a dataset for text classification research, released in 1999.
Source: GitHub
Created: May 18, 2019
Updated: May 24, 2019
Signals: 212 views
Availability: Linked source ready
Overview
Dataset description and usage context
NLP_Dataset Overview
1. Text Classification
- Reuters‑21578 Text Categorization Collection (1999)
- Large Movie Review Dataset v1.0 (2011)
- Datasets for single‑label text categorization (2007)
2. Question Answering Systems
- Stanford Question Answering Dataset (SQuAD)
- DeepMind Question Answering Corpus
- Amazon question/answer data
3. Speech Recognition
- TIMIT Acoustic‑Phonetic Continuous Speech Corpus
- VoxForge
- LibriSpeech ASR corpus
4. Machine Translation
- Aligned Hansards of the 36th Parliament of Canada Release 2001‑1a
- European Parliament Proceedings Parallel Corpus 1996‑2011
5. Document Summarization
- The AQUAINT Corpus of English News Text
- Legal Case Reports Data Set
6. More Datasets
Biomedical Domain
- Mutation extraction
- MutationFinder (MF)
- Extractor of Mutations (EMU)
- tmVar
All data sources: http://infos.korea.ac.kr/bronco/PublicCorpus.zip