Reuters-21578 Text Categorization Collection
A collection of Reuters newswire documents for text classification research, released in 1999.
Updated 5/24/2019
Description
NLP_Dataset Overview
1. Text Classification
- Reuters‑21578 Text Categorization Collection (1999)
- Large Movie Review Dataset v1.0 (2011)
- Datasets for single‑label text categorization (2007)
2. Question Answering Systems
- Stanford Question Answering Dataset (SQuAD)
- DeepMind Question Answering Corpus
- Amazon question/answer data
3. Speech Recognition
- TIMIT Acoustic‑Phonetic Continuous Speech Corpus
- VoxForge
- LibriSpeech ASR corpus
4. Machine Translation
- Aligned Hansards of the 36th Parliament of Canada Release 2001‑1a
- European Parliament Proceedings Parallel Corpus 1996‑2011
5. Document Summarization
- The AQUAINT Corpus of English News Text
- Legal Case Reports Data Set
6. More Datasets
Biomedical Domain
- Mutation extraction
- MutationFinder (MF)
- Extractor of Mutations (EMU)
- tmVar
All data sources: http://infos.korea.ac.kr/bronco/PublicCorpus.zip
Topics
Text Classification
Source
Organization: GitHub
Created: 5/18/2019