Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category: Text Mining

bigbio/genia_term_corpus

Bioinformatics, Text Mining

The GENIA Term Corpus focuses on recognizing entities of interest in molecular biology such as proteins, genes, and cells, which is a fundamental task in biomedical text mining. The GENIA technical term annotations cover physical biological entities as well as other important terminology. The annotations cover 1,999 abstracts from the main GENIA corpus.

Source: hugging_face | Updated: Dec 22, 2022 | 265 views | Linked

google/civil_comments

Online Comment Analysis, Text Mining

The dataset comprises publicly available comments from the Civil Comments platform, which served as a commenting plugin for independent news websites. These comments were created between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data include the comment text and some associated metadata (e.g., article ID, timestamp, and the commenter-generated “civil” label) but omit user IDs. Jigsaw extended this dataset with toxicity and identity-mention labels. This dataset is an exact replica of the data used in Jigsaw’s “Unintended Bias in Toxicity Classification” challenge on Kaggle. Both the dataset and the underlying comment texts are released under the CC0 license.

Source: hugging_face | Updated: Jan 25, 2024 | 330 views | Linked

orieg/elsevier-oa-cc-by

Academic Research, Text Mining

The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles released under a CC‑BY license, covering multiple disciplines from Elsevier journals. The articles were published between 2014 and 2020 and are classified into 27 mid‑level ASJC codes. The dataset supports various NLP tasks such as fill‑mask, summarization, and text classification. It includes fields such as document ID, metadata, abstract, body text, bibliography entries, and author highlights. The corpus is intended to facilitate NLP and ML research by providing a large, multidisciplinary collection of full‑text articles.

Source: hugging_face | Updated: Jul 1, 2022 | 121 views | Linked

Kpop-lyric-datasets

Korean Pop Music, Text Mining

A JSON-format dataset of 25,696 Korean pop songs sourced from Melon's monthly charts (2000 to October 2023). The dataset includes Python functions for data processing and emphasizes copyright attribution and usage restrictions.

Source: github | Updated: Dec 3, 2023 | 599 views | Linked

reddit_dataset_149

Social Media Analysis, Text Mining

This dataset is part of Bittensor Subnet 13, a decentralized network, and contains pre-processed Reddit data. The data is continuously updated by network miners, providing a real-time Reddit content stream suitable for various analysis and machine-learning tasks. The dataset includes Reddit posts and comments with fields such as text, label, data type, community name, timestamp, anonymized username, and anonymized URL. While primarily English, it may contain multilingual content. The dataset is released under an MIT license and is subject to Reddit's terms of use. Users should be aware of potential biases, varying data quality, and temporal bias.

Source: huggingface | Updated: Nov 30, 2024 | 118 views | Linked