High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

bigbio/genia_term_corpus

The GENIA Term Corpus supports the recognition of entities of interest in molecular biology, such as proteins, genes, and cells, a fundamental task in biomedical text mining. The GENIA technical term annotations cover physical biological entities as well as other important terminology. The corpus annotates 1,999 abstracts from the main GENIA corpus.

hugging_face

google/civil_comments

Online Comment Analysis

Text Mining

The dataset comprises publicly available comments from the Civil Comments platform, a commenting plugin for independent news websites. The comments were written between 2015 and 2017 and appeared on approximately 50 English-language news sites worldwide. When Civil Comments shut down in 2017, the public comments were preserved in a permanent open archive for future research. The original data include the comment text and some associated metadata (e.g., article ID, timestamp, and the commenter‑generated “civil” label) but omit user IDs. Jigsaw extended this dataset with toxicity and identity‑mention labels. This dataset is an exact replica of the data used in Jigsaw’s “Unintended Bias in Toxicity Classification” challenge on Kaggle. Both the dataset and the underlying comment texts are released under the CC0 license.

hugging_face

orieg/elsevier-oa-cc-by

Academic Research

Text Mining

The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles from Elsevier journals, released under a CC‑BY license and spanning multiple disciplines. The articles were published between 2014 and 2020 and are classified into 27 mid‑level ASJC codes. The dataset supports NLP tasks such as fill‑mask, summarization, and text classification. Its fields include document ID, metadata, abstract, body text, bibliography entries, and author highlights. The corpus is intended to facilitate NLP and ML research by providing a large, multidisciplinary collection of full‑text articles.

hugging_face

Kpop-lyric-datasets

Korean Pop Music

Text Mining

A JSON‑format dataset comprising 25,696 Korean pop songs, sourced from Melon's monthly charts (2000 – October 2023). The dataset includes Python functions for data processing and emphasizes copyright attribution and usage restrictions.

github

reddit_dataset_149

Social Media Analysis

Text Mining

This dataset is part of Bittensor Subnet 13, a decentralized network, and contains pre‑processed Reddit data. Network miners update the data continuously, providing a real‑time Reddit content stream suitable for various analysis and machine‑learning tasks. The dataset includes Reddit posts and comments with fields such as text, label, data type, community name, timestamp, anonymized username, and anonymized URL. Although primarily English, it may contain multilingual content. The dataset is released under an MIT license and is subject to Reddit's terms of use. Users should be aware of potential biases, variation in data quality, and temporal bias.

hugging_face
