Explore high-quality datasets for your AI and machine learning projects.
The YelpReviewFull dataset contains review data collected from the Yelp website, mainly used for sentiment classification tasks. It includes 650,000 training samples and 50,000 test samples, each with a text field and a label field, where the label indicates the review rating (1 to 5 stars). The dataset was created via crowdsourcing and is in English.
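A minimal loading sketch, assuming the canonical `yelp_review_full` repository on the Hugging Face Hub (the integer labels 0-4 correspond to 1-5 stars):

```python
from datasets import load_dataset

# Load the Yelp Review Full dataset; the repo ID is assumed to be the
# canonical Hub copy of this dataset.
yelp = load_dataset("yelp_review_full")
print(yelp)              # train: 650,000 rows, test: 50,000 rows

ex = yelp["train"][0]
print(ex["label"])       # integer 0-4, corresponding to 1-5 stars
print(ex["text"][:200])  # first 200 characters of the review text
```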
The dataset provides pre-tokenized training features (input_ids, attention_mask, and labels), each represented as integer sequences. The training split comprises 443,918 examples totaling approximately 1,004,301,614 bytes (about 1 GB); the download size is 235,069,151 bytes.
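For illustration, a minimal sketch of how such pre-tokenized columns are typically produced with the `datasets` and `transformers` libraries; the tokenizer checkpoint and toy rows are assumptions, not part of this card:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Toy corpus standing in for the real text; the checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw = Dataset.from_dict({"text": ["great product", "awful service"],
                         "label": [4, 0]})

def tokenize(batch):
    # Produces input_ids and attention_mask (plus token_type_ids for BERT).
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.rename_column("label", "labels")
print(tokenized.column_names)
# ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
```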
The PRIMATE dataset focuses on detecting anhedonia (loss of interest or pleasure) in mental-health contexts. Re-annotation by mental-health professionals provides finer-grained labels and textual evidence, revealing many false-positive cases in the original labels and yielding a higher-quality test set for anhedonia detection. The study highlights the need to address annotation quality in mental-health datasets and advocates improved methods to make NLP models for mental-health assessment more reliable. The dataset was created by extracting Reddit posts from the original PRIMATE collection and having mental-health professionals annotate them; only the labels are distributed, and the original post content is omitted. Access to the original PRIMATE dataset must therefore be requested first, after which the provided scripts can be used for label mapping.
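Since only labels ship with this release, rejoining them with the source posts might look like the following sketch; all file and field names here are hypothetical, as the card does not document the script interface:

```python
import json

# Hypothetical files: the original PRIMATE posts (requested separately)
# and the re-annotated anhedonia labels distributed with this dataset.
with open("primate_posts.json") as f:
    posts = {p["post_id"]: p["text"] for p in json.load(f)}

with open("anhedonia_labels.json") as f:
    labels = json.load(f)

# Keep only posts for which an expert label exists.
merged = [
    {"text": posts[row["post_id"]], "label": row["anhedonia"]}
    for row in labels if row["post_id"] in posts
]
print(len(merged))
```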
The Text Retrieval Conference (TREC) question classification dataset contains 5,500 training questions and 500 test questions. It provides six coarse‑grained categories and 50 fine‑grained categories. The questions originate from four sources: 4,500 English questions released by USC, ~500 manually constructed questions, 894 questions from TREC‑8 and TREC‑9, and 500 test questions from TREC‑10. All questions are manually labeled. The task is text classification, specifically multi‑class classification.
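A loading sketch, assuming the `trec` dataset on the Hugging Face Hub; the column names follow the current Hub version (earlier releases used `label-coarse`/`label-fine` instead):

```python
from datasets import load_dataset

trec = load_dataset("trec")
ex = trec["train"][0]
print(ex["text"])                            # the question
print(ex["coarse_label"], ex["fine_label"])  # 6 coarse / 50 fine classes
```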
The Sentiment140 dataset contains Twitter messages whose emoticons (e.g., ":)" and ":(") are used as noisy sentiment labels. It is primarily used for sentiment classification, containing 1,600,000 training instances and 498 test instances. Fields include text, date, user, sentiment, and query.
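A usage sketch, assuming the Hub copy of `sentiment140` (older script-based datasets may additionally require `trust_remote_code=True`); the 0/2/4 coding follows the original Sentiment140 documentation:

```python
from datasets import load_dataset

s140 = load_dataset("sentiment140", split="train")
code_to_name = {0: "negative", 2: "neutral", 4: "positive"}

ex = s140[0]
print(code_to_name[ex["sentiment"]], ex["text"])
```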
This dataset is intended for text‑classification tasks and contains two features: the text content and a label. Labels are binary, with 'neg' (negative) and 'pos' (positive). The data are split into training, validation, and test sets for model training, validation, and testing, respectively.
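The described schema can be expressed directly with `datasets` features; the toy rows below are illustrative only:

```python
from datasets import ClassLabel, Dataset, Features, Value

# Binary text-classification schema: a string plus a neg/pos class label.
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})
ds = Dataset.from_dict(
    {"text": ["dreadful film", "wonderful film"], "label": [0, 1]},
    features=features,
)
print(ds.features["label"].int2str(ds[0]["label"]))  # 'neg'
```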
The Yahoo Answers topic classification dataset is constructed using the 10 largest primary categories. Each category contains 140,000 training samples and 6,000 test samples, totaling 1,400,000 training samples and 60,000 test samples. The dataset files include classes.txt, train.csv, and test.csv, where each sample has four columns: category index, question title, question content, and best answer.
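Since the CSV files have no header row, reading them might look like this sketch (the file path is an assumption; column names follow the card's description):

```python
import pandas as pd

cols = ["class_index", "question_title", "question_content", "best_answer"]
train = pd.read_csv("train.csv", names=cols)

# Each of the 10 categories should contribute 140,000 training rows.
print(train["class_index"].value_counts())
```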
The dataset includes three primary features: 'sentence' (string), 'label' (categorical with two classes: 0 for negative sentiment, 1 for positive sentiment), and 'idx' (integer index). The training set has 68,221 samples, the validation set 872 samples, and the test set 1,821 samples. Total download size is 3,403,184 bytes; total dataset size is 5,110,747 bytes.
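The feature names and the validation/test sizes match GLUE's SST-2 configuration, so loading it as such is a reasonable assumption:

```python
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2["validation"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```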
The Fake News Opensources dataset is a curated and cleaned version of the OpenSources fake-news collection, containing 5,915,569 articles divided into 12 categories. It is suitable for text-classification tasks such as topic classification and fact-checking. The dataset is monolingual (English) and released under the Apache-2.0 license. Each record includes fields such as id, type, domain, scraped_at, url, authors, title, and content.
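Given the collection's size, streaming is a sensible way to work with it; the repo ID below is a placeholder, and the category values shown are examples from the OpenSources taxonomy:

```python
from datasets import load_dataset

# Placeholder repo ID; streaming avoids downloading all 5,915,569 articles.
ds = load_dataset("<user>/fake-news-opensources", split="train",
                  streaming=True)

# 'type' holds the category label, e.g. 'fake', 'satire', 'reliable'.
fake_only = ds.filter(lambda row: row["type"] == "fake")
for article in fake_only.take(3):
    print(article["domain"], article["title"])
```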
The dataset includes the features conversations, ID, source, category, and subcategory. Each turn in the conversations feature contains a sender and content; ID, source, category, and subcategory are strings. The dataset is split into a training set (4,378 samples) and a test set (100 samples). The total download size is 5,009,583 bytes and the overall dataset size is 10,866,328 bytes.
The WNUT 17 dataset is a named entity recognition (NER) dataset focusing on identifying novel and rare entities in noisy text. It includes training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, a token list, and IOB2-formatted NER labels covering entities such as companies, creative works, groups, locations, persons, and products. The dataset was created to support the detection of emerging and rare entities, which are difficult to recognize in noisy user-generated text.
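Decoding the integer tags back to IOB2 strings, assuming the `wnut_17` copy on the Hub:

```python
from datasets import load_dataset

wnut = load_dataset("wnut_17")
tag_names = wnut["train"].features["ner_tags"].feature.names  # e.g. 'B-person'

ex = wnut["train"][0]
print(list(zip(ex["tokens"], [tag_names[t] for t in ex["ner_tags"]])))
```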
The Reuters-21578 text classification collection is a standard corpus for text classification research; it was released in 1999.
The dataset comprises three configurations: contextual, copyright, and standard. Each configuration has specific features and splits. The contextual configuration includes `prompt`, `context`, and `category` fields; the copyright configuration includes `prompt` and `tags`; the standard configuration includes `prompt` and `category`. Training set sizes and sample counts differ for each configuration.
The Amazon Review Polarity dataset contains product reviews from Amazon, primarily for text-classification tasks, especially sentiment classification. Reviews rated 1-2 are labeled negative, 4-5 positive, and rating 3 is omitted. The dataset includes 3,600,000 training samples and 400,000 test samples; each record comprises a review title, review content, and a label (positive or negative). It was created by Xiang Zhang and is widely used as a benchmark for text-classification research.
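A loading sketch, assuming the canonical `amazon_polarity` Hub repository:

```python
from datasets import load_dataset

amazon = load_dataset("amazon_polarity", split="train")
label_names = amazon.features["label"].names  # ['negative', 'positive']

ex = amazon[0]
print(label_names[ex["label"]], "|", ex["title"])
```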
The TNews-classification dataset contains three features: text (string), label (a class label with 15 categories, named by the numeric codes 100-104, 106-110, and 112-116), and idx (int32). It is split into a training set of 53,360 examples (4,245,677 bytes), a validation set of 10,000 examples (797,922 bytes), and a test set of 10,000 examples (810,970 bytes). The download size is 4,697,191 bytes and the total dataset size is 5,854,569 bytes.
danbooru-tags-2016-2023 is a dataset of Danbooru tags (English, CC0-1.0, 1M-10M rows) intended for text-classification, text-generation, and text2text-generation tasks. It was generated using the [danbooru](https://danbooru.donmai.us/) and [safebooru](https://safebooru.donmai.us/) APIs and ships two configurations, `all` and `safe`, each with a single train split and the features id, copyright, character, artist, general, meta, rating, score, and created_at. The `all` subset contains 4,601,557 rows (2,507,757,369 bytes; 991,454,905 bytes download) and the `safe` subset 1,186,490 rows (approximately 646,613,536 bytes; 247,085,114 bytes download). The dataset was created with the following conditions:

|Subset name|`all`|`safe`|
|-|-|-|
|API Endpoint|https://danbooru.donmai.us|https://safebooru.donmai.us|
|Date|`2016-01-01..2023-12-31`|`2016-01-01..2023-12-31`|
|Score|`>0`|`>0`|
|Rating|`g,s,q,e`|`g`|
|Filetype|`png,jpg,webp`|`png,jpg,webp`|
|Size (number of rows)|4,601,557|1,186,490|

Usage:

```
pip install datasets
```

```py
from datasets import load_dataset

dataset = load_dataset(
    "isek-ai/danbooru-tags-2016-2023",
    "safe",  # or "all"
    split="train",
)

print(dataset)
print(dataset[0])
# Dataset({
#     features: ['id', 'copyright', 'character', 'artist', 'general', 'meta', 'rating', 'score', 'created_at'],
#     num_rows: 1186490
# })
# {'id': 2229839, 'copyright': 'kara no kyoukai', 'character': 'ryougi shiki', 'artist': 'momoko (momopoco)', 'general': '1girl, 2016, :|, brown eyes, brown hair, closed mouth, cloud, cloudy sky, dated, day, flower, hair flower, hair ornament, japanese clothes, kimono, long hair, long sleeves, looking at viewer, new year, obi, outdoors, sash, shrine, sky, solo, standing, wide sleeves', 'meta': 'commentary request, partial commentary', 'rating': 'g', 'score': 76, 'created_at': '2016-01-01T00:43:18.369+09:00'}
```
This is a Yahoo Answers topic-classification dataset for text-classification tasks. It contains 1.4 million training examples and 60,000 test examples. Each example includes a question title, question content, the best answer, and the corresponding topic label. The ten topic categories include Society & Culture, Science & Mathematics, and Health. The dataset is monolingual English.
This dataset contains erotic stories that have been cleaned, deduplicated, and decontaminated, intended for training text-filtering classifiers. The data originate from the Hugging Face datasets bluuwhale/nsfwstory and bluuwhale/nsfwstory2. The dataset comprises 49,579 samples, and the downloaded Parquet file is 646 MB.
CrowS-Pairs is a challenge dataset for evaluating social bias in masked language models. It contains 1,508 test examples, each comprising a pair of sentences, one more stereotyping and one less stereotyping. The dataset covers various bias types such as race, gender, and religion. The sentence pairs were written and validated via crowdsourcing; a comparison set of naturally occurring sentences was drawn from ROCStories and the fiction portion of MNLI.
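An inspection sketch, assuming the `crows_pairs` copy on the Hub and its field names:

```python
from datasets import load_dataset

crows = load_dataset("crows_pairs", split="test")
ex = crows[0]
print(ex["sent_more"])  # the more stereotyping sentence
print(ex["sent_less"])  # the less stereotyping sentence
print(ex["bias_type"])  # bias category (e.g. race, gender, religion)
```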
Aegis AI Content Safety Dataset 1.0 is an open-source content safety dataset (CC-BY-4.0) that follows NVIDIA's content safety taxonomy, covering 13 critical risk categories. It includes approximately 11,000 human-annotated interaction records between humans and LLMs, split into 10,798 training samples and 1,199 test samples. The prompts are drawn from Anthropic HH-RLHF, with responses generated by Mistral-7B-v0.1; annotation was carried out by 12 annotators and 2 data-quality assurance personnel. The dataset is intended for building content-moderation safeguards and aligning LLMs to generate safe responses, but it is not suitable for training dialogue agents. Its creation involved strict quality assurance and annotator training to ensure diversity and accuracy.
The labelled_vi_ko_raw_text dataset includes three primary features: src (source text), tgt (target text), and classifier_labels (classification labels). It consists of a single training split of 40,000 samples, with a total data size of 9,844,626 bytes and a download size of 5,466,676 bytes.