Explore high-quality datasets for your AI and machine learning projects.
This is a multi‑label emotion classification dataset based on the GoEmotions label taxonomy. It was annotated with custom tags by a team of 12 engineers. Evaluation results for three models (RoBERTa, BERT‑cased, and BERT‑uncased) on this dataset are also provided.
The YelpReviewFull dataset contains review data collected from the Yelp website, mainly used for sentiment classification tasks. It includes 650,000 training samples and 50,000 test samples, each with a text field and a label field, where the label indicates the review rating (1 to 5 stars). The dataset was created via crowdsourcing and is in English.
The DEAP dataset is a physiological‑signal database for emotion analysis, primarily using EEG signals to assess emotional states, including arousal and valence.
XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data come from BBC articles written in British English and are primarily used for abstractive summarization. Each record has document, summary, and ID fields, and the data are randomly split into training, validation, and test sets. The creators are from the University of Edinburgh, and the license is CC BY‑SA 4.0.
This dataset contains hotel reviews and ratings collected from TripAdvisor. After processing, only the review text and multiple aspect scores are retained. Originally released by Jiwei Li et al., the processed data is provided as a single pandas DataFrame. It is primarily intended for aspect‑based sentiment analysis (ABSA). The dataset includes columns such as hotel ID, user ID, review title, review text, overall rating, cleanliness rating, and others.
The ISEAR (International Survey on Emotion Antecedents and Reactions) dataset, developed by the Swiss National Centre of Competence in Research (NCCR) Affective Sciences, collects survey reports of emotional antecedents and reactions and is suitable for text analysis and sentiment analysis.
NewsMTSC is a high‑quality dataset containing over 11k manually annotated sentences from English news articles. Each sentence is labeled by five human annotators and includes only examples where the annotators’ sentiment judgments are the same or similar. The dataset is split into two subsets (`rw` and `mt`), each containing training, validation, and test parts.
The Sentiment140 dataset contains Twitter messages in which emoticons serve as noisy sentiment labels. It is primarily used for sentiment classification and contains 1,600,000 training instances and 498 test instances. Fields include text, date, user, sentiment, and query.
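The noisy‑labeling idea behind Sentiment140 can be illustrated with a small sketch. The emoticon lists and the `noisy_label` helper below are hypothetical and chosen only for illustration; they are not the dataset's actual construction pipeline.

```python
from typing import Optional, Tuple

# Illustrative emoticon lists (assumption: not the lists used to build Sentiment140).
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def noisy_label(tweet: str) -> Optional[Tuple[str, str]]:
    """Return (cleaned_text, label), or None if the tweet has no usable emoticon signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons: skip the tweet.
        return None
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    # Collapse the whitespace left behind after removing emoticons.
    cleaned = " ".join(cleaned.split())
    return cleaned, ("positive" if has_pos else "negative")
```

Removing the emoticon from the text keeps a classifier from simply memorizing the labeling signal.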
The dataset contains manually annotated metadata linking audio files with transcriptions, emotions, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition, and text‑to‑speech conversion. The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations.
This dataset is intended for text‑classification tasks and contains two features: the text content and a label. Labels are binary, with 'neg' (negative) and 'pos' (positive). The data are split into training, validation, and test sets for model training, validation, and testing, respectively.
A synthetic lyrics dataset obtained via the Genius API and web crawling, annotated with theme, emotion, style, tone, and narrative using the Mistral API.
The dataset includes three primary features: 'sentence' (string), 'label' (categorical with two classes: 0 for negative sentiment, 1 for positive sentiment), and 'idx' (integer index). The training set has 68,221 samples, the validation set 872 samples, and the test set 1,821 samples. Total download size is 3,403,184 bytes; total dataset size is 5,110,747 bytes.
Aspect Sentiment Triplet Extraction v2 is designed for extracting triplets consisting of a target aspect, its associated sentiment, and the opinion span that explains the sentiment. It targets aspect‑based sentiment analysis (ABSA), which identifies aspects of target entities and the polarity expressed for each aspect. The data are derived from the SemEval 2014, 2015, and 2016 datasets and pre‑processed with spell correction and tokenization. The dataset includes training, validation, and test splits; each line contains an index, the text, start and end indices for the aspect and opinion spans, the aspect and opinion terms, and the sentiment class.
This dataset consists of user comments on various popular games, each paired with a sentiment label (negative or positive), the game name, and a rating. It is divided into training and test sets for potential sentiment analysis or game‑review research.
This dataset contains Yelp reviews for sentiment analysis and is used to compare the effectiveness of BERT and RoBERTa models on Yelp review sentiment classification.
The Amazon Review Polarity dataset contains product reviews from Amazon, primarily for text‑classification tasks, especially sentiment classification. Reviews rated 1‑2 are labeled negative, 4‑5 positive, and rating 3 is omitted. The dataset includes 3.6 M training samples and 0.4 M test samples; each record comprises a review title, content, and a label (positive or negative). It was created by Xiang Zhang and is widely used as a benchmark for text‑classification research.
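The polarity rule described for the Amazon Review Polarity dataset (ratings 1‑2 negative, 4‑5 positive, rating 3 omitted) can be sketched as follows. The function name and label strings are assumptions made for illustration, not the dataset's own encoding.

```python
from typing import Optional


def rating_to_polarity(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a polarity label; 3-star reviews yield None."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None  # rating 3 is omitted from the polarity dataset


# Hypothetical (rating, text) pairs used only to show the filtering step.
reviews = [(5, "Great"), (3, "Okay"), (1, "Broke")]
labeled = [
    (text, rating_to_polarity(stars))
    for stars, text in reviews
    if rating_to_polarity(stars) is not None
]
```

Dropping the neutral middle rating turns a five‑class problem into a cleaner binary one.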
The FinancialPhrasebank is a dataset of financial news sentences for sentiment classification. It contains 4,840 English sentences, each classified according to the agreement rate of 5–8 annotators. The dataset is provided in four configurations based on annotator agreement levels (50%, 66%, 75%, and 100%). It was created to address the lack of high‑quality training data for financial sentiment analysis. The sentences were annotated by 16 individuals with background knowledge of financial markets, including researchers and master's students. Use of the dataset is governed by the Creative Commons Attribution‑NonCommercial‑ShareAlike 3.0 Unported License.
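The agreement‑based configurations of the FinancialPhrasebank can be illustrated with a short sketch that computes the majority label and agreement rate for one sentence. It assumes the thresholds are inclusive (for example, 75% agreement qualifies for the 75% configuration); the helper names and configuration keys are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Agreement thresholds for the four configurations (assumed to be inclusive).
THRESHOLDS = {"50%": 0.50, "66%": 0.66, "75%": 0.75, "100%": 1.00}


def majority_and_agreement(votes: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of annotators who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def configurations_for(votes: List[str]) -> List[str]:
    """List the configurations whose agreement threshold this sentence meets."""
    _, agreement = majority_and_agreement(votes)
    return [name for name, minimum in THRESHOLDS.items() if agreement >= minimum]
```

For instance, a sentence labeled positive by 6 of 8 annotators has 75% agreement and would fall into the 50%, 66%, and 75% configurations but not the 100% one.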
The SMILE Twitter Emotion dataset was created by Wang et al. in 2016 and contains tweets annotated with multiple emotions (e.g., happiness, anger, sadness), providing a rich resource for sentiment analysis tasks.