This dataset contains movie reviews with binary sentiment polarity labels and is intended as a benchmark for sentiment classification. The core of the dataset consists of 50,000 labeled reviews, split evenly into 25,000 training and 25,000 test samples. Labels are balanced overall (25,000 positive and 25,000 negative). An additional 50,000 unlabeled documents are provided for unsupervised learning. No more than 30 reviews are included for any one movie, and the training and test sets contain disjoint sets of movies, so that models cannot gain from memorizing movie-specific terms and their associated labels. In the labeled training and test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10, so reviews with more neutral scores (5 or 6) are excluded. In the unsupervised set, reviews of all scores are included, with equal numbers of reviews scoring above 5 and at most 5.
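
The score thresholds described above can be written as a small Python sketch. It is only an illustration and not code shipped with the dataset: the function names `label_from_score` and `unsup_bucket` and the string labels `"neg"` and `"pos"` are hypothetical, and the code assumes that scores are integers from 1 to 10.

```python
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a review score to the binary label of the labeled sets.

    Hypothetical helper: assumes integer scores on a 1-10 scale.
    Returns None for scores that fall between the two thresholds.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 4:
        return "neg"  # negative reviews: score <= 4/10
    if score >= 7:
        return "pos"  # positive reviews: score >= 7/10
    return None  # scores 5 and 6 are not part of the labeled sets


def unsup_bucket(score: int) -> str:
    """Bucket used to balance the unsupervised set (above 5 vs. at most 5)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    return "above_5" if score > 5 else "at_most_5"
```

Note that the two rules split the scale differently: a score of 6 has no label in the training and test sets, yet counts as "above 5" when balancing the unsupervised set.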