Large Movie Review Dataset v1.0

This dataset contains movie reviews and their associated binary sentiment polarity labels, and is intended as a benchmark for sentiment classification. The core of the dataset consists of 50,000 labeled reviews, split evenly into 25,000 training and 25,000 test samples. Labels are balanced overall (25,000 positive and 25,000 negative). An additional 50,000 unlabeled documents are provided for unsupervised learning. No more than 30 reviews per movie are included, and the training and test sets contain disjoint sets of movies, so a model cannot gain accuracy by memorizing movie-specific terms and their associated labels. In the labeled training/test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10. The unsupervised set includes reviews of all scores, with equal numbers of reviews scoring above 5 and scoring 5 or below.

Updated 5/28/2020

Description

Dataset Overview

Dataset Name

Large Movie Review Dataset v1.0

Dataset Purpose

The dataset serves as a benchmark for sentiment classification. It contains movie reviews labeled with binary sentiment polarity.

Dataset Content

Dataset Structure

Core Dataset: Contains 50,000 labeled reviews, split evenly into 25,000 training and 25,000 test samples. Labels are balanced overall (25,000 positive and 25,000 negative).
Unsupervised Learning Dataset: An additional 50,000 unlabeled documents.

Dataset Characteristics

At most 30 reviews per movie are included, because reviews of the same movie tend to have correlated ratings.
The training and test sets contain disjoint sets of movies, so performance cannot be inflated by memorizing movie-specific terms and their associated labels.
In the labeled training/test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10, so reviews with scores of 5 or 6 do not appear in these sets.
The unsupervised set includes reviews of all scores, with equal numbers of reviews scoring above 5 and scoring 5 or below.
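
The score thresholds above can be written as a small helper. This is a minimal sketch, not code shipped with the dataset: the function name polarity_from_rating and the choice to return None for scores of 5 or 6 are assumptions made for illustration.

```python
from typing import Optional


def polarity_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 review score to the polarity label of the labeled sets.

    Hypothetical helper: returns "neg" for scores <= 4, "pos" for scores >= 7,
    and None for 5 or 6, which fall outside both labeled classes.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1..10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None
```

A helper like this is mainly useful for sanity-checking that the score in a file name agrees with the [pos/] or [neg/] directory the file sits in.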

File Structure

Organization

Training and Test Sets: Correspond to [train/] and [test/] directories.
Label Categories: Each directory contains [pos/] and [neg/] subdirectories for positive and negative reviews.
File Naming: Review text files follow the [[id]_[rating].txt] convention, where [id] is a unique identifier and [rating] is the review's star rating on a 1-10 scale.
URL Files: Contain the IMDb URL for each review, in files named [urls_[pos, neg, unsup].txt].
Feature Files: Provide tokenized bag‑of‑words features stored as .feat files in LIBSVM format.
Vocabulary: The [imdb.vocab] file stores text tokens corresponding to feature indices.
Expected Rating File: [imdbEr.txt] contains the expected rating for each token in [imdb.vocab].
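
The layout described above can be read with a few lines of Python. The following is a sketch under stated assumptions rather than an official loader: the function names are hypothetical, the review files are assumed to be UTF-8 text, the [id] part of a file name is assumed to contain no underscore, and each .feat line is assumed to follow the usual LIBSVM shape of a leading label followed by index:value pairs.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Matches the [id]_[rating].txt naming convention (assumes no "_" inside id).
_FILENAME_RE = re.compile(r"^([^_]+)_(\d+)\.txt$")


def parse_review_filename(name: str) -> Tuple[str, int]:
    """Split a review file name into (id, rating)."""
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an [id]_[rating].txt file name: {name!r}")
    return match.group(1), int(match.group(2))


def iter_labeled_reviews(split_dir: Path) -> Iterator[Tuple[str, str, int, str]]:
    """Yield (label, id, rating, text) for every review under pos/ and neg/.

    split_dir is the train/ or test/ directory; UTF-8 encoding is an assumption.
    """
    for label in ("pos", "neg"):
        for path in sorted((split_dir / label).glob("*.txt")):
            review_id, rating = parse_review_filename(path.name)
            yield label, review_id, rating, path.read_text(encoding="utf-8")


def parse_feat_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one LIBSVM-format line: '<label> <index>:<value> ...'."""
    head, *pairs = line.split()
    features: Dict[int, float] = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return float(head), features
```

For example, iterating over iter_labeled_reviews(Path("train")) walks the training split, assuming the current working directory is the dataset root. Feature indices returned by parse_feat_line refer to tokens in [imdb.vocab].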

Citation Information

When using this dataset, please cite the ACL 2011 conference paper that introduced the dataset and provided baseline classification results for comparison.

Topics

Emotion Classification

Movie Reviews

Source

Organization: github

Created: 10/18/2019
