Large Movie Review Dataset v1.0

This dataset contains movie reviews and their associated binary sentiment polarity labels, and is intended as a benchmark for sentiment classification. The core of the dataset consists of 50,000 labeled reviews, split evenly into 25,000 training and 25,000 test samples, with balanced labels (25,000 positive and 25,000 negative overall). An additional 50,000 unlabeled documents are provided for unsupervised learning. No more than 30 reviews per movie are included, and the training and test sets share no movies, so a classifier cannot gain accuracy by memorizing movie-specific terms and their associated labels. In the labeled training/test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10. The unsupervised set includes reviews of all scores, with equal numbers scoring above 5 and at most 5.

Source: GitHub
Created: Oct 18, 2019
Updated: May 28, 2020

Dataset Overview

Dataset Name

Large Movie Review Dataset v1.0

Dataset Purpose

The dataset contains movie reviews with binary sentiment polarity labels and serves as a benchmark for sentiment classification.

Dataset Content

Dataset Structure

  • Core Dataset: Contains 50,000 labeled reviews, split into 25,000 training and 25,000 test samples, with balanced labels (25,000 positive and 25,000 negative overall).
  • Unsupervised Learning Dataset: An additional 50,000 unlabeled documents.

Dataset Characteristics

  • Each movie contributes at most 30 reviews, which limits correlation among reviews of the same movie.
  • No overlap of movies between training and test sets, preventing performance gains from memorizing movie‑specific terminology and label associations.
  • In the labeled training/test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10.
  • In the unsupervised set, reviews of all scores are included, with equal numbers of reviews scoring above 5 and at most 5.
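
The score thresholds above can be written as a small mapping from rating to label. This is a minimal sketch, not code shipped with the dataset: the function name `polarity_from_rating` and the "pos"/"neg" return values are hypothetical, and it assumes ratings are integers on the 1-10 scale.

```python
from typing import Optional


def polarity_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to the binary label used in the labeled sets.

    Hypothetical helper: returns "neg" for scores <= 4, "pos" for scores >= 7,
    and None for scores in between, which match neither threshold.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None  # 5 and 6 fall outside both label thresholds


# Usage: only ratings that clear a threshold receive a label.
labels = {r: polarity_from_rating(r) for r in (2, 5, 8)}
```

Returning None rather than raising for mid-range scores keeps the helper usable on the unsupervised set, where reviews of all scores appear.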

File Structure

Organization

  • Training and Test Sets: Correspond to [train/] and [test/] directories.
  • Label Categories: Each directory contains [pos/] and [neg/] subdirectories for positive and negative reviews.
  • File Naming: Review text files follow the [[id]_[rating].txt] convention, where [id] is a unique identifier and [rating] is a 1‑10 star rating.
  • URL Files: Files named [urls_[pos, neg, unsup].txt] contain the IMDb URL for each review.
  • Feature Files: Provide already-tokenized bag‑of‑words features, stored as [.feat] files in LIBSVM format.
  • Vocabulary: The [imdb.vocab] file stores text tokens corresponding to feature indices.
  • Expected Rating File: [imdbEr.txt] contains the expected rating for each token in [imdb.vocab].
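
The layout above can be read with a few lines of standard-library Python. The following is a sketch under stated assumptions, not an official loader: the helper names (`parse_review_filename`, `iter_reviews`, `parse_libsvm_line`) are hypothetical, review files are assumed to be UTF-8 text, the [id] part is assumed to contain no underscore, and each .feat line is assumed to follow the general LIBSVM shape `label index:value index:value ...`.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Matches the [id]_[rating].txt naming convention; id is kept as a string.
FILENAME_RE = re.compile(r"^(?P<id>[^_]+)_(?P<rating>\d+)\.txt$")


def parse_review_filename(name: str) -> Tuple[str, int]:
    """Split a review file name into (id, rating)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a review file name: {name!r}")
    return match.group("id"), int(match.group("rating"))


def iter_reviews(root: Path, split: str) -> Iterator[Tuple[str, str, int, str]]:
    """Yield (label, id, rating, text) for every review in train/ or test/."""
    for label in ("pos", "neg"):
        for path in sorted((root / split / label).glob("*.txt")):
            review_id, rating = parse_review_filename(path.name)
            yield label, review_id, rating, path.read_text(encoding="utf-8")


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one LIBSVM-format line into (label, {feature_index: value})."""
    label, *pairs = line.split()
    features: Dict[int, float] = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features
```

For example, `list(iter_reviews(Path("path/to/dataset"), "train"))` would collect the labeled training reviews, where `path/to/dataset` is a placeholder for wherever the archive was extracted. Feature indices returned by `parse_libsvm_line` can then be looked up in [imdb.vocab] to recover the corresponding tokens.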

Citation Information

  • When using this dataset, please cite the ACL 2011 paper that introduced it, "Learning Word Vectors for Sentiment Analysis" (Maas et al.), which also reports baseline classification results for comparison.