Large Movie Review Dataset v1.0
This dataset contains movie reviews with binary sentiment polarity labels and is intended as a benchmark for sentiment classification. The core of the dataset consists of 50,000 reviews, split evenly into 25,000 training and 25,000 test samples with balanced labels (25,000 positive and 25,000 negative overall). An additional 50,000 unlabeled documents are provided for unsupervised learning. No more than 30 reviews per movie are included, and the training and test sets share no movies, so a model cannot gain from memorizing movie‑specific terms and their associated labels. In the labeled training/test sets, negative reviews have scores ≤ 4/10 and positive reviews have scores ≥ 7/10. In the unsupervised set, reviews of all scores are included, with equal numbers of reviews scoring above 5 and at most 5.
Description
Dataset Overview
Dataset Name
Large Movie Review Dataset v1.0
Dataset Purpose
The dataset is used as a benchmark for sentiment classification, containing movie reviews with binary sentiment polarity labels.
Dataset Content
Dataset Structure
- Core Dataset: Contains 50,000 reviews, split into 25,000 training and 25,000 test samples with balanced labels (25,000 positive and 25,000 negative overall).
- Unsupervised Learning Dataset: An additional 50,000 unlabeled documents.
Dataset Characteristics
- A maximum of 30 reviews per movie to avoid inter‑review correlation.
- No overlap of movies between training and test sets, preventing performance gains from memorizing movie‑specific terminology and label associations.
- In the labeled training/test sets, negative reviews have scores ≤ 4/10, positive reviews have scores ≥ 7/10.
- In the unsupervised set, reviews of all scores are included, with equal numbers of reviews scoring above 5 and at most 5.
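The score thresholds listed above amount to a simple labeling rule. The following is a minimal sketch of that rule; the function name polarity_from_rating is hypothetical and not part of the dataset distribution. Ratings of 5 and 6 fall between the two thresholds, so the sketch returns None for them.

```python
from typing import Optional


def polarity_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 rating to a polarity label using the labeled-set thresholds.

    Hypothetical helper: returns "neg" for ratings <= 4, "pos" for ratings >= 7,
    and None for ratings in between, which match neither threshold.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None
```

For example, polarity_from_rating(8) gives "pos", while polarity_from_rating(5) gives None.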
File Structure
Organization
- Training and Test Sets: Correspond to the [train/] and [test/] directories.
- Label Categories: Each directory contains [pos/] and [neg/] subdirectories for positive and negative reviews.
- File Naming: Review text files follow the [[id]_[rating].txt] convention, where [id] is a unique identifier and [rating] is a 1‑10 star rating.
- URL Files: Contain the IMDb URL for each review and are named [urls_[pos, neg, unsup].txt].
- Feature Files: Provide tokenized bag‑of‑words features stored as .feat files in LIBSVM format.
- Vocabulary: The [imdb.vocab] file stores the text tokens corresponding to feature indices.
- Expected Rating File: [imdbEr.txt] contains the expected rating for each token in [imdb.vocab].
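The layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader. The helper names (parse_review_filename, iter_labeled_reviews, parse_libsvm_line) are hypothetical, and the sketch assumes that [id] is a non-negative integer, that review files are UTF-8 text, and that each feature line follows the usual LIBSVM shape of a leading value followed by index:value pairs.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Matches names such as "200_8.txt", i.e. [id]_[rating].txt.
# Assumption: [id] is a non-negative integer.
_REVIEW_NAME = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(name: str) -> Tuple[int, int]:
    """Split a review file name into (id, rating)."""
    match = _REVIEW_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected review file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def iter_labeled_reviews(split_dir: Path) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (label, id, rating, text) for every review under pos/ and neg/.

    split_dir is a train/ or test/ directory from the layout described above.
    """
    for label in ("pos", "neg"):
        for path in sorted((split_dir / label).glob("*.txt")):
            review_id, rating = parse_review_filename(path.name)
            # Assumption: review files are UTF-8 encoded.
            yield label, review_id, rating, path.read_text(encoding="utf-8")


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one LIBSVM-format line: '<value> <index>:<count> ...'."""
    fields = line.split()
    features: Dict[int, float] = {}
    for item in fields[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return float(fields[0]), features
```

Assuming the archive has been extracted into the current directory, iter_labeled_reviews(Path("train")) would walk the training reviews, and each feature index returned by parse_libsvm_line would refer to a token position in [imdb.vocab].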
Citation Information
- When using this dataset, please cite the ACL 2011 conference paper that introduced it and reported baseline classification results for comparison: Maas et al., "Learning Word Vectors for Sentiment Analysis" (ACL 2011).
Source
Organization: github
Created: 10/18/2019