Dataset asset | Open Source Community | Natural Language Processing | Dialogue Systems
Cornell Movie Dialogs Corpus
The Cornell Movie Dialogs Corpus is a collection of fictional conversations extracted from movie scripts. It spans 617 movies and roughly 304,000 utterances across many characters, which makes it well suited for training and evaluating dialogue agents.
Source
github
Created
Oct 1, 2024
Updated
Oct 1, 2024
Overview
Dataset description and usage context
Context‑Aware‑Chatbot‑Using‑DialoGPT
Dataset Description
Dataset Name
Cornell Movie Dialogs Corpus
Data Source
https://www.kaggle.com/datasets/rajathmc/cornell-moviedialog-corpus
Data Content
- Number of movies: 617
- Number of characters: 9,035 (forming 10,292 character pairs)
- Number of dialogues: ~83,000
- Number of utterances: ~304,000
- Language: primarily English
Data Structure
- movie_lines.txt: Contains individual utterance lines with fields:
  - Line ID: unique identifier for each utterance
  - Character ID: identifier of the speaking character
  - Movie ID: identifier of the movie
  - Character Name: name of the character
  - Utterance Text: actual dialogue line
- movie_conversations.txt: Defines conversations by listing sequences of line IDs that form each dialogue between character pairs
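The two files can be joined by line ID to recover each dialogue as an ordered list of turns. The following is a minimal sketch, not the project's own loader. It assumes that fields are separated by the string ` +++$+++ ` and that the last field of each movie_conversations.txt record is a Python-style list literal such as `['L1', 'L2']`. The function names and the sample records are hypothetical and used only for illustration.

```python
import ast

SEP = " +++$+++ "  # assumed field separator, not stated on this page

LINE_FIELDS = ("line_id", "character_id", "movie_id", "character_name", "text")


def parse_line(raw):
    """Split one movie_lines.txt record into a dict keyed by LINE_FIELDS."""
    # Limit the split so separators inside the utterance text are left alone.
    parts = raw.rstrip("\n").split(SEP, len(LINE_FIELDS) - 1)
    # Pad short records so every key is present.
    parts += [""] * (len(LINE_FIELDS) - len(parts))
    return dict(zip(LINE_FIELDS, parts))


def parse_conversation(raw):
    """Return the ordered line IDs from one movie_conversations.txt record."""
    # The last field is assumed to be a list literal, e.g. ['L1', 'L2'].
    return ast.literal_eval(raw.rstrip("\n").split(SEP)[-1])


def build_dialogues(line_records, conversation_records):
    """Resolve each conversation's line IDs to (character_name, text) turns."""
    lines = {}
    for raw in line_records:
        rec = parse_line(raw)
        lines[rec["line_id"]] = rec
    dialogues = []
    for raw in conversation_records:
        ids = parse_conversation(raw)
        # Skip IDs that have no matching utterance record.
        turns = [(lines[i]["character_name"], lines[i]["text"]) for i in ids if i in lines]
        dialogues.append(turns)
    return dialogues


# Illustrative records only, not copied from the corpus.
sample_lines = [
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Fine. You?",
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ How are you?",
]
sample_conversations = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']"]

for turn in build_dialogues(sample_lines, sample_conversations)[0]:
    print(turn)
```

When reading the real files, the text may not be valid UTF-8, so opening them with an explicit encoding (for example `encoding="iso-8859-1"`) or `errors="replace"` may be necessary. Turn order comes from the ID list in movie_conversations.txt, not from the order of records in movie_lines.txt.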
Dataset Size
The corpus is about 20 MB in total and is distributed as plain text.