Cornell Movie Dialogs Corpus

The Cornell Movie Dialogs Corpus is a collection of fictional conversations extracted from movie scripts. Its range of movies, characters, and conversational styles makes it well suited for training and evaluating dialogue agents.

Source

GitHub

Created

Oct 1, 2024

Updated

Oct 1, 2024

Overview

Dataset description and usage context

Context‑Aware‑Chatbot‑Using‑DialoGPT

Dataset Description

Dataset Name

Cornell Movie Dialogs Corpus

Data Source

https://www.kaggle.com/datasets/rajathmc/cornell-moviedialog-corpus

Data Content

Number of movies: 617
Number of characters: 9,035 (forming 10,292 conversing character pairs)
Number of dialogues: ~83,000
Number of utterances: ~304,000
Language: primarily English

Data Structure

movie_lines.txt: One utterance per row, with the fields:
- Line ID: unique identifier for each utterance
- Character ID: identifier of the speaking character
- Movie ID: identifier of the movie
- Character Name: name of the character
- Utterance Text: actual dialogue line

movie_conversations.txt: One conversation per row. Each row names the two characters involved and the movie, then gives the ordered list of line IDs (from movie_lines.txt) that make up the dialogue.
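
The two files described above can be joined to recover each conversation as an ordered list of utterance texts. The following is a minimal sketch, not the repository's own loader. It assumes that fields are separated by the string " +++$+++ " (as in the original Cornell distribution) and that each row of movie_conversations.txt holds two character IDs, a movie ID, and a Python-style list of line IDs; check both against your downloaded copy. The function names `parse_lines` and `parse_conversations` and the sample rows are made up for illustration.

```python
import ast
from typing import Dict, Iterable, List

# Assumption: field separator used by the original Cornell distribution.
SEP = " +++$+++ "


def parse_lines(rows: Iterable[str]) -> Dict[str, dict]:
    """Map each line ID to its character ID, movie ID, name, and text."""
    lines = {}
    for row in rows:
        # maxsplit=4 keeps the utterance text intact even if it contains SEP.
        parts = row.rstrip("\n").split(SEP, 4)
        if len(parts) != 5:
            continue  # skip malformed or empty rows
        line_id, character_id, movie_id, character_name, text = parts
        lines[line_id] = {
            "character_id": character_id,
            "movie_id": movie_id,
            "character_name": character_name,
            "text": text,
        }
    return lines


def parse_conversations(rows: Iterable[str], lines: Dict[str, dict]) -> List[List[str]]:
    """Turn each conversation row into an ordered list of utterance texts."""
    conversations = []
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) != 4:
            continue  # skip malformed or empty rows
        # Assumption: the last field looks like "['L1', 'L2']".
        line_ids = ast.literal_eval(parts[3])
        # Keep only line IDs that were actually found in movie_lines.txt.
        conversations.append([lines[i]["text"] for i in line_ids if i in lines])
    return conversations


# Made-up sample rows in the assumed format (not taken from the corpus).
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi. How are you?",
]
sample_conversations = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']"]

lines = parse_lines(sample_lines)
conversations = parse_conversations(sample_conversations, lines)
print(conversations)  # [['Hello there.', 'Hi. How are you?']]
```

To run this on the real files, pass open file handles instead of the sample lists, for example `parse_lines(open("movie_lines.txt", encoding="iso-8859-1"))`. The encoding here is also an assumption; if decoding fails or names look garbled, try another encoding. Consecutive utterances within each returned conversation can then be paired as context and response for dialogue-model training.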

Dataset Size

The corpus totals about 20 MB and is distributed as plain‑text files.
