Cornell Movie Dialogs Corpus
The Cornell Movie Dialogs Corpus is a collection of fictional dialogues extracted from movie scripts. Due to its richness and diversity, it is well suited for training and evaluating dialogue agents.
Updated 10/1/2024
Description
Context‑Aware‑Chatbot‑Using‑DialoGPT
Dataset Description
Dataset Name
Cornell Movie Dialogs Corpus
Data Source
https://www.kaggle.com/datasets/rajathmc/cornell-moviedialog-corpus
Data Content
- Number of movies: 617
- Number of characters: 9,035 (forming 10,292 conversing character pairs)
- Number of dialogues: ~83,000
- Number of utterances: ~304,000
- Language: primarily English
Data Structure
- movie_lines.txt: contains one utterance per line, with the fields:
  - Line ID: unique identifier for each utterance
  - Character ID: identifier of the speaking character
  - Movie ID: identifier of the movie
  - Character Name: name of the character
  - Utterance Text: the dialogue line itself
- movie_conversations.txt: defines each conversation as an ordered sequence of line IDs exchanged between a pair of characters
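The two files can be joined to rebuild each conversation as a list of utterance texts. The following is a minimal sketch, not the dataset's official loader. It assumes that fields are separated by " +++$+++ " (the delimiter used in the original Cornell release; the Kaggle copy may differ) and that the last field of each movie_conversations.txt row is a Python-style list literal of line IDs. The sample rows and the names parse_lines and parse_conversations are invented for illustration.

```python
import ast

# Assumption: fields are separated by " +++$+++ ", as in the original Cornell release.
SEP = " +++$+++ "
LINE_FIELDS = ("line_id", "character_id", "movie_id", "character_name", "text")


def parse_lines(rows):
    """Map each line ID to a dict holding the five movie_lines.txt fields."""
    lines = {}
    for row in rows:
        # Split at most 4 times so any separator-like text stays in the utterance.
        parts = row.rstrip("\n").split(SEP, 4)
        if len(parts) != len(LINE_FIELDS):
            continue  # skip rows that do not have all five fields
        record = dict(zip(LINE_FIELDS, parts))
        lines[record["line_id"]] = record
    return lines


def parse_conversations(rows, lines):
    """Turn each conversation row into an ordered list of utterance texts."""
    conversations = []
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        # Assumption: the last field looks like "['L1', 'L2']".
        line_ids = ast.literal_eval(parts[-1])
        conversations.append([lines[i]["text"] for i in line_ids if i in lines])
    return conversations


# Invented sample rows in the assumed format (not taken from the corpus).
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi. How are you?",
]
sample_conversations = ["u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']"]

lines = parse_lines(sample_lines)
conversations = parse_conversations(sample_conversations, lines)
print(conversations)
```

To run this on the real files, pass an open file handle instead of the sample lists. The original release is commonly read with encoding="iso-8859-1"; check this against the copy you download.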
Dataset Size
About 20 MB in total, stored as plain text.
Topics
Natural Language Processing
Dialogue Systems
Source
Organization: github
Created: 10/1/2024