Cornell Movie Dialogs Corpus

The Cornell Movie Dialogs Corpus is a collection of fictional dialogues extracted from movie scripts. Because it covers a large and varied set of conversations, it is well suited to training and evaluating dialogue agents.

Updated 10/1/2024
github

Description

Context‑Aware‑Chatbot‑Using‑DialoGPT

Dataset Description

Dataset Name

Cornell Movie Dialogs Corpus

Data Source

https://www.kaggle.com/datasets/rajathmc/cornell-moviedialog-corpus

Data Content

  • Number of movies: 617
  • Number of characters: 9,035 (forming 10,292 conversing character pairs)
  • Number of dialogues: ~83,000
  • Number of utterances: ~304,000
  • Language: primarily English

Data Structure

  • movie_lines.txt: Contains individual utterance lines with fields:
    • Line ID: unique identifier for each utterance
    • Character ID: identifier of the speaking character
    • Movie ID: identifier of the movie
    • Character Name: name of the character
    • Utterance Text: actual dialogue line
  • movie_conversations.txt: Defines conversations by listing sequences of line IDs that form each dialogue between character pairs
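
The two files described above can be joined into (context, response) pairs for dialogue training. The following is a minimal sketch, not the repository's actual preprocessing. It assumes the field separator is ` +++$+++ `, as in the original Cornell distribution, and that the last field of movie_conversations.txt is a Python-style list of line IDs such as `['L194', 'L195']`. The function names are hypothetical.

```python
import ast

# Assumed field separator, as used in the original Cornell distribution.
SEP = " +++$+++ "


def parse_lines(rows):
    """Map each line ID to its fields from movie_lines.txt rows."""
    lines = {}
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) != 5:
            continue  # skip malformed rows
        line_id, character_id, movie_id, name, text = parts
        lines[line_id] = {
            "character_id": character_id,
            "movie_id": movie_id,
            "character_name": name,
            "text": text,
        }
    return lines


def parse_conversations(rows):
    """Return one list of line IDs per row of movie_conversations.txt."""
    conversations = []
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) != 4:
            continue  # skip malformed rows
        # The last field is assumed to look like "['L194', 'L195']".
        conversations.append(ast.literal_eval(parts[3]))
    return conversations


def context_response_pairs(lines, conversations):
    """Pair each utterance with the one that follows it in a conversation."""
    pairs = []
    for line_ids in conversations:
        for first, second in zip(line_ids, line_ids[1:]):
            if first in lines and second in lines:
                pairs.append((lines[first]["text"], lines[second]["text"]))
    return pairs
```

To run this on the real files, pass open file handles as `rows`. The files are commonly read with `encoding="iso-8859-1"`, though that is worth verifying against the copy you download.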

Dataset Size

The corpus totals about 20 MB in plain-text format.

Topics

Natural Language Processing
Dialogue Systems

Source

Organization: github

Created: 10/1/2024
