
IVLLab/MultiDialog

The dataset contains manually annotated metadata linking audio files with transcriptions, emotions, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition, and text‑to‑speech conversion. The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations.

Source
Hugging Face
Created
Nov 28, 2025
Updated
Aug 29, 2024
Overview

Dataset description and usage context

Dataset Description

MultiDialog pairs each audio file with manually annotated metadata covering its transcription, emotion label, and other attributes. For access to the MultiDialog video files, please download here.

Statistics

|                             | train   | valid_freq | valid_rare | test_freq | test_rare | Total   |
|-----------------------------|---------|------------|------------|-----------|-----------|---------|
| # dialogues                 | 7,011   | 448        | 443        | 450       | 381       | 8,733   |
| # utterances                | 151,645 | 8,516      | 9,556      | 9,811     | 8,331     | 187,859 |
| Avg utterances per dialogue | 21.63   | 19.01      | 21.57      | 21.80     | 21.87     | 21.51   |
| Avg utterance length (s)    | 6.50    | 6.23       | 6.40       | 6.99      | 6.49      | 6.51    |
| Avg dialogue length (min)   | 2.34    | 1.97       | 2.28       | 2.54      | 2.36      | 2.33    |
| Total duration (h)          | 273.93  | 14.74      | 17.00      | 19.04     | 15.01     | 339.71  |
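The reported statistics are internally consistent; for instance, a quick sketch checking the train column (total duration in minutes divided by the number of dialogues should match the reported average dialogue length):

```python
# Consistency sketch for the train split statistics reported above.
total_duration_h = 273.93   # total duration (h)
n_dialogues = 7_011         # number of dialogues

# Average dialogue length in minutes.
avg_dialogue_min = total_duration_h * 60 / n_dialogues  # ~2.34, matching the table
```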

Example Usage

The dataset includes train, test_freq, test_rare, valid_freq, and valid_rare splits. Example usage:

from datasets import load_dataset

MultiD = load_dataset("IVLLab/MultiDialog", "valid_freq", use_auth_token=True)

# Inspect structure
print(MultiD)

# Dynamically load an audio sample
audio_input = MultiD["valid_freq"][0]["audio"]  # First decoded audio sample
transcription = MultiD["valid_freq"][0]["value"]  # Corresponding transcription
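Since each row is a single utterance, reconstructing whole conversations means grouping rows by `conv_id` and ordering them by `utterance_id`. A minimal offline sketch, using hypothetical stand-in records shaped like the fields documented in the Data Structure section (no download or authentication needed):

```python
from collections import defaultdict

# Stand-in records mimicking MultiD["valid_freq"] rows; the conv_id values
# here are made up for illustration.
records = [
    {"conv_id": "t_aaa", "utterance_id": 1, "from": "gpt", "value": "Yes, I watch every weekend."},
    {"conv_id": "t_aaa", "utterance_id": 0, "from": "human", "value": "Are you a football fan?"},
    {"conv_id": "t_bbb", "utterance_id": 0, "from": "human", "value": "Hello there."},
]

# Group utterances into dialogues keyed by conversation ID.
dialogues = defaultdict(list)
for rec in records:
    dialogues[rec["conv_id"]].append(rec)

# Sort each dialogue by utterance index so turns stay in order.
for turns in dialogues.values():
    turns.sort(key=lambda r: r["utterance_id"])
```

The same loop works unchanged over a real split, e.g. `for rec in MultiD["valid_freq"]: ...`.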

Supported Tasks

  • Multimodal Dialogue Generation: Train end‑to‑end multimodal dialogue models.
  • Automatic Speech Recognition (ASR): Train ASR models.
  • Text‑to‑Speech (TTS): Train TTS models.

Language

The dataset contains English audio and transcriptions.

Gold‑Standard Emotional Dialogue Subset

A gold‑standard emotional dialogue subset is provided for studying emotion dynamics. It keeps dialogues from actors whose emotion accuracy exceeds 40%, corresponding to actor IDs a, b, c, e, f, g, i, j, and k.
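One way to select the gold-standard subset is to filter on the actor ID. The example file name in the Data Structure section (`..._0k.wav`) suggests the character just before the extension encodes the actor ID; that encoding is an assumption inferred from the example, not documented behavior, so verify it against your download:

```python
# Gold-standard actor IDs listed in the card.
GOLD_ACTORS = {"a", "b", "c", "e", "f", "g", "i", "j", "k"}

def actor_id(file_name: str) -> str:
    """Presumed actor ID: last character of the file stem (assumption)."""
    stem = file_name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return stem[-1]

def is_gold(file_name: str) -> bool:
    """True if the utterance's presumed actor is in the gold-standard set."""
    return actor_id(file_name) in GOLD_ACTORS
```

With a loaded split this could be applied as `MultiD["valid_freq"].filter(lambda rec: is_gold(rec["file_name"]))`.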

Data Structure

Data Instance

{
    "file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav",
    "conv_id": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b",
    "utterance_id": 0,
    "from": "gpt",
    "audio": {
        "path": "/home/user/.cache/huggingface/datasets/downloads/extracted/cache_id/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5_0i.wav",
        "array": array([...], dtype=float32),
        "sampling_rate": 16000
    },
    "value": "Are you a football fan?",
    "emotion": "Neutral",
    "original_full_path": "valid_freq/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"
}

Fields

  • file_name (str): Relative path of the audio sample within its split directory.
  • conv_id (str): Unique identifier for each dialogue.
  • utterance_id (float): Index of the utterance within its dialogue.
  • from (str): Source of the message (human or gpt).
  • audio (dict): Contains path, decoded audio array, and sampling_rate. In non‑stream mode the path points to a locally extracted file; in stream mode it is a relative path inside the archive.
  • value (str): Transcription of the utterance.
  • emotion (str): Emotion label of the utterance.
  • original_full_path (str): Relative path to the original audio file in the source dataset.

Emotion labels include: "Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad".
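Given the `emotion` field, tallying the label distribution of a split is a one-liner. A self-contained sketch over hypothetical stand-in records shaped like the Data Instance above:

```python
from collections import Counter

# Hypothetical records standing in for rows of a loaded split.
records = [
    {"value": "Are you a football fan?", "emotion": "Neutral"},
    {"value": "I love it!", "emotion": "Happy"},
    {"value": "Oh no.", "emotion": "Sad"},
    {"value": "Really?", "emotion": "Neutral"},
]

# Count how often each emotion label occurs.
emotion_counts = Counter(rec["emotion"] for rec in records)
```

On a real split, replace `records` with e.g. `MultiD["valid_freq"]` to see how heavily the corpus skews toward "Neutral".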
