IVLLab/MultiDialog
The dataset contains manually annotated metadata linking audio files with transcriptions, emotions, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition, and text‑to‑speech conversion. The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations.
Dataset Description
MultiDialog pairs audio files with manually annotated transcriptions, emotion labels, and other metadata. For access to the MultiDialog video files, please download here.
Statistics
| | train | valid_freq | valid_rare | test_freq | test_rare | Total |
|---|---|---|---|---|---|---|
| # dialogues | 7,011 | 448 | 443 | 450 | 381 | 8,733 |
| # utterances | 151,645 | 8,516 | 9,556 | 9,811 | 8,331 | 187,859 |
| Avg utterances per dialogue | 21.63 | 19.01 | 21.57 | 21.80 | 21.87 | 21.51 |
| Avg utterance length (s) | 6.50 | 6.23 | 6.40 | 6.99 | 6.49 | 6.51 |
| Avg dialogue length (min) | 2.34 | 1.97 | 2.28 | 2.54 | 2.36 | 2.33 |
| Total duration (h) | 273.93 | 14.74 | 17.00 | 19.04 | 15.01 | 339.71 |
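As a sanity check, the totals and the per-dialogue averages in the table can be recomputed from the raw counts (numbers copied from the table above):

```python
# Per-split counts copied from the statistics table above.
splits = {
    "train":      {"dialogues": 7011, "utterances": 151645},
    "valid_freq": {"dialogues": 448,  "utterances": 8516},
    "valid_rare": {"dialogues": 443,  "utterances": 9556},
    "test_freq":  {"dialogues": 450,  "utterances": 9811},
    "test_rare":  {"dialogues": 381,  "utterances": 8331},
}

# Column totals should match the "Total" column of the table.
total_dialogues = sum(s["dialogues"] for s in splits.values())
total_utterances = sum(s["utterances"] for s in splits.values())

# Average utterances per dialogue follows from the two counts.
avg_train = splits["train"]["utterances"] / splits["train"]["dialogues"]
```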
Example Usage
The dataset includes train, test_freq, test_rare, valid_freq, and valid_rare splits. Example usage:
```python
from datasets import load_dataset

# Loading requires being logged in to the Hugging Face Hub
MultiD = load_dataset("IVLLab/MultiDialog", "valid_freq", use_auth_token=True)

# Inspect structure
print(MultiD)

# Dynamically load an audio sample
audio_input = MultiD["valid_freq"][0]["audio"]    # first decoded audio sample
transcription = MultiD["valid_freq"][0]["value"]  # corresponding transcription
```
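Each decoded `audio` entry carries enough information to derive the clip duration (number of samples divided by the sampling rate). A minimal sketch, using a synthetic sample in place of the real `MultiD["valid_freq"][0]["audio"]`:

```python
# Hypothetical decoded sample mirroring the dataset's audio dict layout;
# in real use this would be MultiD["valid_freq"][0]["audio"].
sample = {
    "path": "example.wav",
    "array": [0.0] * (16000 * 3),  # stand-in for 3 s of 16 kHz audio
    "sampling_rate": 16000,
}

# Duration in seconds = number of samples / sampling rate.
duration_s = len(sample["array"]) / sample["sampling_rate"]
```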
Supported Tasks
- Multimodal Dialogue Generation: Train end‑to‑end multimodal dialogue models.
- Automatic Speech Recognition (ASR): Train ASR models.
- Text‑to‑Speech (TTS): Train TTS models.
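For ASR, the supervision signal is simply (waveform, transcription) pairs, which map directly onto the `audio` and `value` fields. A sketch with placeholder records that mimic the dataset schema (in practice you would iterate over a real split such as `MultiD["train"]`):

```python
# Synthetic records with the dataset's field names, standing in for real rows.
records = [
    {"audio": {"array": [0.0, 0.1], "sampling_rate": 16000},
     "value": "Are you a football fan?"},
    {"audio": {"array": [0.2, 0.3], "sampling_rate": 16000},
     "value": "Yes, I love it."},
]

# Pair each waveform with its transcription for ASR training.
asr_pairs = [(r["audio"]["array"], r["value"]) for r in records]
```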
Language
The dataset contains English audio and transcriptions.
Gold‑Standard Emotional Dialogue Subset
A gold‑standard emotional dialogue subset is provided for studying emotion dynamics. It contains dialogues from actors whose emotion annotation accuracy exceeds 40%, corresponding to actor IDs a, b, c, e, f, g, i, j, and k.
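Assuming the letter immediately before `.wav` in `file_name` encodes the actor ID (an inference from the example paths on this card, e.g. `..._0k.wav`), the gold‑standard subset can be selected like this:

```python
# Actor IDs of the gold-standard emotional dialogue subset.
GOLD_ACTORS = {"a", "b", "c", "e", "f", "g", "i", "j", "k"}

def is_gold(file_name: str) -> bool:
    """Return True if the file belongs to a gold-standard actor.

    Assumes the character before '.wav' is the actor ID, as in
    '..._0k.wav' (an inference from the examples on this card).
    """
    stem = file_name.rsplit(".", 1)[0]
    return stem[-1] in GOLD_ACTORS

# e.g. gold = MultiD["train"].filter(lambda ex: is_gold(ex["file_name"]))
```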
Data Structure
Data Instance
```python
{
  "file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav",
  "conv_id": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b",
  "utterance_id": 0,
  "from": "gpt",
  "audio": {
    "path": "/home/user/.cache/huggingface/datasets/downloads/extracted/cache_id/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5_0i.wav",
    "array": array([...], dtype=float32),
    "sampling_rate": 16000
  },
  "value": "Are you a football fan?",
  "emotion": "Neutral",
  "original_full_path": "valid_freq/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"
}
```
Fields
- `file_name` (str): Relative path of the audio sample within its split directory.
- `conv_id` (str): Unique identifier for each dialogue.
- `utterance_id` (float): Index of the utterance within the dialogue.
- `from` (str): Source of the message (`human` or `gpt`).
- `audio` (dict): Contains the `path`, the decoded audio `array`, and the `sampling_rate`. In non‑stream mode the path points to a locally extracted file; in stream mode it is a relative path inside the archive.
- `value` (str): Transcription of the utterance.
- `emotion` (str): Emotion label of the utterance.
- `original_full_path` (str): Relative path to the original audio file in the source dataset.
Emotion labels include: "Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad".
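To inspect the label balance of a split, the `emotion` column can be tallied directly; here with placeholder labels standing in for a real split such as `MultiD["train"]["emotion"]`:

```python
from collections import Counter

# The label set listed on this card.
EMOTIONS = {"Neutral", "Happy", "Fear", "Angry",
            "Disgusting", "Surprising", "Sad"}

# Placeholder labels; in practice, use MultiD["train"]["emotion"].
labels = ["Neutral", "Happy", "Neutral", "Sad"]
assert set(labels) <= EMOTIONS  # guard against unexpected labels

distribution = Counter(labels)
```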