IVLLab/MultiDialog
The dataset contains manually annotated metadata linking audio files with transcriptions, emotions, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition, and text‑to‑speech conversion. The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations.
Dataset Description
MultiDialog pairs audio files with manually annotated transcriptions, emotion labels, and other metadata. For access to the MultiDialog video files, please download here.
Statistics
| | train | valid_freq | valid_rare | test_freq | test_rare | Total |
|---|---|---|---|---|---|---|
| # dialogues | 7,011 | 448 | 443 | 450 | 381 | 8,733 |
| # utterances | 151,645 | 8,516 | 9,556 | 9,811 | 8,331 | 187,859 |
| Avg utterances per dialogue | 21.63 | 19.01 | 21.57 | 21.80 | 21.87 | 21.51 |
| Avg utterance length (s) | 6.50 | 6.23 | 6.40 | 6.99 | 6.49 | 6.51 |
| Avg dialogue length (min) | 2.34 | 1.97 | 2.28 | 2.54 | 2.36 | 2.33 |
| Total duration (h) | 273.93 | 14.74 | 17.00 | 19.04 | 15.01 | 339.71 |
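As a sanity check, the totals and the per-dialogue averages in the table can be recomputed from the raw counts (numbers copied from the table above):

```python
# Per-split counts copied from the statistics table above.
splits = {
    "train":      {"dialogues": 7011, "utterances": 151645},
    "valid_freq": {"dialogues": 448,  "utterances": 8516},
    "valid_rare": {"dialogues": 443,  "utterances": 9556},
    "test_freq":  {"dialogues": 450,  "utterances": 9811},
    "test_rare":  {"dialogues": 381,  "utterances": 8331},
}

# Column totals should match the "Total" column of the table.
total_dialogues = sum(s["dialogues"] for s in splits.values())
total_utterances = sum(s["utterances"] for s in splits.values())

# Average utterances per dialogue follows from the two counts.
avg_train = splits["train"]["utterances"] / splits["train"]["dialogues"]
```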
Example Usage
The dataset includes train, test_freq, test_rare, valid_freq, and valid_rare splits. Example usage:
```python
from datasets import load_dataset

# Loading requires being logged in to the Hugging Face Hub
MultiD = load_dataset("IVLLab/MultiDialog", "valid_freq", use_auth_token=True)

# Inspect structure
print(MultiD)

# Dynamically load an audio sample
audio_input = MultiD["valid_freq"][0]["audio"]    # first decoded audio sample
transcription = MultiD["valid_freq"][0]["value"]  # corresponding transcription
```
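Each decoded `audio` entry carries enough information to derive the clip duration (number of samples divided by the sampling rate). A minimal sketch, using a synthetic sample in place of the real `MultiD["valid_freq"][0]["audio"]`:

```python
# Hypothetical decoded sample mirroring the dataset's audio dict layout;
# in real use this would be MultiD["valid_freq"][0]["audio"].
sample = {
    "path": "example.wav",
    "array": [0.0] * (16000 * 3),  # stand-in for 3 s of 16 kHz audio
    "sampling_rate": 16000,
}

# Duration in seconds = number of samples / sampling rate.
duration_s = len(sample["array"]) / sample["sampling_rate"]
```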
Supported Tasks
- Multimodal Dialogue Generation: Train end‑to‑end multimodal dialogue models.
- Automatic Speech Recognition (ASR): Train ASR models.
- Text‑to‑Speech (TTS): Train TTS models.
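For ASR, the supervision signal is simply (waveform, transcription) pairs, which map directly onto the `audio` and `value` fields. A sketch with placeholder records that mimic the dataset schema (in practice you would iterate over a real split such as `MultiD["train"]`):

```python
# Synthetic records with the dataset's field names, standing in for real rows.
records = [
    {"audio": {"array": [0.0, 0.1], "sampling_rate": 16000},
     "value": "Are you a football fan?"},
    {"audio": {"array": [0.2, 0.3], "sampling_rate": 16000},
     "value": "Yes, I love it."},
]

# Pair each waveform with its transcription for ASR training.
asr_pairs = [(r["audio"]["array"], r["value"]) for r in records]
```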
Language
The dataset contains English audio and transcriptions.
Gold‑Standard Emotional Dialogue Subset
A gold‑standard emotional dialogue subset is provided for studying emotion dynamics. It contains dialogues from actors whose emotion annotation accuracy exceeds 40%, corresponding to actor IDs a, b, c, e, f, g, i, j, and k.
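Assuming the letter immediately before `.wav` in `file_name` encodes the actor ID (an inference from the example paths on this card, e.g. `..._0k.wav`), the gold‑standard subset can be selected like this:

```python
# Actor IDs of the gold-standard emotional dialogue subset.
GOLD_ACTORS = {"a", "b", "c", "e", "f", "g", "i", "j", "k"}

def is_gold(file_name: str) -> bool:
    """Return True if the file belongs to a gold-standard actor.

    Assumes the character before '.wav' is the actor ID, as in
    '..._0k.wav' (an inference from the examples on this card).
    """
    stem = file_name.rsplit(".", 1)[0]
    return stem[-1] in GOLD_ACTORS

# e.g. gold = MultiD["train"].filter(lambda ex: is_gold(ex["file_name"]))
```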
Data Structure
Data Instance
```python
{
  "file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav",
  "conv_id": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b",
  "utterance_id": 0,
  "from": "gpt",
  "audio": {
    "path": "/home/user/.cache/huggingface/datasets/downloads/extracted/cache_id/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5_0i.wav",
    "array": array([...], dtype=float32),
    "sampling_rate": 16000
  },
  "value": "Are you a football fan?",
  "emotion": "Neutral",
  "original_full_path": "valid_freq/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"
}
```
Fields
- `file_name` (str): Relative path of the audio sample within its split directory.
- `conv_id` (str): Unique identifier for each dialogue.
- `utterance_id` (float): Index of the utterance within the dialogue.
- `from` (str): Source of the message (`human` or `gpt`).
- `audio` (dict): Contains the `path`, the decoded audio `array`, and the `sampling_rate`. In non‑stream mode the path points to a locally extracted file; in stream mode it is a relative path inside the archive.
- `value` (str): Transcription of the utterance.
- `emotion` (str): Emotion label of the utterance.
- `original_full_path` (str): Relative path to the original audio file in the source dataset.
Emotion labels include: "Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad".
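To inspect the label balance of a split, the `emotion` column can be tallied directly; here with placeholder labels standing in for a real split such as `MultiD["train"]["emotion"]`:

```python
from collections import Counter

# The label set listed on this card.
EMOTIONS = {"Neutral", "Happy", "Fear", "Angry",
            "Disgusting", "Surprising", "Sad"}

# Placeholder labels; in practice, use MultiD["train"]["emotion"].
labels = ["Neutral", "Happy", "Neutral", "Sad"]
assert set(labels) <= EMOTIONS  # guard against unexpected labels

distribution = Counter(labels)
```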