IVLLab/MultiDialog
Dataset Description
The dataset contains manually annotated metadata linking audio files with transcriptions, emotion labels, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition (ASR), and text‑to‑speech (TTS). The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations. For access to the MultiDialog video files, please download here.
Statistics
| | train | valid_freq | valid_rare | test_freq | test_rare | Total |
|---|---|---|---|---|---|---|
| # dialogues | 7,011 | 448 | 443 | 450 | 381 | 8,733 |
| # utterances | 151,645 | 8,516 | 9,556 | 9,811 | 8,331 | 187,859 |
| Avg utterances per dialogue | 21.63 | 19.01 | 21.57 | 21.80 | 21.87 | 21.51 |
| Avg utterance length (s) | 6.50 | 6.23 | 6.40 | 6.99 | 6.49 | 6.51 |
| Avg dialogue length (min) | 2.34 | 1.97 | 2.28 | 2.54 | 2.36 | 2.33 |
| Total duration (h) | 273.93 | 14.74 | 17.00 | 19.04 | 15.01 | 339.71 |
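As a sanity check, the Total column can be reproduced by summing the per-split figures from the table above (the small gap in total duration is rounding):

```python
# Per-split figures copied from the statistics table above.
dialogues = {"train": 7011, "valid_freq": 448, "valid_rare": 443,
             "test_freq": 450, "test_rare": 381}
utterances = {"train": 151645, "valid_freq": 8516, "valid_rare": 9556,
              "test_freq": 9811, "test_rare": 8331}
duration_h = {"train": 273.93, "valid_freq": 14.74, "valid_rare": 17.00,
              "test_freq": 19.04, "test_rare": 15.01}

print(sum(dialogues.values()))              # 8733 dialogues in total
print(sum(utterances.values()))             # 187859 utterances in total
print(round(sum(duration_h.values()), 2))   # ~339.72 h (table reports 339.71)
```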
Example Usage
The dataset includes `train`, `test_freq`, `test_rare`, `valid_freq`, and `valid_rare` splits. Example usage:

```python
from datasets import load_dataset

# Load the valid_freq split (requires authentication)
MultiD = load_dataset("IVLLab/MultiDialog", "valid_freq", use_auth_token=True)

# Inspect the dataset structure
print(MultiD)

# Dynamically load an audio sample and its transcription
audio_input = MultiD["valid_freq"][0]["audio"]    # first decoded audio sample
transcription = MultiD["valid_freq"][0]["value"]  # corresponding transcription
```
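Since every row is a single utterance carrying a `conv_id` and `utterance_id`, full dialogues can be reconstructed by grouping rows. A minimal sketch on mock records (the field names match the card; the sample values are invented for illustration):

```python
from itertools import groupby

# Mock records mimicking the dataset's flat utterance rows (values are invented).
records = [
    {"conv_id": "conv_a", "utterance_id": 0, "from": "human", "value": "Are you a football fan?"},
    {"conv_id": "conv_a", "utterance_id": 1, "from": "gpt", "value": "Yes, I love football."},
    {"conv_id": "conv_b", "utterance_id": 0, "from": "human", "value": "Hi there!"},
]

# Sort by (conv_id, utterance_id), then group consecutive rows by conv_id.
records.sort(key=lambda r: (r["conv_id"], r["utterance_id"]))
dialogues = {
    conv_id: [(r["from"], r["value"]) for r in rows]
    for conv_id, rows in groupby(records, key=lambda r: r["conv_id"])
}

print(len(dialogues["conv_a"]))  # 2 turns in the first dialogue
```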
Supported Tasks
- Multimodal Dialogue Generation: Train end‑to‑end multimodal dialogue models.
- Automatic Speech Recognition (ASR): Train ASR models.
- Text‑to‑Speech (TTS): Train TTS models.
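For ASR or TTS training, the variable-length decoded `audio` arrays typically need to be padded into a fixed-size batch. A minimal sketch with synthetic waveforms (the `pad_batch` helper is illustrative, not part of the dataset):

```python
import numpy as np

def pad_batch(waveforms):
    """Zero-pad a list of 1-D float32 waveforms to the longest length."""
    max_len = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), max_len), dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=bool)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
        mask[i, : len(w)] = True
    return batch, mask

# Synthetic stand-ins for decoded audio["array"] values at 16 kHz.
waves = [np.random.randn(16000).astype(np.float32),
         np.random.randn(8000).astype(np.float32)]
batch, mask = pad_batch(waves)
print(batch.shape)  # (2, 16000)
```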
Language
The dataset contains English audio and transcriptions.
Gold‑Standard Emotional Dialogue Subset
A gold‑standard emotional dialogue subset is provided for studying emotion dynamics. It consists of dialogues from actors whose emotion annotation accuracy exceeds 40%, corresponding to actor IDs a, b, c, e, f, g, i, j, and k.
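In the data instance below, the file name ends in `_0k.wav`, which suggests the trailing letter encodes the actor ID. Assuming that convention holds (verify it against the actual files), the gold subset can be selected like this:

```python
# Actor IDs listed in the card for the gold-standard emotional subset.
GOLD_ACTORS = {"a", "b", "c", "e", "f", "g", "i", "j", "k"}

def actor_id(file_name):
    """Assumed convention: the character just before '.wav' is the actor ID."""
    stem = file_name.rsplit(".", 1)[0]
    return stem[-1]

def is_gold(example):
    return actor_id(example["file_name"]) in GOLD_ACTORS

example = {"file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"}
print(is_gold(example))  # True: actor 'k' is in the gold list
```

With a loaded split, this predicate could be applied via `MultiD["valid_freq"].filter(is_gold)`.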
Data Structure
Data Instance
```python
{
    "file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav",
    "conv_id": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b",
    "utterance_id": 0,
    "from": "gpt",
    "audio": {
        "path": "/home/user/.cache/huggingface/datasets/downloads/extracted/cache_id/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5_0i.wav",
        "array": array([...], dtype=float32),
        "sampling_rate": 16000
    },
    "value": "Are you a football fan?",
    "emotion": "Neutral",
    "original_full_path": "valid_freq/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"
}
```
Fields
- `file_name` (str): Relative path of the audio sample within its split directory.
- `conv_id` (str): Unique identifier for each dialogue.
- `utterance_id` (float): Index of the utterance.
- `from` (str): Source of the message (`human` or `gpt`).
- `audio` (dict): Contains `path`, the decoded audio `array`, and `sampling_rate`. In non‑stream mode the path points to a locally extracted file; in stream mode it is a relative path inside the archive.
- `value` (str): Transcription of the utterance.
- `emotion` (str): Emotion label of the utterance.
- `original_full_path` (str): Relative path to the original audio file in the source dataset.
Emotion labels include: "Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad".
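For emotion classification, the string labels can be mapped to integer class IDs. A minimal sketch (the ordering below is arbitrary, not an official mapping):

```python
# The seven emotion labels listed in the card; index order is an arbitrary choice.
EMOTIONS = ["Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad"]
label2id = {label: i for i, label in enumerate(EMOTIONS)}
id2label = {i: label for label, i in label2id.items()}

print(label2id["Neutral"])  # 0
print(id2label[6])          # Sad
```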