
IVLLab/MultiDialog

The dataset contains manually annotated metadata linking audio files with transcriptions, emotions, and other attributes. It supports tasks such as multimodal dialogue generation, automatic speech recognition, and text‑to‑speech conversion. The language is English, and a gold‑standard emotional dialogue subset is provided for studying emotion dynamics in conversations.

Updated 8/29/2024

Dataset Description

The dataset links each audio file with manually annotated metadata: transcription, emotion label, and other attributes. The MultiDialog video files are distributed separately; see the download link on the original dataset page.

Statistics

| Statistic | train | valid_freq | valid_rare | test_freq | test_rare | Total |
|---|---|---|---|---|---|---|
| # dialogues | 7,011 | 448 | 443 | 450 | 381 | 8,733 |
| # utterances | 151,645 | 8,516 | 9,556 | 9,811 | 8,331 | 187,859 |
| Avg. utterances per dialogue | 21.63 | 19.01 | 21.57 | 21.80 | 21.87 | 21.51 |
| Avg. utterance length (s) | 6.50 | 6.23 | 6.40 | 6.99 | 6.49 | 6.51 |
| Avg. dialogue length (min) | 2.34 | 1.97 | 2.28 | 2.54 | 2.36 | 2.33 |
| Total duration (h) | 273.93 | 14.74 | 17.00 | 19.04 | 15.01 | 339.71 |

Example Usage

The dataset includes train, test_freq, test_rare, valid_freq, and valid_rare splits. Example usage:

from datasets import load_dataset

# Load the valid_freq split. An authenticated Hugging Face session is required;
# on datasets >= 2.14, pass token=True instead of the deprecated use_auth_token=True.
MultiD = load_dataset("IVLLab/MultiDialog", "valid_freq", use_auth_token=True)

# Inspect the dataset structure
print(MultiD)

# Audio is decoded on access
audio_input = MultiD["valid_freq"][0]["audio"]  # first decoded audio sample
transcription = MultiD["valid_freq"][0]["value"]  # corresponding transcription
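Each decoded sample pairs a float32 waveform with its sampling rate. As an illustration, here is a minimal, standard-library-plus-NumPy sketch for writing such a sample to disk as 16-bit PCM (`save_wav` is a hypothetical helper; it assumes mono audio, as in the data instance shown later in this card):

```python
import wave

import numpy as np

def save_wav(audio: dict, out_path: str) -> None:
    """Write a decoded audio dict ({"array": float32, "sampling_rate": int}) as 16-bit PCM WAV."""
    samples = np.clip(audio["array"], -1.0, 1.0)   # guard against out-of-range samples
    pcm = (samples * 32767).astype(np.int16)       # float32 in [-1, 1] -> int16
    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)                         # assumes mono audio
        wf.setsampwidth(2)                         # 2 bytes per sample (16-bit)
        wf.setframerate(audio["sampling_rate"])
        wf.writeframes(pcm.tobytes())

# Usage: save_wav(MultiD["valid_freq"][0]["audio"], "sample.wav")
```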

Supported Tasks

  • Multimodal Dialogue Generation: Train end‑to‑end multimodal dialogue models.
  • Automatic Speech Recognition (ASR): Train ASR models.
  • Text‑to‑Speech (TTS): Train TTS models.
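All three tasks consume the same pairing of audio and transcription. A minimal sketch of a generator that turns any iterable of MultiDialog rows into (waveform, sampling rate, transcript) tuples for an ASR or TTS pipeline (`asr_pairs` is a hypothetical helper; field names follow the Data Structure section below):

```python
def asr_pairs(split):
    """Yield (waveform, sampling_rate, transcript) from an iterable of MultiDialog rows."""
    for example in split:
        audio = example["audio"]
        yield audio["array"], audio["sampling_rate"], example["value"]

# Usage: pairs = asr_pairs(MultiD["valid_freq"])
```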

Language

The dataset contains English audio and transcriptions.

Gold‑Standard Emotional Dialogue Subset

A gold‑standard emotional dialogue subset is provided for studying emotion dynamics. Dialogues with actors whose emotion accuracy exceeds 40 % are selected. Use the following actor IDs: a, b, c, e, f, g, i, j, and k.
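In the example file names, the trailing letter of the file stem (e.g. `..._0k.wav`) appears to encode the actor ID. Assuming that naming convention holds, a hedged sketch for filtering a split down to the gold-standard actors (`actor_id` and `is_gold` are hypothetical helpers):

```python
GOLD_ACTORS = {"a", "b", "c", "e", "f", "g", "i", "j", "k"}

def actor_id(file_name: str) -> str:
    """Extract the actor letter, assumed to be the last character of the file stem."""
    stem = file_name.rsplit(".", 1)[0]   # strip the ".wav" extension
    return stem[-1]

def is_gold(example: dict) -> bool:
    """True if the utterance comes from a gold-standard actor."""
    return actor_id(example["file_name"]) in GOLD_ACTORS

# Usage: gold = MultiD["valid_freq"].filter(is_gold)
```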

Data Structure

Data Instance

{
    "file_name": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav",
    "conv_id": "t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b",
    "utterance_id": 0,
    "from": "gpt",
    "audio": {
        "path": "/home/user/.cache/huggingface/datasets/downloads/extracted/cache_id/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5/t_152ee99a-fec0-4d37-87a8-b1510a9dc7e5_0i.wav",
        "array": array([...], dtype=float32),
        "sampling_rate": 16000
    },
    "value": "Are you a football fan?",
    "emotion": "Neutral",
    "original_full_path": "valid_freq/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b/t_ffa55df6-114d-4b36-87a1-7af6b8b63d9b_0k.wav"
}

Fields

  • file_name (str): Relative path of the audio sample within its split directory.
  • conv_id (str): Unique identifier for each dialogue.
  • utterance_id (float): Index of the utterance.
  • from (str): Source of the message (human or gpt).
  • audio (dict): Contains path, decoded audio array, and sampling_rate. In non‑stream mode the path points to a locally extracted file; in stream mode it is a relative path inside the archive.
  • value (str): Transcription of the utterance.
  • emotion (str): Emotion label of the utterance.
  • original_full_path (str): Relative path to the original audio file in the source dataset.

Emotion labels include: "Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad".
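As a quick sanity check on these labels, a small sketch that tallies the emotion distribution over any iterable of rows (`emotion_distribution` is a hypothetical helper, not part of the dataset's API):

```python
from collections import Counter

EMOTION_LABELS = ["Neutral", "Happy", "Fear", "Angry", "Disgusting", "Surprising", "Sad"]

def emotion_distribution(rows) -> Counter:
    """Count occurrences of each emotion label over an iterable of dataset rows."""
    return Counter(row["emotion"] for row in rows)

# Usage: emotion_distribution(MultiD["valid_freq"])
```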


Topics

Multimodal Dialogue
Sentiment Analysis

Source

Organization: hugging_face

