MM-Conv is a multimodal conversational dataset for virtual humans, created at the Royal Institute of Technology, Sweden. It records dialogues between participants in the AI2-THOR simulator using VR headsets, comprising 6.7 hours of synchronized speech, motion capture, facial expressions, and gaze data. Virtual reality and motion capture were combined during collection to keep the recordings rich and well structured. The dataset primarily supports gesture-generation models in 3D scenes, targeting more natural gesture synthesis and better spatial understanding in task-oriented scenarios.

The dataset also provides manually annotated metadata linking each audio file to its transcription, emotion labels, and other attributes. This supports tasks such as multimodal dialogue generation, automatic speech recognition, and text-to-speech synthesis. The language is English, and a gold-standard emotional-dialogue subset is included for studying emotion dynamics in conversations.
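To illustrate the kind of audio-to-annotation linkage described above, here is a minimal sketch of working with such metadata records. The field names (`audio_file`, `transcription`, `emotion`, `speaker`) are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical metadata record illustrating the annotation schema described
# above; all field names are assumptions, not MM-Conv's actual schema.
sample = {
    "audio_file": "session_01/utt_0042.wav",
    "transcription": "Could you move the mug to the table?",
    "emotion": "neutral",
    "speaker": "participant_A",
}

def group_by_emotion(records):
    """Bucket annotated utterances by their emotion label."""
    buckets = {}
    for rec in records:
        buckets.setdefault(rec["emotion"], []).append(rec["audio_file"])
    return buckets

print(group_by_emotion([sample]))
```

Grouping by emotion label in this way would be a natural first step when extracting the gold-standard emotional-dialogue subset for studying emotion dynamics.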