MM-Conv
MM-Conv is a multimodal conversational dataset for virtual humans, created at KTH Royal Institute of Technology in Sweden. The dataset records dialogues between pairs of participants situated in the AI2-THOR simulator via VR headsets, comprising 6.7 hours of synchronized speech, motion capture, facial expression, and gaze data. By combining virtual reality with motion capture, the recording setup yields rich, well-structured multimodal data. The dataset primarily supports gesture generation models in 3D scenes, targeting two problems in task-oriented scenarios: generating more natural gestures and grounding spatial information.
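Because the speech, motion, face, and gaze streams are captured at different native rates, a typical first step for downstream use is resampling them onto a shared clock. Below is a minimal sketch of such alignment; the sampling rates, array shapes, and variable names are hypothetical and do not reflect the dataset's actual release format.

```python
import numpy as np

def align_to_clock(timestamps, values, clock):
    # For each clock tick, take the most recent sample at or before it
    # (zero-order hold); clip handles ticks before the first sample.
    idx = np.searchsorted(timestamps, clock, side="right") - 1
    idx = np.clip(idx, 0, len(timestamps) - 1)
    return values[idx]

# Hypothetical native rates: motion capture at 120 Hz, gaze at 90 Hz.
mocap_t = np.arange(0.0, 10.0, 1 / 120)
mocap = np.random.randn(len(mocap_t), 3)   # e.g. a wrist joint's (x, y, z)
gaze_t = np.arange(0.0, 10.0, 1 / 90)
gaze = np.random.randn(len(gaze_t), 3)     # gaze direction vectors

clock = np.arange(0.0, 10.0, 1 / 30)       # shared 30 Hz analysis clock
mocap_30 = align_to_clock(mocap_t, mocap, clock)
gaze_30 = align_to_clock(gaze_t, gaze, clock)
print(mocap_30.shape, gaze_30.shape)       # (300, 3) (300, 3)
```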
Description
MM-Conv: A Multi-Modal Conversational Dataset for Virtual Humans
Overview
MM-Conv is a multimodal conversational dataset for virtual humans, developed by Anna Deichler, Jim O'Regan, and Jonas Beskow at KTH Royal Institute of Technology and presented at the ECCV Multimodal Agents Workshop.
Authors
- Anna Deichler
- Jim O'Regan
- Jonas Beskow
Institution
KTH Royal Institute of Technology
Workshop
ECCV Multimodal Agents Workshop
Abstract
This paper introduces a new dataset that records dialogues between participants situated in a physical simulator (AI2-THOR) via VR head-mounted displays. The main goal is to expand the field of collaborative speech-gesture generation by incorporating rich contextual information in referential settings. Participants engaged in various dialogue scenarios based on referential communication tasks. The dataset provides extensive multimodal recordings, including motion capture, speech, gaze, and scene graphs. By offering diverse and context-rich data, it aims to improve the understanding and development of gesture generation models in 3D scenes.
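Among these modalities, the scene graphs supply the symbolic context in which dialogue and gestures can be grounded. The snippet below sketches one plausible way such a graph could be represented and queried; the node and edge schema is invented for illustration and is not the dataset's actual release format.

```python
# Illustrative scene graph for a simulator scene like AI2-THOR's:
# nodes are objects with positions, edges are spatial relations.
scene_graph = {
    "nodes": {
        "mug_1":   {"type": "Mug",   "position": (1.2, 0.9, 0.4)},
        "table_1": {"type": "Table", "position": (1.0, 0.0, 0.5)},
        "sofa_1":  {"type": "Sofa",  "position": (3.5, 0.0, 1.0)},
    },
    "edges": [
        ("mug_1", "onTopOf", "table_1"),
        ("table_1", "nextTo", "sofa_1"),
    ],
}

def objects_related_to(graph, target, relation):
    """Return all subjects standing in `relation` to `target`."""
    return [s for (s, r, o) in graph["edges"] if r == relation and o == target]

print(objects_related_to(scene_graph, "table_1", "onTopOf"))  # ['mug_1']
```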
Background
Referential communication is a mode of interaction that frequently occurs in situated dialogue. It involves identifying, describing, or giving instructions about objects, locations, or people, linking perception of the surrounding environment with conceptual understanding. It relies on multimodal expression, combining spatial language with non-verbal behaviors such as gaze and pointing gestures. When spatial context is under discussion, pointing and other deictic gestures become an important complement to spatial language, offering a more direct and often clearer way to specify locations or draw attention to particular objects or regions. For agents to comprehend and participate effectively in referential communication within situated dialogue, they must be able to both interpret and generate verbal spatial references and non-verbal cues such as pointing gestures and gaze. This dual capability enables finer-grained and more efficient information exchange.
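As a concrete illustration of interpreting such a non-verbal cue, the sketch below resolves a pointing gesture to its most likely referent by scoring each candidate object by the angle between the pointing ray and the wrist-to-object direction. The joint positions, object coordinates, and scoring rule are illustrative assumptions, not a method from the paper.

```python
import numpy as np

def resolve_pointing(origin, direction, objects):
    """Return the object whose centre lies closest (in angle) to the ray."""
    direction = direction / np.linalg.norm(direction)
    best, best_angle = None, np.inf
    for name, pos in objects.items():
        to_obj = np.asarray(pos) - origin
        to_obj = to_obj / np.linalg.norm(to_obj)
        angle = np.arccos(np.clip(direction @ to_obj, -1.0, 1.0))
        if angle < best_angle:
            best, best_angle = name, angle
    return best, np.degrees(best_angle)

# Hypothetical inputs: object centres from a scene graph, a pointing ray
# from motion-captured wrist and hand joints.
objects = {"mug": (1.2, 0.9, 0.4), "sofa": (3.5, 0.0, 1.0)}
wrist = np.array([0.0, 1.0, 0.0])
ray = np.array([1.0, -0.1, 0.35])
print(resolve_pointing(wrist, ray, objects))  # ('mug', ~1.5 degrees)
```

Gaze direction could be scored the same way, so the two cues can be combined when a pointing gesture alone is ambiguous.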
Source
Organization: arXiv
Created: October 1, 2024