MM-Conv
MM-Conv is a multimodal conversational dataset for virtual humans, created at KTH Royal Institute of Technology in Sweden. The dataset records dialogues between pairs of participants situated in the AI2-THOR simulator via VR headsets, comprising 6.7 hours of synchronized speech, motion capture, facial expression, and gaze data. By combining virtual reality with motion capture, the recording setup yields rich, well-structured multimodal data. The dataset primarily supports gesture generation models in 3D scenes, targeting two problems in task-oriented scenarios: generating more natural gestures and grounding spatial information.
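Because the speech, motion, face, and gaze streams are captured at different native rates, a typical first step for downstream use is resampling them onto a shared clock. Below is a minimal sketch of such alignment; the sampling rates, array shapes, and variable names are hypothetical and do not reflect the dataset's actual release format.

```python
import numpy as np

def align_to_clock(timestamps, values, clock):
    # For each clock tick, take the most recent sample at or before it
    # (zero-order hold); clip handles ticks before the first sample.
    idx = np.searchsorted(timestamps, clock, side="right") - 1
    idx = np.clip(idx, 0, len(timestamps) - 1)
    return values[idx]

# Hypothetical native rates: motion capture at 120 Hz, gaze at 90 Hz.
mocap_t = np.arange(0.0, 10.0, 1 / 120)
mocap = np.random.randn(len(mocap_t), 3)   # e.g. a wrist joint's (x, y, z)
gaze_t = np.arange(0.0, 10.0, 1 / 90)
gaze = np.random.randn(len(gaze_t), 3)     # gaze direction vectors

clock = np.arange(0.0, 10.0, 1 / 30)       # shared 30 Hz analysis clock
mocap_30 = align_to_clock(mocap_t, mocap, clock)
gaze_30 = align_to_clock(gaze_t, gaze, clock)
print(mocap_30.shape, gaze_30.shape)       # (300, 3) (300, 3)
```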
Description
MM-Conv: A Multi-Modal Conversational Dataset for Virtual Humans
Overview
MM-Conv is a multimodal conversational dataset for virtual humans, developed by Anna Deichler, Jim O'Regan, and Jonas Beskow at KTH Royal Institute of Technology and presented at the ECCV Multimodal Agents Workshop.
Authors
- Anna Deichler
- Jim O'Regan
- Jonas Beskow
Institution
KTH Royal Institute of Technology
Workshop
ECCV Multimodal Agents Workshop
Abstract
This paper introduces a new dataset that records dialogues between participants situated in a physical simulator (AI2-THOR) via VR head-mounted displays. The main goal is to expand the field of collaborative speech-gesture generation by incorporating rich contextual information in referential settings. Participants engaged in various dialogue scenarios based on referential communication tasks. The dataset provides extensive multimodal recordings, including motion capture, speech, gaze, and scene graphs. By offering diverse and context-rich data, it aims to improve the understanding and development of gesture generation models in 3D scenes.
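Among these modalities, the scene graphs supply the symbolic context in which dialogue and gestures can be grounded. The snippet below sketches one plausible way such a graph could be represented and queried; the node and edge schema is invented for illustration and is not the dataset's actual release format.

```python
# Illustrative scene graph for a simulator scene like AI2-THOR's:
# nodes are objects with positions, edges are spatial relations.
scene_graph = {
    "nodes": {
        "mug_1":   {"type": "Mug",   "position": (1.2, 0.9, 0.4)},
        "table_1": {"type": "Table", "position": (1.0, 0.0, 0.5)},
        "sofa_1":  {"type": "Sofa",  "position": (3.5, 0.0, 1.0)},
    },
    "edges": [
        ("mug_1", "onTopOf", "table_1"),
        ("table_1", "nextTo", "sofa_1"),
    ],
}

def objects_related_to(graph, target, relation):
    """Return all subjects standing in `relation` to `target`."""
    return [s for (s, r, o) in graph["edges"] if r == relation and o == target]

print(objects_related_to(scene_graph, "table_1", "onTopOf"))  # ['mug_1']
```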
Background
Referential communication is a mode of interaction that frequently occurs in situated dialogue. It involves identifying, describing, or giving instructions about objects, locations, or people, linking perception of the surrounding environment with conceptual understanding. It relies on multimodal expression, combining spatial language with non-verbal behaviors such as gaze and pointing gestures. When spatial context is under discussion, pointing and other deictic gestures become an important complement to spatial language, offering a more direct and often clearer way to specify locations or draw attention to particular objects or regions. For agents to comprehend and participate effectively in referential communication within situated dialogue, they must be able to both interpret and generate verbal spatial references and non-verbal cues such as pointing gestures and gaze. This dual capability enables finer-grained and more efficient information exchange.
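As a concrete illustration of interpreting such a non-verbal cue, the sketch below resolves a pointing gesture to its most likely referent by scoring each candidate object by the angle between the pointing ray and the wrist-to-object direction. The joint positions, object coordinates, and scoring rule are illustrative assumptions, not a method from the paper.

```python
import numpy as np

def resolve_pointing(origin, direction, objects):
    """Return the object whose centre lies closest (in angle) to the ray."""
    direction = direction / np.linalg.norm(direction)
    best, best_angle = None, np.inf
    for name, pos in objects.items():
        to_obj = np.asarray(pos) - origin
        to_obj = to_obj / np.linalg.norm(to_obj)
        angle = np.arccos(np.clip(direction @ to_obj, -1.0, 1.0))
        if angle < best_angle:
            best, best_angle = name, angle
    return best, np.degrees(best_angle)

# Hypothetical inputs: object centres from a scene graph, a pointing ray
# from motion-captured wrist and hand joints.
objects = {"mug": (1.2, 0.9, 0.4), "sofa": (3.5, 0.0, 1.0)}
wrist = np.array([0.0, 1.0, 0.0])
ray = np.array([1.0, -0.1, 0.35])
print(resolve_pointing(wrist, ray, objects))  # ('mug', ~1.5 degrees)
```

Gaze direction could be scored the same way, so the two cues can be combined when a pointing gesture alone is ambiguous.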
Source
Organization: arXiv
Created: October 1, 2024