"Do you follow me?": A Survey of Recent Approaches in Dialogue State Tracking

L Jacqmin, LM Rojas-Barahona, B Favre - arXiv preprint arXiv:2207.14627, 2022 - arxiv.org
While communicating with a user, a task-oriented dialogue system has to track the user's
needs at each turn according to the conversation history. This process called dialogue state …

Joyful: Joint modality fusion and graph contrastive learning for multimodal emotion recognition

D Li, Y Wang, K Funakoshi, M Okumura - arXiv preprint arXiv:2311.11009, 2023 - arxiv.org
Multimodal emotion recognition aims to recognize emotions for each utterance of multiple
modalities, which has received increasing attention for its application in human-machine …

Multi-modal Video Dialog State Tracking in the Wild

A Abdessaied, L Shi, A Bulling - European Conference on Computer …, 2025 - Springer
We present \(\mathbb{MST}_\mathbb{MIXER}\), a novel video dialog model
operating over a generic multi-modal state tracking scheme. Current models that claim to …

VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue

Y Li, B Hui, Z Yin, W He, R Luo, Y Long, M Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Visually-grounded dialog systems, which integrate multiple modes of communication such
as text and visual inputs, have become an increasingly popular area of investigation …

OSCaR: Object State Captioning and State Change Representation

N Nguyen, J Bi, A Vosoughi, Y Tian, P Fazli… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability of intelligent models to extrapolate and comprehend changes in object states
is a crucial yet demanding aspect of AI research, particularly through the lens of human …

Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

H Wang, B Guo, M Chen, Q Zhang, Y Ding… - Frontiers of Computer …, 2025 - Springer
Video-Grounded Dialogue System (VGDS), focusing on generating reasonable
responses based on multi-turn dialogue contexts and a given video, has received intensive …

HERO: A Multi-modal Approach on Mobile Devices for Visual-Aware Conversational Assistance in Industrial Domains

C Bonanno, F Ragusa, A Furnari… - … Conference on Image …, 2023 - Springer
We present HERO, an artificial assistant designed to communicate with users with both
natural language and images to aid them carrying out procedures in industrial contexts. Our …

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

A Abdessaied, M von Hochmeister, A Bulling - arXiv preprint arXiv …, 2024 - arxiv.org
We present the Object Language Video Transformer (OLViT), a novel model for video dialog
operating over a multi-modal attention-based dialog state tracker. Existing video dialog …

Talking with Machines: A Comprehensive Survey of Emergent Dialogue Systems

W Tholke - arXiv preprint arXiv:2305.16324, 2023 - arxiv.org
From the earliest experiments in the 20th century to the utilization of large language models
and transformers, dialogue systems research has continued to evolve, playing crucial roles …

Enhancing Augmented Reality Dialogue Systems with Multi-Modal Referential Information

Z He, Z Cai - 2023 China Automation Congress (CAC), 2023 - ieeexplore.ieee.org
In this paper, we present a novel approach to advancing augmented reality (AR) dialogue
systems, bridging the gap between two-dimensional spaces and immersive virtual …