Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects

S Zhang, Y Yang, C Chen, X Zhang, Q Leng… - Expert Systems with …, 2024 - Elsevier
Emotion recognition has recently attracted extensive interest due to its significant
applications in human–computer interaction. The expression of human emotion depends on …

A survey of multimodal deep generative models

M Suzuki, Y Matsuo - Advanced Robotics, 2022 - Taylor & Francis
Multimodal learning is a framework for building models that make predictions based on
different types of modalities. Important challenges in multimodal learning are the inference of …

Dawn of the transformer era in speech emotion recognition: closing the valence gap

J Wagner, A Triantafyllopoulos… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Recent advances in transformer-based architectures have shown promise in several
machine learning tasks. In the audio domain, such architectures have been successfully …

Decoupled multimodal distilling for emotion recognition

Y Li, Y Wang, Z Cui - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Human multimodal emotion recognition (MER) aims to perceive human emotions via
language, visual and acoustic modalities. Despite the impressive performance of previous …

Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis

W Han, H Chen, S Poria - arXiv preprint arXiv:2109.00412, 2021 - arxiv.org
In multimodal sentiment analysis (MSA), the performance of a model highly depends on the
quality of synthesized embeddings. These embeddings are generated from the upstream …

Disentangled representation learning for multimodal emotion recognition

D Yang, S Huang, H Kuang, Y Du… - Proceedings of the 30th …, 2022 - dl.acm.org
Multimodal emotion recognition aims to identify human emotions from text, audio, and visual
modalities. Previous methods either explore correlations between different modalities or …

MISA: Modality-invariant and -specific representations for multimodal sentiment analysis

D Hazarika, R Zimmermann, S Poria - Proceedings of the 28th ACM …, 2020 - dl.acm.org
Multimodal Sentiment Analysis is an active area of research that leverages multimodal
signals for affective understanding of user-generated videos. The predominant approach …

Multimodal transformer for unaligned multimodal language sequences

YHH Tsai, S Bai, PP Liang, JZ Kolter… - Proceedings of the …, 2019 - ncbi.nlm.nih.gov
Human language is often multimodal, comprising a mixture of natural language,
facial gestures, and acoustic behaviors. However, two major challenges in modeling such …

UniVL: A unified video and language pre-training model for multimodal understanding and generation

H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li… - arXiv preprint arXiv …, 2020 - arxiv.org
With the recent success of pre-training techniques for NLP and image-linguistic tasks,
video-linguistic pre-training works have gradually been developed to improve video-text …

Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph

AAB Zadeh, PP Liang, S Poria, E Cambria… - Proceedings of the …, 2018 - aclanthology.org
Analyzing human multimodal language is an emerging area of research in NLP. Intrinsically
this language is multimodal (heterogeneous), sequential and asynchronous; it consists of …