Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions

A Rahate, R Walambe, S Ramanna, K Kotecha - Information Fusion, 2022 - Elsevier
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …
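
For orientation, the sketch below illustrates only the AVS task interface described in this entry (video frames plus an audio feature in, per-pixel sound-source mask logits out). It is a toy PyTorch encoder-decoder, not the architecture proposed in the cited paper; all module names, dimensions, and the simple multiplicative audio conditioning are assumptions.

```python
# Illustrative sketch of the audio-visual segmentation (AVS) task interface:
# frames + audio features -> per-pixel sound-source mask logits.
# NOT the cited paper's model; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class ToyAVSModel(nn.Module):
    def __init__(self, audio_dim=128, hidden=64):
        super().__init__()
        # Visual encoder: frames -> downsampled spatial feature map.
        self.visual = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Audio encoder: audio feature vector -> conditioning vector.
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Decoder: audio-conditioned features -> mask logits at input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames, audio_feats):
        # frames: (B, 3, H, W); audio_feats: (B, audio_dim)
        v = self.visual(frames)                        # (B, hidden, H/4, W/4)
        a = self.audio(audio_feats)[:, :, None, None]  # (B, hidden, 1, 1)
        fused = v * a                                  # audio-conditioned visual features
        return self.decoder(fused)                     # (B, 1, H, W) mask logits

masks = ToyAVSModel()(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
print(masks.shape)  # torch.Size([2, 1, 224, 224])
```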

A joint cross-attention model for audio-visual fusion in dimensional emotion recognition

RG Praveen, WC de Melo, N Ullah… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal emotion recognition has recently gained much attention since it can leverage
diverse and complementary relationships over multiple modalities, such as audio, visual …
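
The entry above refers to cross-attention between audio and visual feature sequences; the sketch below shows that general mechanism in PyTorch. It is a generic bidirectional cross-attention fusion, not the exact joint cross-attention formulation of the cited paper, and the feature dimension, head count, and two-dimensional (e.g. valence/arousal) output head are assumptions.

```python
# Illustrative sketch of cross-attention fusion between audio and visual
# feature sequences for dimensional emotion recognition.
# Generic formulation only; names and dimensions are hypothetical.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)  # e.g. valence/arousal regression

    def forward(self, audio, visual):
        # audio: (B, Ta, dim); visual: (B, Tv, dim)
        a_att, _ = self.a2v(audio, visual, visual)  # audio enriched by visual context
        v_att, _ = self.v2a(visual, audio, audio)   # visual enriched by audio context
        fused = torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)  # pool and concatenate
        return self.head(fused)                     # (B, 2)

out = CrossModalFusion()(torch.randn(2, 10, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 2])
```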

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
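
The contrastive ingredient mentioned above can be illustrated with a symmetric InfoNCE-style loss that aligns audio and visual embeddings from the same video segment while pushing apart mismatched pairs. The function below is a minimal assumed sketch of that generic loss, not the authors' positive sample propagation scheme, which operates along the audio-visual event line.

```python
# Illustrative sketch of a symmetric InfoNCE-style audio-visual contrastive
# loss: matching (same-segment) audio/visual pairs are positives, all other
# pairs in the batch are negatives. Generic sketch only; hypothetical names.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (B, D), one aligned pair per segment in the batch.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                       # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)     # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```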

Cross-modal background suppression for audio-visual event localization

Y Xia, Z Zhao - Proceedings of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …

Audio–visual fusion for emotion recognition in the valence–arousal space using joint cross-attention

RG Praveen, P Cardinal… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Automatic emotion recognition (ER) has recently gained much interest due to its potential in
many real-world applications. In this context, multimodal approaches have been shown to …

AVE-CLIP: AudioCLIP-based multi-window temporal transformer for audio visual event localization

T Mahmud, D Marculescu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory
signals in a video segment. Precise localization of the AVEs is very challenging since it …
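
As context for the localization task described above, the sketch below scores each video segment for the presence of an audio-visual event and its category from pre-extracted per-segment features. It is a generic baseline-style head, not AVE-CLIP's multi-window temporal transformer; the GRU temporal fusion and the 28-class output are assumptions.

```python
# Illustrative sketch of segment-level AVE localization: fuse per-segment audio
# and visual features over time, then predict event presence and event class
# for each segment. Generic baseline-style sketch; hypothetical names.
import torch
import torch.nn as nn

class SegmentAVEHead(nn.Module):
    def __init__(self, dim=256, num_classes=28):
        super().__init__()
        self.temporal = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.event = nn.Linear(2 * dim, 1)          # does this segment contain an AVE?
        self.cls = nn.Linear(2 * dim, num_classes)  # which event category?

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) features for T one-second segments.
        fused, _ = self.temporal(torch.cat([audio, visual], dim=-1))  # (B, T, 2*dim)
        return self.event(fused).squeeze(-1), self.cls(fused)         # (B, T), (B, T, C)

ev, cls = SegmentAVEHead()(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(ev.shape, cls.shape)  # torch.Size([2, 10]) torch.Size([2, 10, 28])
```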

Semantic and relation modulation for audio-visual event localization

H Wang, ZJ Zha, L Li, X Chen… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
We study the problem of localizing audio-visual events that are both audible and visible in a
video. Existing works focus on encoding and aligning audio and visual features at the …