Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions

A Rahate, R Walambe, S Ramanna, K Kotecha - Information Fusion, 2022 - Elsevier
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …
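
For orientation, the sketch below illustrates only the AVS task interface described in this entry (video frames plus an audio feature in, per-pixel sound-source mask logits out). It is a toy PyTorch encoder-decoder, not the architecture proposed in the cited paper; all module names, dimensions, and the simple multiplicative audio conditioning are assumptions.

```python
# Illustrative sketch of the audio-visual segmentation (AVS) task interface:
# frames + audio features -> per-pixel sound-source mask logits.
# NOT the cited paper's model; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class ToyAVSModel(nn.Module):
    def __init__(self, audio_dim=128, hidden=64):
        super().__init__()
        # Visual encoder: frames -> downsampled spatial feature map.
        self.visual = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Audio encoder: audio feature vector -> conditioning vector.
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Decoder: audio-conditioned features -> mask logits at input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames, audio_feats):
        # frames: (B, 3, H, W); audio_feats: (B, audio_dim)
        v = self.visual(frames)                        # (B, hidden, H/4, W/4)
        a = self.audio(audio_feats)[:, :, None, None]  # (B, hidden, 1, 1)
        fused = v * a                                  # audio-conditioned visual features
        return self.decoder(fused)                     # (B, 1, H, W) mask logits

masks = ToyAVSModel()(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
print(masks.shape)  # torch.Size([2, 1, 224, 224])
```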

A joint cross-attention model for audio-visual fusion in dimensional emotion recognition

RG Praveen, WC de Melo, N Ullah… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal emotion recognition has recently gained much attention since it can leverage
diverse and complementary relationships over multiple modalities, such as audio, visual …
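
The entry above refers to cross-attention between audio and visual feature sequences; the sketch below shows that general mechanism in PyTorch. It is a generic bidirectional cross-attention fusion, not the exact joint cross-attention formulation of the cited paper, and the feature dimension, head count, and two-dimensional (e.g. valence/arousal) output head are assumptions.

```python
# Illustrative sketch of cross-attention fusion between audio and visual
# feature sequences for dimensional emotion recognition.
# Generic formulation only; names and dimensions are hypothetical.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)  # e.g. valence/arousal regression

    def forward(self, audio, visual):
        # audio: (B, Ta, dim); visual: (B, Tv, dim)
        a_att, _ = self.a2v(audio, visual, visual)  # audio enriched by visual context
        v_att, _ = self.v2a(visual, audio, audio)   # visual enriched by audio context
        fused = torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)  # pool and concatenate
        return self.head(fused)                     # (B, 2)

out = CrossModalFusion()(torch.randn(2, 10, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 2])
```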

CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

K Li, Z Yang, L Chen, Y Yang, J Xiao - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
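
The contrastive ingredient mentioned above can be illustrated with a symmetric InfoNCE-style loss that aligns audio and visual embeddings from the same video segment while pushing apart mismatched pairs. The function below is a minimal assumed sketch of that generic loss, not the authors' positive sample propagation scheme, which operates along the audio-visual event line.

```python
# Illustrative sketch of a symmetric InfoNCE-style audio-visual contrastive
# loss: matching (same-segment) audio/visual pairs are positives, all other
# pairs in the batch are negatives. Generic sketch only; hypothetical names.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (B, D), one aligned pair per segment in the batch.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                       # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)     # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```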

Cross-modal background suppression for audio-visual event localization

Y Xia, Z Zhao - Proceedings of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …

Audio–visual fusion for emotion recognition in the valence–arousal space using joint cross-attention

RG Praveen, P Cardinal… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Automatic emotion recognition (ER) has recently gained much interest due to its potential in
many real-world applications. In this context, multimodal approaches have been shown to …

AVE-CLIP: AudioCLIP-based multi-window temporal transformer for audio visual event localization

T Mahmud, D Marculescu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory
signals in a video segment. Precise localization of the AVEs is very challenging since it …
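
As context for the localization task described above, the sketch below scores each video segment for the presence of an audio-visual event and its category from pre-extracted per-segment features. It is a generic baseline-style head, not AVE-CLIP's multi-window temporal transformer; the GRU temporal fusion and the 28-class output are assumptions.

```python
# Illustrative sketch of segment-level AVE localization: fuse per-segment audio
# and visual features over time, then predict event presence and event class
# for each segment. Generic baseline-style sketch; hypothetical names.
import torch
import torch.nn as nn

class SegmentAVEHead(nn.Module):
    def __init__(self, dim=256, num_classes=28):
        super().__init__()
        self.temporal = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.event = nn.Linear(2 * dim, 1)          # does this segment contain an AVE?
        self.cls = nn.Linear(2 * dim, num_classes)  # which event category?

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) features for T one-second segments.
        fused, _ = self.temporal(torch.cat([audio, visual], dim=-1))  # (B, T, 2*dim)
        return self.event(fused).squeeze(-1), self.cls(fused)         # (B, T), (B, T, C)

ev, cls = SegmentAVEHead()(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(ev.shape, cls.shape)  # torch.Size([2, 10]) torch.Size([2, 10, 28])
```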

Semantic and relation modulation for audio-visual event localization

H Wang, ZJ Zha, L Li, X Chen… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
We study the problem of localizing audio-visual events that are both audible and visible in a
video. Existing works focus on encoding and aligning audio and visual features at the …