Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …
Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Audio–visual segmentation
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …
A joint cross-attention model for audio-visual fusion in dimensional emotion recognition
Multi-modal emotion recognition has recently gained much attention since it can leverage
diverse and complementary relationships over multiple modalities, such as audio, visual …
CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …
Contrastive positive sample propagation along the audio-visual event line
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
Cross-modal background suppression for audio-visual event localization
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …
Audio–visual fusion for emotion recognition in the valence–arousal space using joint cross-attention
RG Praveen, P Cardinal… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Automatic emotion recognition (ER) has recently gained much interest due to its potential in
many real-world applications. In this context, multimodal approaches have been shown to …
Ave-clip: Audioclip-based multi-window temporal transformer for audio visual event localization
T Mahmud, D Marculescu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory
signals in a video segment. Precise localization of the AVEs is very challenging since it …
Semantic and relation modulation for audio-visual event localization
We study the problem of localizing audio-visual events that are both audible and visible in a
video. Existing works focus on encoding and aligning audio and visual features at the …