Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

A comprehensive survey on video saliency detection with auditory information: the audio-visual consistency perceptual is the key!

C Chen, M Song, W Song, L Guo… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Video saliency detection (VSD) aims at fast locating the most attractive
objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied …

Video saliency forecasting transformer

C Ma, H Sun, Y Rao, J Zhou, J Lu - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Video saliency prediction (VSP) aims to imitate eye fixations of humans. However, the
potential of this task has not been fully exploited since existing VSP methods only focus on …

Transformer-based multi-scale feature integration network for video saliency prediction

X Zhou, S Wu, R Shi, B Zheng, S Wang… - … on Circuits and …, 2023 - ieeexplore.ieee.org
Most cutting-edge video saliency prediction models rely on spatiotemporal features
extracted by 3D convolutions due to its local contextual cues acquirement ability. However …

Spatio-temporal self-attention network for video saliency prediction

Z Wang, Z Liu, G Li, Y Wang, T Zhang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
3D convolutional neural networks have achieved promising results for video tasks in
computer vision, including video saliency prediction that is explored in this paper. However …

CASP-Net: Rethinking video saliency prediction from an audio-visual consistency perceptual perspective

J Xiong, G Wang, P Zhang, W Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the
selective attention mechanism of human brain. By focusing on the benefits of joint auditory …

ECANet: Explicit cyclic attention-based network for video saliency prediction

H Xue, M Sun, Y Liang - Neurocomputing, 2022 - Elsevier
Video saliency prediction has received increasing attention in the field of computer vision
research. How to model the spatio-temporal information in video frames is a key issue for …

Joint learning of audio–visual saliency prediction and sound source localization on multi-face videos

M Qiao, Y Liu, M Xu, X Deng, B Li, W Hu… - International Journal of …, 2024 - Springer
Visual and audio events simultaneously occur and both attract attention. However, most
existing saliency prediction works ignore the influence of audio and only consider vision …

Multi-scale spatiotemporal feature fusion network for video saliency prediction

Y Zhang, T Zhang, C Wu, R Tao - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Recently, video saliency prediction has attracted increasing attention, yet the improvement
of its accuracy is still subject to the insufficient use of multi-scale spatiotemporal features. To …

CAD-contextual multi-modal alignment for dynamic AVQA

A Nadeem, A Hilton, R Dawes… - Proceedings of the …, 2024 - openaccess.thecvf.com
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual
modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing …