An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Deep learning-based face super-resolution: A survey

J Jiang, C Wang, X Liu, J Ma - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
Face super-resolution (FSR), also known as face hallucination, which is aimed at enhancing
the resolution of low-resolution (LR) face images to generate high-resolution face images, is …

Not only look, but also listen: Learning multimodal violence detection under weak supervision

P Wu, J Liu, Y Shi, Y Sun, F Shao, Z Wu… - Computer Vision–ECCV …, 2020 - Springer
Violence detection has been studied in computer vision for years. However, previous works
are either superficial, e.g., classification of short clips and the single scenario, or …

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Audio-driven talking face video generation with learning-based personalized head pose

R Yi, Z Ye, J Zhang, H Bao, YJ Liu - arXiv preprint arXiv:2002.10137, 2020 - arxiv.org
Real-world talking faces are often accompanied by natural head movement. However, most
existing talking face video generation methods only consider facial animation with fixed …

Voice-face homogeneity tells deepfake

H Cheng, Y Guo, T Wang, Q Li, X Chang… - ACM Transactions on …, 2023 - dl.acm.org
Detecting forgery videos is highly desirable due to the abuse of deepfake. Existing detection
approaches contribute to exploring the specific artifacts in deepfake videos and fit well on …

Cross-modal relation-aware networks for audio-visual event localization

H Xu, R Zeng, Q Wu, M Tan, C Gan - Proceedings of the 28th ACM …, 2020 - dl.acm.org
We address the challenging task of event localization, which requires the machine to
localize an event and recognize its category in unconstrained videos. Most existing methods …

Sound-guided semantic image manipulation

SH Lee, W Roh, W Byeon, SH Yoon… - Proceedings of the …, 2022 - openaccess.thecvf.com
The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …