Speech2Face: Learning the face behind a voice

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Cited by 275 · Related articles · All 6 versions

Deep learning-based face super-resolution: A survey

J Jiang, C Wang, X Liu, J Ma - ACM Computing Surveys (CSUR), 2021 - dl.acm.org

Face super-resolution (FSR), also known as face hallucination, which is aimed at enhancing
the resolution of low-resolution (LR) face images to generate high-resolution face images, is …

Cited by 138 · Related articles · All 6 versions

Not only look, but also listen: Learning multimodal violence detection under weak supervision

P Wu, J Liu, Y Shi, Y Sun, F Shao, Z Wu… - Computer Vision–ECCV …, 2020 - Springer

Violence detection has been studied in computer vision for years. However, previous works
are either superficial, e.g., classification of short clips, and the single scenario, or …

Cited by 324 · Related articles · All 6 versions

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org

We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

Cited by 184 · Related articles · All 9 versions

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com

How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Cited by 28 · Related articles · All 6 versions

Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer

Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Cited by 179 · Related articles · All 12 versions

Audio-driven talking face video generation with learning-based personalized head pose

R Yi, Z Ye, J Zhang, H Bao, YJ Liu - arXiv preprint arXiv:2002.10137, 2020 - arxiv.org

Real-world talking faces are often accompanied by natural head movement. However, most
existing talking face video generation methods only consider facial animation with fixed …

Cited by 162 · Related articles · All 2 versions

Voice-face homogeneity tells deepfake

H Cheng, Y Guo, T Wang, Q Li, X Chang… - ACM Transactions on …, 2023 - dl.acm.org

Detecting forgery videos is highly desirable due to the abuse of deepfake. Existing detection
approaches contribute to exploring the specific artifacts in deepfake videos and fit well on …

Cited by 66 · Related articles · All 4 versions

Cross-modal relation-aware networks for audio-visual event localization

H Xu, R Zeng, Q Wu, M Tan, C Gan - Proceedings of the 28th ACM …, 2020 - dl.acm.org

We address the challenging task of event localization, which requires the machine to
localize an event and recognize its category in unconstrained videos. Most existing methods …

Cited by 87 · Related articles · All 3 versions

Sound-guided semantic image manipulation

SH Lee, W Roh, W Byeon, SH Yoon… - Proceedings of the …, 2022 - openaccess.thecvf.com

The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …

Cited by 55 · Related articles · All 9 versions