Learning individual speaking styles for accurate lip to speech synthesis

C Zhang, C Zhang, S Zheng, Y Qiao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org

As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …

Cited by 177 Related articles All 4 versions

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Cited by 263 Related articles All 6 versions

GAN inversion: A survey

W Xia, Y Zhang, Y Yang, JH Xue… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

GAN inversion aims to invert a given image back into the latent space of a pretrained GAN
model so that the image can be faithfully reconstructed from the inverted code by the …

Cited by 561 Related articles All 13 versions

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com

Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

Cited by 115 Related articles All 7 versions

Conditional generation of audio from video via Foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com

The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …

Cited by 24 Related articles All 7 versions

Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis

K Yang, D Marković, S Krenn… - Proceedings of the …, 2022 - openaccess.thecvf.com

Since facial actions such as lip movements contain significant information about speech
content, it is not surprising that audio-visual speech enhancement methods are more …

Cited by 37 Related articles All 5 versions

Multi-modality associative bridging through memory: Speech sound recollected from face video

M Kim, J Hong, SJ Park, YM Ro - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

In this paper, we introduce a novel audio-visual multi-modal bridging framework that can
utilize both audio and visual information, even with uni-modal inputs. We exploit a memory …

Cited by 45 Related articles All 8 versions

Lip to speech synthesis with visual context attentional GAN

M Kim, J Hong, YM Ro - Advances in Neural Information …, 2021 - proceedings.neurips.cc

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual
Context Attentional GAN (VCA-GAN), which can jointly model local and global lip …

Cited by 41 Related articles All 9 versions

LipLearner: Customizable silent speech interactions on mobile devices

Z Su, S Fang, J Rekimoto - Proceedings of the 2023 CHI Conference on …, 2023 - dl.acm.org

Silent speech interface is a promising technology that enables private communications in
natural language. However, previous approaches only support a small and inflexible …

Cited by 20 Related articles All 6 versions

End-to-end video-to-speech synthesis using generative adversarial networks

R Mira, K Vougioukas, P Ma, S Petridis… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Video-to-speech is the process of reconstructing the audio speech from a video of a spoken
utterance. Previous approaches to this task have relied on a two-step process where an …

Cited by 51 Related articles All 6 versions