A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need?

C Zhang, C Zhang, S Zheng, Y Qiao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks whose purpose is to
extract one or multiple target speech signals, respectively, from a mixture of sounds …

GAN inversion: A survey

W Xia, Y Zhang, Y Yang, JH Xue… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN
model so that the image can be faithfully reconstructed from the inverted code by the …

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

Conditional generation of audio from video via foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …

Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis

K Yang, D Marković, S Krenn… - Proceedings of the …, 2022 - openaccess.thecvf.com
Since facial actions such as lip movements contain significant information about speech
content, it is not surprising that audio-visual speech enhancement methods are more …

Multi-modality associative bridging through memory: Speech sound recollected from face video

M Kim, J Hong, SJ Park, YM Ro - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can
utilize both audio and visual information, even with uni-modal inputs. We exploit a memory …

Lip-to-speech synthesis with Visual Context Attentional GAN

M Kim, J Hong, YM Ro - Advances in Neural Information …, 2021 - proceedings.neurips.cc
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual
Context Attentional GAN (VCA-GAN), which can jointly model local and global lip …

LipLearner: Customizable silent speech interactions on mobile devices

Z Su, S Fang, J Rekimoto - Proceedings of the 2023 CHI Conference on …, 2023 - dl.acm.org
Silent speech interface is a promising technology that enables private communications in
natural language. However, previous approaches only support a small and inflexible …

End-to-end video-to-speech synthesis using generative adversarial networks

R Mira, K Vougioukas, P Ma, S Petridis… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken
utterance. Previous approaches to this task have relied on a two-step process where an …