A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need?
As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …
An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
GAN inversion: A survey
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN
model so that the image can be faithfully reconstructed from the inverted code by the …
Visual speech recognition for multiple languages in the wild
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …
Conditional generation of audio from video via Foley analogies
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …
Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis
K Yang, D Marković, S Krenn… - Proceedings of the …, 2022 - openaccess.thecvf.com
Since facial actions such as lip movements contain significant information about speech
content, it is not surprising that audio-visual speech enhancement methods are more …
Multi-modality associative bridging through memory: Speech sound recollected from face video
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can
utilize both audio and visual information, even with uni-modal inputs. We exploit a memory …
Lip to speech synthesis with visual context attentional GAN
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual
Context Attentional GAN (VCA-GAN), which can jointly model local and global lip …
LipLearner: Customizable silent speech interactions on mobile devices
Silent speech interface is a promising technology that enables private communications in
natural language. However, previous approaches only support a small and inflexible …
End-to-end video-to-speech synthesis using generative adversarial networks
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken
utterance. Previous approaches to this task have relied on a two-step process where an …