Imaginary voice: Face-styled diffusion model for text-to-speech
The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices
learnt from facial characteristics. Inspired by the natural fact that people can imagine the …
Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training
for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio …
Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
The goal of this work is to simultaneously generate natural talking faces and speech outputs
from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech …
Residual-guided personalized speech synthesis based on face image
J Wang, Z Wang, X Hu, X Li, Q Fang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Previous works derive personalized speech features by training the model on a large
dataset composed of his/her audio sounds. It was reported that face information has a strong …
Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image
We present Foice, a novel deepfake attack against voice authentication systems. Foice
generates a synthetic voice of the victim from just a single image of the victim's face, without …
MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style
information into text content to generate corresponding speech with a specific style …
Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
This paper presents a novel task, zero-shot voice conversion based on face images (zero-
shot FaceVC), which aims at converting the voice characteristics of an utterance from any …
M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing
Dubbing refers to the procedure of recording characters by professional voice actors in films
and games. It is more expressive and immersive than conventional Text-to-Speech (TTS) …
Zero-shot face-based voice conversion: bottleneck-free speech disentanglement in the real-world scenario
Often a face has a voice. Appearance sometimes has a strong relationship with one's voice.
In this work, we study how a face can be converted to a voice, which is a face-based voice …