Imaginary voice: Face-styled diffusion model for text-to-speech

J Lee, JS Chung, SW Chung - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices
learnt from facial characteristics. Inspired by the natural fact that people can imagine the …

Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading

L Qu, C Weber, S Wermter - IEEE transactions on neural …, 2022 - ieeexplore.ieee.org
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training
for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio …

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Y Jang, JH Kim, J Ahn, D Kwak… - Proceedings of the …, 2024 - openaccess.thecvf.com
The goal of this work is to simultaneously generate natural talking faces and speech outputs
from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech …

Residual-guided personalized speech synthesis based on face image

J Wang, Z Wang, X Hu, X Li, Q Fang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Previous works derive personalized speech features by training the model on a large
dataset composed of his/her audio sounds. It was reported that face information has a strong …

Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image

N Jiang, B Sun, T Sim, J Han - 33rd USENIX Security Symposium …, 2024 - usenix.org
We present Foice, a novel deepfake attack against voice authentication systems. Foice
generates a synthetic voice of the victim from just a single image of the victim's face, without …

MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

W Guan, Y Li, T Li, H Huang, F Wang, J Lin… - Proceedings of the …, 2024 - ojs.aaai.org
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style
information into text content to generate corresponding speech with a specific style …

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

ZY Sheng, Y Ai, YN Chen, ZH Ling - Proceedings of the 31st ACM …, 2023 - dl.acm.org
This paper presents a novel task, zero-shot voice conversion based on face images
(zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any …

M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Y Liu, LF Wei, X Qian, TH Zhang, SL Chen… - Pattern Recognition …, 2024 - Elsevier
Dubbing refers to the procedure of recording characters by professional voice actors in films
and games. It is more expressive and immersive than conventional Text-to-Speech (TTS) …

Speech synthesis with face embeddings

X Wu, S Ji, J Wang, Y Guo - Applied Intelligence, 2022 - Springer
Human beings are capable of imagining a person's voice according to his or her
appearance because different people have different voice characteristics. Although …

Zero-shot face-based voice conversion: bottleneck-free speech disentanglement in the real-world scenario

SE Weng, HH Shuai, WH Cheng - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Often a face has a voice. Appearance sometimes has a strong relationship with one's voice.
In this work, we study how a face can be converted to a voice, which is a face-based voice …