Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted...

J Lee, JS Chung, SW Chung - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org

The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices
learnt from facial characteristics. Inspired by the natural fact that people can imagine the …

Cited by 21 · Related articles · All 6 versions

Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading

L Qu, C Weber, S Wermter - IEEE transactions on neural …, 2022 - ieeexplore.ieee.org

The aim of this work is to investigate the impact of crossmodal self-supervised pre-training
for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio …

Cited by 19 · Related articles · All 9 versions

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Y Jang, JH Kim, J Ahn, D Kwak… - Proceedings of the …, 2024 - openaccess.thecvf.com

The goal of this work is to simultaneously generate natural talking faces and speech outputs
from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech …

Cited by 3 · Related articles · All 5 versions

Residual-guided personalized speech synthesis based on face image

J Wang, Z Wang, X Hu, X Li, Q Fang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org

Previous works derive personalized speech features by training the model on a large
dataset composed of his/her audio sounds. It was reported that face information has a strong …

Cited by 15 · Related articles · All 4 versions

Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image

N Jiang, B Sun, T Sim, J Han - 33rd USENIX Security Symposium …, 2024 - usenix.org

We present Foice, a novel deepfake attack against voice authentication systems. Foice
generates a synthetic voice of the victim from just a single image of the victim's face, without …

Cited by 1 · Related articles · All 2 versions

MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

W Guan, Y Li, T Li, H Huang, F Wang, J Lin… - Proceedings of the …, 2024 - ojs.aaai.org

The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style
information into text content to generate corresponding speech with a specific style …

Cited by 6 · Related articles · All 3 versions

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

ZY Sheng, Y Ai, YN Chen, ZH Ling - Proceedings of the 31st ACM …, 2023 - dl.acm.org

This paper presents a novel task, zero-shot voice conversion based on face images
(zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any …

Cited by 3 · Related articles · All 3 versions

M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Y Liu, LF Wei, X Qian, TH Zhang, SL Chen… - Pattern Recognition …, 2024 - Elsevier

Dubbing refers to the procedure of recording characters by professional voice actors in films
and games. It is more expressive and immersive than conventional Text-to-Speech (TTS) …

Cited by 2 · Related articles · All 3 versions

Speech synthesis with face embeddings

X Wu, S Ji, J Wang, Y Guo - Applied Intelligence, 2022 - Springer

Human beings are capable of imagining a person's voice according to his or her
appearance because different people have different voice characteristics. Although …

Cited by 7 · Related articles · All 4 versions

Zero-shot face-based voice conversion: bottleneck-free speech disentanglement in the real-world scenario

SE Weng, HH Shuai, WH Cheng - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org

Often a face has a voice. Appearance sometimes has a strong relationship with one's voice.
In this work, we study how a face can be converted to a voice, which is a face-based voice …

Cited by 1 · Related articles · All 3 versions