An overview of affective speech synthesis and conversion in the deep learning era

A Triantafyllopoulos, BW Schuller… - Proceedings of the …, 2023 - ieeexplore.ieee.org
Speech is the fundamental mode of human communication, and its synthesis has long been
a core priority in human–computer interaction research. In recent years, machines have …

EmoTalk: Speech-driven emotional disentanglement for 3D face animation

Z Peng, H Wu, Z Song, H Xu, X Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Speech-driven 3D face animation aims to generate realistic facial expressions that match
the speech content and emotion. However, existing methods often neglect emotional facial …

PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions

G Liu, Y Zhang, Y Lei, Y Chen, R Wang, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Style transfer TTS has shown impressive performance in recent years. However, style
control is often restricted to systems built on expressive speech recordings with discrete style …

DiCLET-TTS: Diffusion model based cross-lingual emotion transfer for text-to-speech—A study between English and Mandarin

T Li, C Hu, J Cong, X Zhu, J Li, Q Tian… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
While the performance of cross-lingual TTS based on monolingual corpora has been
significantly improved recently, generating cross-lingual speech still suffers from the foreign …

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1), which is challenging as L2 is different from L1 in terms …

METTS: Multilingual emotional text-to-speech by cross-speaker and cross-lingual emotion transfer

X Zhu, Y Lei, T Li, Y Zhang, H Zhou… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Previous multilingual text-to-speech (TTS) approaches have considered leveraging
monolingual speaker data to enable cross-lingual speech synthesis. However, such data …

MM-TTS: A unified framework for multimodal, prompt-induced emotional text-to-speech synthesis

X Li, ZQ Cheng, JY He, X Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
Emotional text-to-speech (E-TTS) synthesis has gained significant attention in recent years
due to its potential to enhance human–computer interaction. However, current E-TTS …

Controllable accented text-to-speech synthesis

R Liu, B Sisman, G Gao, H Li - arXiv preprint arXiv:2209.10804, 2022 - arxiv.org
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1). Accented TTS synthesis is challenging as L2 is different …

Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis

Y Li, X Zhu, Y Lei, H Li, J Liu, D Xie… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from
an arbitrary speech reference in the source language to the synthetic speech in the target …

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

G Zhang, T Merritt, MS Ribeiro, B Tura-Vecino… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong
assumptions about the distributions of the target data space. Aiming to improve those …