iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre

G Zhang, Y Qin, W Zhang, J Wu, M Li… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Cross-speaker emotion transfer is a common approach to generating emotional speech
when speech data with emotion labels from target speakers is not available. This paper …

Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech

G Zhang, K Song, X Tan, D Tan, Y Yan, Y Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, leveraging BERT pre-training to improve the phoneme encoder in text-to-speech
(TTS) has drawn increasing attention. However, these works apply pre-training with character …

[PDF] Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis

Z Liu, NQ Wu, Y Zhang, Z Ling - INTERSPEECH, 2022 - isca-archive.org
This paper presents a method of integrating word-level style variations (WSVs) into non-
autoregressive acoustic models for speech synthesis. WSVs are discrete latent …

[PDF] Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations

ZC Liu, ZH Ling, YJ Hu, J Pan, YD Wu, JW Wang - isca-archive.org
This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly
Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and …