iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre
Cross-speaker emotion transfer is a common approach to generating emotional speech
when speech data with emotion labels from target speakers is not available. This paper …
Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech
(TTS) has drawn increasing attention. However, existing works apply pre-training with character …
Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis
This paper presents a method of integrating word-level style variations (WSVs) into non-
autoregressive acoustic models for speech synthesis. WSVs are discrete latent …
Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations
This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly
Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and …