MSStyleTTS: Multi-scale style modeling with hierarchical context information for expressive speech synthesis
Expressive speech synthesis is crucial for many human-computer interaction scenarios,
such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the …
Context-aware coherent speaking style prediction with hierarchical transformers for audiobook speech synthesis
Recent advances in text-to-speech have significantly improved the expressiveness of
synthesized speech. However, it is still challenging to generate speech with contextually …
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …
Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations
This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly
Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and …