ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
Neural text-to-speech (TTS) has achieved humanlike synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …
HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS
Recent advances in text-to-speech, particularly those based on Graph Neural Networks
(GNNs), have significantly improved the expressiveness of short-form synthetic speech …
Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective
Multi-view speech emotion recognition (SER) based on the pre-trained model has gained
attention in the last two years, which shows great potential in improving the model …
PE-wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …
Text-aware and Context-aware Expressive Audiobook Speech Synthesis
Recent advances in text-to-speech have significantly improved the expressiveness of
synthetic speech. However, a major challenge remains in generating speech that captures …