MSStyleTTS: Multi-scale style modeling with hierarchical context information for expressive speech synthesis

S Lei, Y Zhou, L Chen, Z Wu, X Wu… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
Expressive speech synthesis is crucial for many human-computer interaction scenarios,
such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the …

Context-aware coherent speaking style prediction with hierarchical transformers for audiobook speech synthesis

S Lei, Y Zhou, L Chen, Z Wu, S Kang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Recent advances in text-to-speech have significantly improved the expressiveness of
synthesized speech. However, it is still challenging to generate speech with contextually …

PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

ZC Liu, L Chen, YJ Hu, ZH Ling… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …

[PDF] Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations

ZC Liu, ZH Ling, YJ Hu, J Pan, YD Wu, JW Wang - isca-archive.org
This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly
Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and …

[CITATION][C] Human-computer interaction for virtual-real fusion

JH Tao, JT Gong, N Gao, SW Fu, S Liang, C Yu - 中国图象图形学报 (Journal of Image and Graphics), 2023