ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved humanlike synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS

D Guo, X Zhu, L Xue, T Li, Y Lv… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Recent advances in text-to-speech, particularly those based on Graph Neural Networks
(GNNs), have significantly improved the expressiveness of short-form synthetic speech …

Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective

K Liu, J Wei, J Zou, P Wang, Y Yang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Multi-view speech emotion recognition (SER) based on the pre-trained model has gained
attention in the last two years, which shows great potential in improving the model …

PE-wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

ZC Liu, L Chen, YJ Hu, ZH Ling… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

D Guo, X Zhu, L Xue, Y Zhang, W Tian, L Xie - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in text-to-speech have significantly improved the expressiveness of
synthetic speech. However, a major challenge remains in generating speech that captures …