MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Y Lei, S Yang, X Wang, L Xie - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Expressive synthetic speech is essential for many human-computer interaction and audio
broadcast scenarios, and thus synthesizing expressive speech has attracted much attention …

Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

H Barakat, O Turk, C Demiroglu - EURASIP Journal on Audio, Speech, and …, 2024 - Springer
Speech synthesis has made significant strides thanks to the transition from machine learning
to deep learning models. Contemporary text-to-speech (TTS) models possess the capability …

MSStyleTTS: Multi-scale style modeling with hierarchical context information for expressive speech synthesis

S Lei, Y Zhou, L Chen, Z Wu, X Wu… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
Expressive speech synthesis is crucial for many human-computer interaction scenarios,
such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the …

Dynamic Invariant‐Specific Representation Fusion Network for Multimodal Sentiment Analysis

J He, H Yang, C Zhang, H Chen… - Computational …, 2022 - Wiley Online Library
Multimodal sentiment analysis (MSA) aims to infer emotions from linguistic, auditory, and
visual sequences. Multimodal information representation method and fusion technology are …

Towards expressive speaking style modelling with hierarchical context information for mandarin speech synthesis

S Lei, Y Zhou, L Chen, Z Wu, S Kang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Previous works on expressive speech synthesis mainly focus on the current sentence. The
context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same …

Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis

YJ Zhang, C Zhang, W Song, Z Zhang… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
When humans speak multiple utterances in a continuous manner, the prosodic features
generated in each utterance are related to those in its neighbouring utterances. Such cross …

MSM-VC: high-fidelity source style transfer for non-parallel voice conversion by multi-scale style modeling

Z Wang, X Wang, Q Xie, T Li, L Xie… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
In addition to conveying the linguistic content from source speech to converted speech,
maintaining the speaking style of source speech also plays an important role in the voice …

Unsupervised multi-scale expressive speaking style modeling with hierarchical context information for audiobook speech synthesis

X Chen, S Lei, Z Wu, D Xu, W Zhao… - Proceedings of the 29th …, 2022 - aclanthology.org
Naturalness and expressiveness are crucial for audiobook speech synthesis, but are currently
limited by the averaged global-scale speaking style representation. In this paper, we …

Context-aware coherent speaking style prediction with hierarchical transformers for audiobook speech synthesis

S Lei, Y Zhou, L Chen, Z Wu, S Kang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Recent advances in text-to-speech have significantly improved the expressiveness of
synthesized speech. However, it is still challenging to generate speech with contextually …

Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis.

Z Liu, NQ Wu, Y Zhang, Z Ling - INTERSPEECH, 2022 - isca-archive.org
This paper presents a method of integrating word-level style variations (WSVs) into non-
autoregressive acoustic models for speech synthesis. WSVs are discrete latent …