MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis
Expressive synthetic speech is essential for many human-computer interaction and audio
broadcast scenarios, and thus synthesizing expressive speech has attracted much attention …
Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
H Barakat, O Turk, C Demiroglu - EURASIP Journal on Audio, Speech, and …, 2024 - Springer
Speech synthesis has made significant strides thanks to the transition from machine learning
to deep learning models. Contemporary text-to-speech (TTS) models possess the capability …
MSStyleTTS: Multi-scale style modeling with hierarchical context information for expressive speech synthesis
Expressive speech synthesis is crucial for many human-computer interaction scenarios,
such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting the …
Dynamic Invariant‐Specific Representation Fusion Network for Multimodal Sentiment Analysis
J He, H Yang, C Zhang, H Chen… - Computational …, 2022 - Wiley Online Library
Multimodal sentiment analysis (MSA) aims to infer emotions from linguistic, auditory, and
visual sequences. Multimodal information representation method and fusion technology are …
Towards expressive speaking style modelling with hierarchical context information for mandarin speech synthesis
Previous works on expressive speech synthesis mainly focus on the current sentence. The
context in adjacent sentences is neglected, resulting in inflexible speaking style for the same …
Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis
When humans speak multiple utterances in a continuous manner, the prosodic features
generated in each utterance are related to those in its neighbouring utterances. Such cross …
MSM-VC: high-fidelity source style transfer for non-parallel voice conversion by multi-scale style modeling
In addition to conveying the linguistic content from source speech to converted speech,
maintaining the speaking style of source speech also plays an important role in the voice …
Unsupervised multi-scale expressive speaking style modeling with hierarchical context information for audiobook speech synthesis
Naturalness and expressiveness are crucial for audiobook speech synthesis, but are currently
limited by the averaged global-scale speaking style representation. In this paper, we …
Context-aware coherent speaking style prediction with hierarchical transformers for audiobook speech synthesis
Recent advances in text-to-speech have significantly improved the expressiveness of
synthesized speech. However, it is still challenging to generate speech with contextually …
[PDF] Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis
This paper presents a method of integrating word-level style variations (WSVs) into non-
autoregressive acoustic models for speech synthesis. WSVs are discrete latent …