StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …
Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the
appropriate prosody and emotional inflection within a conversational setting. While …
Towards human-like spoken dialogue generation between AI agents from written dialogue
The advent of large language models (LLMs) has made it possible to generate natural
written dialogues between two agents. However, generating human-like spoken dialogues …
Pheme: Efficient and Conversational Speech Generation
In recent years, speech generation has seen remarkable progress, now achieving one-shot
generation capability that is often virtually indistinguishable from real human voice …
CMCU-CSS: Enhancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis
Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral
communication. However, the complexity of context dependency modeling poses significant …
CONCSS: Contrastive-based context comprehension for dialogue-appropriate prosody in conversational speech synthesis
Y Deng, J Xue, Y Jia, Q Li, Y Han… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary
information with the aim of generating speech that has dialogue-appropriate prosody. While …
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023
In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS)
system designed for the Blizzard Challenge 2023. About 50 hours of audiobook corpus for …
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …
Generative Expressive Conversational Speech Synthesis
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper
speaking style in a user-agent conversation setting. Existing CSS methods employ effective …
Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
R Liu, Z Jia, J Yang, Y Hu, H Li - arXiv preprint arXiv:2410.09524, 2024 - arxiv.org
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the
appropriate style within a conversational setting, which attracts more attention nowadays …