StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

YA Li, C Han, V Raghavan… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …

Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling

R Liu, Y Hu, Y Ren, X Yin, H Li - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the
appropriate prosody and emotional inflection within a conversational setting. While …

Towards human-like spoken dialogue generation between AI agents from written dialogue

K Mitsui, Y Hono, K Sawada - arXiv preprint arXiv:2310.01088, 2023 - arxiv.org
The advent of large language models (LLMs) has made it possible to generate natural
written dialogues between two agents. However, generating human-like spoken dialogues …

Pheme: Efficient and Conversational Speech Generation

P Budzianowski, T Sereda, T Cichy, I Vulić - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, speech generation has seen remarkable progress, now achieving one-shot
generation capability that is often virtually indistinguishable from real human voice …

CMCU-CSS: Enhancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis

Y Deng, J Xue, F Wang, Y Gao, Y Li - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral
communication. However, the complexity of context dependency modeling poses significant …

CONCSS: Contrastive-based context comprehension for dialogue-appropriate prosody in conversational speech synthesis

Y Deng, J Xue, Y Jia, Q Li, Y Han… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary
information with the aim of generating speech that has dialogue-appropriate prosody. While …

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

Z Xu, S Zhang, X Wang, J Zhang, W Wei, L He… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS)
system designed for the Blizzard Challenge 2023. About 50 hours of audiobook corpus for …

PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

ZC Liu, L Chen, YJ Hu, ZH Ling… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
This paper investigates leveraging large-scale untranscribed speech data to enhance the
prosody modelling capability of text-to-speech (TTS) models. On the basis of the self …

Generative Expressive Conversational Speech Synthesis

R Liu, Y Hu, R Yi, Y Xiang, H Li - arXiv preprint arXiv:2407.21491, 2024 - arxiv.org
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper
speaking style in a user-agent conversation setting. Existing CSS methods employ effective …

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

R Liu, Z Jia, J Yang, Y Hu, H Li - arXiv preprint arXiv:2410.09524, 2024 - arxiv.org
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the
appropriate style within a conversational setting, and has attracted increasing attention recently …