NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

SongCreator: Lyrics-based Universal Song Generation

S Lei, Y Zhou, B Tang, MWY Lam, F Liu, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

S Ji, J Zuo, M Fang, S Zheng, Q Chen, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Y Zhou, X Qin, Z Jin, S Zhou, S Lei, S Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent AIGC systems possess the capability to generate digital multimedia content based
on human language instructions, such as text, image and video. However, when it comes to …

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Q Yang, J Zuo, Z Su, Z Jiang, M Li, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple
Scene Speech Dataset), which is intended to provide resources for expressive speech …