NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
SongCreator: Lyrics-based Universal Song Generation
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …
FlashSpeech: Efficient Zero-Shot Speech Synthesis
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
Recent AIGC systems possess the capability to generate digital multimedia content based
on human language instructions, such as text, image and video. However, when it comes to …
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple
Scene Speech Dataset), which is intended to provide resources for expressive speech …