NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …
FastSpeech 2: Fast and high-quality end-to-end text to speech
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …
ProDiff: Progressive fast diffusion model for high-quality text-to-speech
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process …
FastSpeech: Fast, robust and controllable text to speech
Neural network based end-to-end text to speech (TTS) has significantly improved the quality
of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel …
AdaSpeech: Adaptive text to speech for custom voice
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …
PromptTTS: Controllable text-to-speech with text descriptions
Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or
DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this …
PortaSpeech: Portable and high-quality generative text-to-speech
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS
can synthesize high-quality speech from the given text in parallel. After analyzing two kinds …
LRSpeech: Extremely low-resource speech synthesis and recognition
Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR)
are important speech tasks, and require a large amount of text and speech pairs for model …