NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. A natural question is whether a TTS system can achieve human-level …

FastSpeech 2: Fast and high-quality end-to-end text to speech

Y Ren, C Hu, X Tan, T Qin, S Zhao, Z Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …

ProDiff: Progressive fast diffusion model for high-quality text-to-speech

R Huang, Z Zhao, H Liu, J Liu, C Cui… - Proceedings of the 30th …, 2022 - dl.acm.org
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process …

FastSpeech: Fast, robust and controllable text to speech

Y Ren, Y Ruan, X Tan, T Qin, S Zhao… - Advances in neural …, 2019 - proceedings.neurips.cc
Neural network based end-to-end text to speech (TTS) has significantly improved the quality
of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel …

AdaSpeech: Adaptive text to speech for custom voice

M Chen, X Tan, B Li, Y Liu, T Qin, S Zhao… - arXiv preprint arXiv …, 2021 - arxiv.org
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …

PromptTTS: Controllable text-to-speech with text descriptions

Z Guo, Y Leng, Y Wu, S Zhao… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or
DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this …

PortaSpeech: Portable and high-quality generative text-to-speech

Y Ren, J Liu, Z Zhao - Advances in Neural Information …, 2021 - proceedings.neurips.cc
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS
can synthesize high-quality speech from the given text in parallel. After analyzing two kinds …

LRSpeech: Extremely low-resource speech synthesis and recognition

J Xu, X Tan, Y Ren, T Qin, J Li, S Zhao… - Proceedings of the 26th …, 2020 - dl.acm.org
Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR)
are important speech tasks, and require a large amount of text and speech pairs for model …