Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

E Georgiou, K Kritsis… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Deep learning Text-to-Speech (TTS) systems have achieved impressive generated speech
quality, close to human parity. However, they suffer from training stability issues and in …

Semi-supervised training for improving data efficiency in end-to-end speech synthesis

YA Chung, Y Wang, WN Hsu, Y Zhang… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent
results, they typically require a sizable set of high-quality <text, audio> pairs for training …

Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis

X Zhu, Y Zhang, S Yang, L Xue, L Xie - IEEE Access, 2019 - ieeexplore.ieee.org
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun
to surpass the traditional multi-stage hand-engineered systems, with both simplified system …

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

Y Yasuda, X Wang, J Yamagishi - arXiv preprint arXiv:1908.11535, 2019 - arxiv.org
End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to
output acoustic features using a single network. A recent advance of end-to-end TTS is due …

WaveTTS: Tacotron-based TTS with joint time-frequency domain loss

R Liu, B Sisman, F Bao, G Gao, H Li - arXiv preprint arXiv …, 2020 - researchgate.net
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input.
Such frameworks typically consist of a feature prediction network that maps character …

JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text to speech

D Lim, S Jung, E Kim - arXiv preprint arXiv:2203.16852, 2022 - arxiv.org
In neural text-to-speech (TTS), a two-stage system or a cascade of separately learned models has shown synthesis quality close to human speech. For example, FastSpeech2 transforms …

Parallel Tacotron: Non-autoregressive and controllable TTS

I Elias, H Zen, J Shen, Y Zhang, Y Jia… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Although neural end-to-end text-to-speech models can synthesize highly natural speech,
there is still room for improvement in their efficiency and naturalness. This paper proposes a …

VARA-TTS: Non-autoregressive text-to-speech synthesis based on very deep VAE with residual attention

P Liu, Y Cao, S Liu, N Hu, G Li, C Weng… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model
using a very deep Variational Autoencoder (VDVAE) with Residual Attention mechanism …

DeviceTTS: A small-footprint, fast, stable network for on-device text-to-speech

Z Huang, H Li, M Lei - arXiv preprint arXiv:2010.15311, 2020 - arxiv.org
With the number of smart devices increasing, the demand for on-device text-to-speech (TTS)
is growing rapidly. In recent years, many prominent end-to-end TTS methods have been …

Diff-TTS: A denoising diffusion model for text-to-speech

M Jeong, H Kim, SJ Cheon, BJ Choi, NS Kim - arXiv preprint arXiv …, 2021 - arxiv.org
Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded
in generating human-like speech, there is still room for improvement in their naturalness and …