Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss
E Georgiou, K Kritsis… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Deep learning Text-to-Speech (TTS) systems have achieved impressive generated speech
quality, close to human parity. However, they suffer from training stability issues and in …
Semi-supervised training for improving data efficiency in end-to-end speech synthesis
Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent
results, they typically require a sizable set of high-quality <text, audio> pairs for training …
Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis
Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun
to surpass the traditional multi-stage hand-engineered systems, with both simplified system …
Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments
Y Yasuda, X Wang, J Yamagishi - arXiv preprint arXiv:1908.11535, 2019 - arxiv.org
End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to
output acoustic features using a single network. A recent advance of end-to-end TTS is due …
WaveTTS: Tacotron-based TTS with joint time-frequency domain loss
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input.
Such frameworks typically consist of a feature prediction network that maps character …
JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text to speech
In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models,
have shown synthesis quality close to human speech. For example, FastSpeech2 transforms …
Parallel Tacotron: Non-autoregressive and controllable TTS
Although neural end-to-end text-to-speech models can synthesize highly natural speech,
there is still room for improvement in their efficiency and naturalness. This paper proposes a …
VARA-TTS: Non-autoregressive text-to-speech synthesis based on very deep VAE with residual attention
P Liu, Y Cao, S Liu, N Hu, G Li, C Weng… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model
using a very deep Variational Autoencoder (VDVAE) with Residual Attention mechanism …
DeviceTTS: A small-footprint, fast, stable network for on-device text-to-speech
With the number of smart devices increasing, the demand for on-device text-to-speech (TTS)
increases rapidly. In recent years, many prominent End-to-End TTS methods have been …
Diff-TTS: A denoising diffusion model for text-to-speech
Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded
in generating human-like speech, there is still room for improvement in their naturalness and …