ParaTTS: Learning linguistic and prosodic cross-sentence information in paragraph-based TTS
Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-
quality, natural synthesized speech in a conventional sentence-based TTS. However, it is …
CopyCat2: A single model for multi-speaker TTS and many-to-many fine-grained prosody transfer
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing
speech with different speaker identities, b) generating speech with expressive and …
Expressive, variable, and controllable duration modelling in TTS
Duration modelling has become an important research problem once more with the rise of
non-attention neural text-to-speech systems. The current approaches largely fall back to …
Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech
P Oplustil-Gallegos, J O'Mahony… - Proc. 11th ISCA Speech …, 2021 - isca-archive.org
Text alone does not contain sufficient information to predict the spoken form. Using
additional information, such as the linguistic context, should improve Text-to-Speech …
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech
from raw text. However, rendering the correct pitch accents is still a challenging problem for …
Controllable speech synthesis by learning discrete phoneme-level prosodic representations
In this paper, we present a novel method for phoneme-level prosody control of F0 and
duration using intuitive discrete labels. We propose an unsupervised prosodic clustering …
A study of modeling rising intonation in Cantonese neural speech synthesis
In human speech, the attitude of a speaker cannot be fully expressed only by the textual
content. It has to come along with the intonation. Declarative questions are commonly used …
A learned conditional prior for the VAE acoustic space of a TTS system
Many factors influence speech yielding different renditions of a given sentence. Generative
models, such as variational autoencoders (VAEs), capture this variability and allow multiple …
Discourse-level prosody modeling with a variational autoencoder for non-autoregressive expressive speech synthesis
To address the issue of one-to-many mapping from phoneme sequences to acoustic
features in expressive speech synthesis, this paper proposes a method of discourse-level …
eCat: An end-to-end model for multi-speaker TTS & many-to-many fine-grained prosody transfer
We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-
context speech with expressive and contextually appropriate prosody, and b) performing fine …