ParaTTS: Learning linguistic and prosodic cross-sentence information in paragraph-based TTS

L Xue, FK Soong, S Zhang, L Xie - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Recent advancements in neural end-to-end text-to-speech (TTS) models have shown high-
quality, natural synthesized speech in conventional sentence-based TTS. However, it is …

CopyCat2: A single model for multi-speaker TTS and many-to-many fine-grained prosody transfer

S Karlapati, P Karanasou, M Lajszczak… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing
speech with different speaker identities, b) generating speech with expressive and …

Expressive, variable, and controllable duration modelling in TTS

A Abbas, T Merritt, A Moinet, S Karlapati… - arXiv preprint arXiv …, 2022 - arxiv.org
Duration modelling has become an important research problem once more with the rise of
non-attention neural text-to-speech systems. The current approaches largely fall back to …
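
The duration modelling this entry refers to is the explicit per-phoneme duration prediction used by non-attention TTS systems (FastSpeech-style length regulation). The following is a minimal illustrative sketch only, not the cited paper's method; module and variable names are assumptions.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Maps per-phoneme encoder states to a log-duration (in frames) per phoneme."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, phoneme_states):  # (batch, num_phonemes, hidden_dim)
        x = self.conv(phoneme_states.transpose(1, 2)).transpose(1, 2)
        return self.proj(x).squeeze(-1)  # (batch, num_phonemes) log-durations

# At synthesis time, predicted durations are rounded and each phoneme state is
# repeated that many times ("length regulation") to fix the frame-level timeline.
states = torch.randn(1, 12, 256)
log_dur = DurationPredictor()(states)
frames = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
upsampled = torch.repeat_interleave(states, frames[0], dim=1)
```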

[PDF][PDF] Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech

P Oplustil-Gallegos, J O'Mahony… - Proc. 11th ISCA Speech …, 2021 - isca-archive.org
Text alone does not contain sufficient information to predict the spoken form. Using
additional information, such as the linguistic context, should improve Text-to-Speech …
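
To make the comparison concrete, one common way to inject previous-sentence context into a TTS model is to project a fixed-size context embedding (textual, e.g. a pretrained sentence vector, or acoustic, e.g. pooled from the previous utterance's mel-spectrogram) and add it to the phoneme encoder states. This is a hedged sketch of that general pattern, not the paper's exact architecture; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextConditioner(nn.Module):
    def __init__(self, hidden_dim: int = 256, context_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(context_dim, hidden_dim)

    def forward(self, phoneme_states, context_embedding):
        # phoneme_states:    (batch, num_phonemes, hidden_dim) for the current sentence
        # context_embedding: (batch, context_dim) summarising the previous sentence
        ctx = self.proj(context_embedding).unsqueeze(1)  # (batch, 1, hidden_dim)
        return phoneme_states + ctx                      # broadcast over all phonemes

states = torch.randn(2, 15, 256)
prev_sentence_emb = torch.randn(2, 768)  # e.g. a pooled text or acoustic context vector
conditioned = ContextConditioner()(states, prev_sentence_emb)
```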

Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

Y Yasuda, T Toda - IEEE Journal of Selected Topics in Signal …, 2022 - ieeexplore.ieee.org
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech
from raw text. However, rendering the correct pitch accents is still a challenging problem for …

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

N Ellinas, M Christidou, A Vioni, JS Sung… - Speech …, 2023 - Elsevier
In this paper, we present a novel method for phoneme-level prosody control of F0 and
duration using intuitive discrete labels. We propose an unsupervised prosodic clustering …
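
The general idea behind such unsupervised prosodic clustering is to extract per-phoneme prosodic features, cluster them, and treat each phoneme's cluster index as a discrete, human-interpretable prosody label. A rough sketch under assumed feature choices (mean F0 and duration) follows; it is not the cited paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def prosody_labels(mean_f0, duration, n_clusters=5):
    # One row of standardised prosodic features per phoneme, clustered with k-means;
    # the cluster index becomes the discrete prosody label for that phoneme.
    feats = np.stack([mean_f0, duration], axis=1)  # (num_phonemes, 2)
    feats = StandardScaler().fit_transform(feats)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)

mean_f0 = np.random.uniform(80, 300, size=200)    # Hz, one value per phoneme (toy data)
duration = np.random.uniform(0.03, 0.3, size=200) # seconds per phoneme (toy data)
labels = prosody_labels(mean_f0, duration)        # discrete label per phoneme
```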

A study of modeling rising intonation in Cantonese neural speech synthesis

Q Bai, T Ko, Y Zhang - arXiv preprint arXiv:2208.02189, 2022 - arxiv.org
In human speech, the attitude of a speaker cannot be fully expressed by the textual
content alone; it must also be conveyed through intonation. Declarative questions are commonly used …

A learned conditional prior for the VAE acoustic space of a TTS system

P Karanasou, S Karlapati, A Moinet, A Joly… - arXiv preprint arXiv …, 2021 - arxiv.org
Many factors influence speech yielding different renditions of a given sentence. Generative
models, such as variational autoencoders (VAEs), capture this variability and allow multiple …
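
The "learned conditional prior" idea can be illustrated as follows: instead of regularising the VAE latent towards a fixed N(0, I), a small network predicts a prior distribution from the text encoding, and the KL term is computed against that prior; sampling from it at inference yields text-appropriate renditions. This is a hedged sketch with assumed names and shapes, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class ConditionalPrior(nn.Module):
    def __init__(self, text_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 128), nn.Tanh(), nn.Linear(128, 2 * latent_dim)
        )

    def forward(self, text_summary):  # (batch, text_dim)
        mean, log_std = self.net(text_summary).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

# Posterior from a reference encoder (stand-in values here) and prior from the text.
posterior = Normal(torch.zeros(4, 64), torch.ones(4, 64))
prior = ConditionalPrior()(torch.randn(4, 256))
kl = kl_divergence(posterior, prior).sum(-1).mean()  # replaces the KL against N(0, I)
z = prior.rsample()                                  # sample a rendition at inference time
```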

Discourse-level prosody modeling with a variational autoencoder for non-autoregressive expressive speech synthesis

NQ Wu, ZC Liu, ZH Ling - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
To address the issue of one-to-many mapping from phoneme sequences to acoustic
features in expressive speech synthesis, this paper proposes a method of discourse-level …

eCat: An end-to-end model for multi-speaker TTS & many-to-many fine-grained prosody transfer

A Abbas, S Karlapati, B Schnell, P Karanasou… - arXiv preprint arXiv …, 2023 - arxiv.org
We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-
context speech with expressive and contextually appropriate prosody, and b) performing fine …