Prosodic representation learning and contextual sampling for neural text-to-speech

L Xue, FK Soong, S Zhang, L Xie - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org

Recent advancements in neural end-to-end text-to-speech (TTS) models have shown
high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is …

Cited by 24 - Related articles - All 4 versions

CopyCat2: A single model for multi-speaker TTS and many-to-many fine-grained prosody transfer

S Karlapati, P Karanasou, M Lajszczak… - arXiv preprint arXiv …, 2022 - arxiv.org

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing
speech with different speaker identities, b) generating speech with expressive and …

Cited by 15 - Related articles - All 6 versions

Expressive, variable, and controllable duration modelling in TTS

A Abbas, T Merritt, A Moinet, S Karlapati… - arXiv preprint arXiv …, 2022 - arxiv.org

Duration modelling has become an important research problem once more with the rise of
non-attention neural text-to-speech systems. The current approaches largely fall back to …

Cited by 14 - Related articles - All 9 versions

[PDF] Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech

P Oplustil-Gallegos, J O'Mahony… - Proc. 11th ISCA Speech …, 2021 - isca-archive.org

Text alone does not contain sufficient information to predict the spoken form. Using
additional information, such as the linguistic context, should improve Text-to-Speech …

Cited by 14 - Related articles - All 4 versions

Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

Y Yasuda, T Toda - IEEE Journal of Selected Topics in Signal …, 2022 - ieeexplore.ieee.org

End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech
from raw text. However, rendering the correct pitch accents is still a challenging problem for …

Cited by 6 - Related articles - All 3 versions

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

N Ellinas, M Christidou, A Vioni, JS Sung… - Speech …, 2023 - Elsevier

In this paper, we present a novel method for phoneme-level prosody control of F0 and
duration using intuitive discrete labels. We propose an unsupervised prosodic clustering …

Cited by 5 - Related articles - All 4 versions

A study of modeling rising intonation in Cantonese neural speech synthesis

Q Bai, T Ko, Y Zhang - arXiv preprint arXiv:2208.02189, 2022 - arxiv.org

In human speech, the attitude of a speaker cannot be fully expressed only by the textual
content. It has to come along with the intonation. Declarative questions are commonly used …

Cited by 4 - Related articles - All 7 versions

A learned conditional prior for the VAE acoustic space of a TTS system

P Karanasou, S Karlapati, A Moinet, A Joly… - arXiv preprint arXiv …, 2021 - arxiv.org

Many factors influence speech yielding different renditions of a given sentence. Generative
models, such as variational autoencoders (VAEs), capture this variability and allow multiple …

Cited by 9 - Related articles - All 11 versions

Discourse-level prosody modeling with a variational autoencoder for non-autoregressive expressive speech synthesis

NQ Wu, ZC Liu, ZH Ling - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org

To address the issue of one-to-many mapping from phoneme sequences to acoustic
features in expressive speech synthesis, this paper proposes a method of discourse-level …

Cited by 4 - Related articles

eCat: An end-to-end model for multi-speaker TTS & many-to-many fine-grained prosody transfer

A Abbas, S Karlapati, B Schnell, P Karanasou… - arXiv preprint arXiv …, 2023 - arxiv.org

We present eCat, a novel end-to-end multispeaker model capable of: a) generating
long-context speech with expressive and contextually appropriate prosody, and b) performing fine …

Cited by 4 - Related articles - All 6 versions