Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings

E Cooper, CI Lai, Y Yasuda, F Fang… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can
produce good speaker similarity for speakers seen during training, there remains a gap for …

Hi-fi multi-speaker English TTS dataset

E Bakhturina, V Lavrukhin, B Ginsburg… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper introduces a new multi-speaker English dataset for training text-to-speech
models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in …

Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding

S Choi, S Han, D Kim, S Ha - arXiv preprint arXiv:2005.08484, 2020 - arxiv.org
On account of growing demands for personalization, the need for a so-called few-shot TTS system that clones speakers from only a few samples is emerging. To address this issue, we …

From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint

Z Cai, C Zhang, M Li - arXiv preprint arXiv:2005.04587, 2020 - arxiv.org
In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity …

Speaker conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions

D Paul, Y Pantazis, Y Stylianou - arXiv preprint arXiv:2008.05289, 2020 - arxiv.org
Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when …

Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages

H Zhang, Y Lin - arXiv preprint arXiv:2008.04549, 2020 - arxiv.org
Recently, sequence-to-sequence models with attention have been successfully applied to text-to-speech (TTS). These models can generate near-human speech with a large …

Can speaker augmentation improve multi-speaker end-to-end TTS?

E Cooper, CI Lai, Y Yasuda, J Yamagishi - arXiv preprint arXiv …, 2020 - arxiv.org
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in
speaker similarity. We investigate an orthogonal approach to the current speaker adaptation …

Transfer learning, style control, and speaker reconstruction loss for zero-shot multilingual multi-speaker text-to-speech on low-resource languages

K Azizah, W Jatmiko - IEEE Access, 2022 - ieeexplore.ieee.org
Deep neural network (DNN)-based systems generally require large amounts of training data, so they suffer from data scarcity in low-resource languages. Recent studies have …

Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

Y Okamoto, K Imoto, S Takamichi… - … on Signal and …, 2022 - nowpublishers.com
In this paper, we propose a framework for environmental sound synthesis from
onomatopoeic words. As one way of expressing an environmental sound, we can use an …

Effective and direct control of neural TTS prosody by removing interactions between different attributes

X An, FK Soong, S Yang, L Xie - Neural Networks, 2021 - Elsevier
Advances in end-to-end TTS have shown that synthesized speech prosody can be controlled by conditioning the decoder on speech prosody attribute labels. However, to …