Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings

E Cooper, CI Lai, Y Yasuda, F Fang… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can
produce good speaker similarity for speakers seen during training, there remains a gap for …

Hi-fi multi-speaker English TTS dataset

E Bakhturina, V Lavrukhin, B Ginsburg… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper introduces a new multi-speaker English dataset for training text-to-speech
models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in …

Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding

S Choi, S Han, D Kim, S Ha - arXiv preprint arXiv:2005.08484, 2020 - arxiv.org
On account of growing demands for personalization, the need for a so-called few-shot TTS system that clones speakers from only a few samples is emerging. To address this issue, we …

From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint

Z Cai, C Zhang, M Li - arXiv preprint arXiv:2005.04587, 2020 - arxiv.org
In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity …

Speaker conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions

D Paul, Y Pantazis, Y Stylianou - arXiv preprint arXiv:2008.05289, 2020 - arxiv.org
Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when …

Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages

H Zhang, Y Lin - arXiv preprint arXiv:2008.04549, 2020 - arxiv.org
Recently, sequence-to-sequence models with attention have been successfully applied to text-to-speech (TTS). These models can generate near-human speech with a large …

Can speaker augmentation improve multi-speaker end-to-end TTS?

E Cooper, CI Lai, Y Yasuda, J Yamagishi - arXiv preprint arXiv …, 2020 - arxiv.org
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in
speaker similarity. We investigate an orthogonal approach to the current speaker adaptation …

Transfer learning, style control, and speaker reconstruction loss for zero-shot multilingual multi-speaker text-to-speech on low-resource languages

K Azizah, W Jatmiko - IEEE Access, 2022 - ieeexplore.ieee.org
Deep neural network (DNN)-based systems generally require large amounts of training data, so they suffer from data scarcity in low-resource languages. Recent studies have …

Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

Y Okamoto, K Imoto, S Takamichi… - … on Signal and …, 2022 - nowpublishers.com
In this paper, we propose a framework for environmental sound synthesis from
onomatopoeic words. As one way of expressing an environmental sound, we can use an …

Effective and direct control of neural TTS prosody by removing interactions between different attributes

X An, FK Soong, S Yang, L Xie - Neural Networks, 2021 - Elsevier
Advances in end-to-end TTS have shown that synthesized speech prosody can be controlled by conditioning the decoder on speech prosody attribute labels. However, to …