Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can
produce good speaker similarity for speakers seen during training, there remains a gap for …
Hi-fi multi-speaker English TTS dataset
E Bakhturina, V Lavrukhin, B Ginsburg… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper introduces a new multi-speaker English dataset for training text-to-speech
models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in …
Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding
On account of growing demands for personalization, the need for a so-called few-shot TTS
system that clones speakers with only a few data is emerging. To address this issue, we …
From speaker verification to multispeaker speech synthesis, deep transfer with feedback constraint
High-fidelity speech can be synthesized by end-to-end text-to-speech models in recent
years. However, accessing and controlling speech attributes such as speaker identity …
Speaker conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions
Recent advancements in deep learning led to human-level performance in single-speaker
speech synthesis. However, there are still limitations in terms of speech quality when …
Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages
Recently, sequence-to-sequence models with attention have been successfully applied in
Text-to-speech (TTS). These models can generate near-human speech with a large …
Can speaker augmentation improve multi-speaker end-to-end TTS?
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in
speaker similarity. We investigate an orthogonal approach to the current speaker adaptation …
Transfer learning, style control, and speaker reconstruction loss for zero-shot multilingual multi-speaker text-to-speech on low-resource languages
Deep neural network (DNN)-based systems generally require large amounts of training
data, so they have data scarcity problems in low-resource languages. Recent studies have …
Onoma-to-wave: Environmental sound synthesis from onomatopoeic words
Y Okamoto, K Imoto, S Takamichi… - … on Signal and …, 2022 - nowpublishers.com
In this paper, we propose a framework for environmental sound synthesis from
onomatopoeic words. As one way of expressing an environmental sound, we can use an …
Effective and direct control of neural TTS prosody by removing interactions between different attributes
End-to-end TTS advancement has shown that synthesized speech prosody can be
controlled by conditioning the decoder with speech prosody attribute labels. However, to …