ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …
Many-to-many spoken language translation via unified speech and text representation learning with unit-to-unit translation
In this paper, we propose a method to learn unified representations of multilingual speech
and text with a single model, especially focusing on the purpose of speech synthesis. We …
A review on subjective and objective evaluation of synthetic speech
Evaluating synthetic speech generated by machines is a complicated process, as it involves
judging along multiple dimensions including naturalness, intelligibility, and whether the …
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation
This paper proposes a textless training method for many-to-many multilingual speech-to-
speech translation that can also benefit the transfer of pre-trained knowledge to text-based …
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis
Neural text-to-speech (TTS) systems have made significant progress in generating natural
synthetic speech. However, neural TTS requires large amounts of paired training data …
Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
Different languages have distinct phonetic systems and vary in their prosodic features
making it challenging to develop a Text-to-Speech (TTS) model that can effectively …
Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
Collecting high-quality studio recordings of audio is challenging, which limits the language
coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a …
PRESENT: Zero-Shot Text-to-Prosody Control
Current strategies for achieving fine-grained prosody control in speech synthesis entail
extracting additional style embeddings or adopting more complex architectures. To enable …
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Self-supervised learning (SSL) representations from massively multilingual models offer a
promising solution for low-resource language speech tasks. Despite advancements …
Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Second language (L2) learners can improve their pronunciation by imitating golden speech,
especially when the speech that aligns with their respective speech characteristics. This …