ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

Many-to-many spoken language translation via unified speech and text representation learning with unit-to-unit translation

M Kim, J Choi, D Kim, YM Ro - arXiv preprint arXiv:2308.01831, 2023 - arxiv.org
In this paper, we propose a method to learn unified representations of multilingual speech
and text with a single model, especially focusing on the purpose of speech synthesis. We …

A review on subjective and objective evaluation of synthetic speech

E Cooper, WC Huang, Y Tsao, HM Wang… - Acoustical Science …, 2024 - jstage.jst.go.jp
Evaluating synthetic speech generated by machines is a complicated process, as it involves
judging along multiple dimensions including naturalness, intelligibility, and whether the …

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation

M Kim, J Choi, D Kim, YM Ro - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
This paper proposes a textless training method for many-to-many multilingual
speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based …

Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

T Saeki, S Maiti, X Li, S Watanabe… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) systems have made significant progress in generating natural
synthetic speech. However, neural TTS requires large amounts of paired training data …

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Y Li, A Mehrish, B Chew, B Cheng, S Poria - arXiv preprint arXiv …, 2024 - arxiv.org
Different languages have distinct phonetic systems and vary in their prosodic features,
making it challenging to develop a Text-to-Speech (TTS) model that can effectively …

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

T Saeki, G Wang, N Morioka, I Elias… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Collecting high-quality studio recordings of audio is challenging, which limits the language
coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a …

PRESENT: Zero-Shot Text-to-Prosody Control

P Lam, H Zhang, NF Chen, B Sisman… - arXiv preprint arXiv …, 2024 - arxiv.org
Current strategies for achieving fine-grained prosody control in speech synthesis entail
extracting additional style embeddings or adopting more complex architectures. To enable …

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

C Gong, E Cooper, X Wang, C Qiang, M Geng… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) representations from massively multilingual models offer a
promising solution for low-resource language speech tasks. Despite advancements …

Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment

TH Lo, MT Tsai, B Chen - arXiv preprint arXiv:2409.07151, 2024 - arxiv.org
Second language (L2) learners can improve their pronunciation by imitating golden speech,
especially when the speech aligns with their respective speech characteristics. This …