Learning to speak from text: Zero-shot multilingual text-to-speech with unsupervised text...

Zmm-tts: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

Cited by 16 Related articles All 2 versions

Many-to-many spoken language translation via unified speech and text representation learning with unit-to-unit translation

M Kim, J Choi, D Kim, YM Ro - arXiv preprint arXiv:2308.01831, 2023 - arxiv.org

In this paper, we propose a method to learn unified representations of multilingual speech
and text with a single model, especially focusing on the purpose of speech synthesis. We …

Cited by 17 Related articles All 2 versions

A review on subjective and objective evaluation of synthetic speech

E Cooper, WC Huang, Y Tsao, HM Wang… - Acoustical Science …, 2024 - jstage.jst.go.jp

Evaluating synthetic speech generated by machines is a complicated process, as it involves
judging along multiple dimensions including naturalness, intelligibility, and whether the …

Cited by 17 Related articles

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation

M Kim, J Choi, D Kim, YM Ro - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org

This paper proposes a textless training method for many-to-many multilingual
speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based …

Cited by 1 Related articles All 3 versions

Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

T Saeki, S Maiti, X Li, S Watanabe… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Neural text-to-speech (TTS) systems have made significant progress in generating natural
synthetic speech. However, neural TTS requires large amounts of paired training data …

Cited by 4 Related articles All 6 versions

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Y Li, A Mehrish, B Chew, B Cheng, S Poria - arXiv preprint arXiv …, 2024 - arxiv.org

Different languages have distinct phonetic systems and vary in their prosodic features,
making it challenging to develop a Text-to-Speech (TTS) model that can effectively …

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

T Saeki, G Wang, N Morioka, I Elias… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

Collecting high-quality studio recordings of audio is challenging, which limits the language
coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a …

Cited by 9 Related articles All 3 versions

PRESENT: Zero-Shot Text-to-Prosody Control

P Lam, H Zhang, NF Chen, B Sisman… - arXiv preprint arXiv …, 2024 - arxiv.org

Current strategies for achieving fine-grained prosody control in speech synthesis entail
extracting additional style embeddings or adopting more complex architectures. To enable …

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

C Gong, E Cooper, X Wang, C Qiang, M Geng… - arXiv preprint arXiv …, 2024 - arxiv.org

Self-supervised learning (SSL) representations from massively multilingual models offer a
promising solution for low-resource language speech tasks. Despite advancements …

Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment

TH Lo, MT Tsai, B Chen - arXiv preprint arXiv:2409.07151, 2024 - arxiv.org

Second language (L2) learners can improve their pronunciation by imitating golden speech,
especially when the speech aligns with their respective speech characteristics. This …