DiCLET-TTS: Diffusion model based cross-lingual emotion transfer for text-to-speech—A study between English and Mandarin

T Li, C Hu, J Cong, X Zhu, J Li, Q Tian… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
While the performance of cross-lingual TTS based on monolingual corpora has been
significantly improved recently, generating cross-lingual speech still suffers from the foreign …

Exploring the role of language families for building Indic speech synthesisers

A Prakash, HA Murthy - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org
Building end-to-end speech synthesisers for Indian languages is challenging, given the lack
of adequate clean training data and multiple grapheme representations across languages …

Unify and conquer: How phonetic feature representation affects polyglot text-to-speech (TTS)

A Sanchez, A Falai, Z Zhang, O Angelini… - arXiv preprint arXiv …, 2022 - arxiv.org
An essential design decision for multilingual Neural Text-To-Speech (NTTS) systems is how
to represent input linguistic features within the model. Looking at the wide variety of …

Mix and match: an empirical study on training corpus composition for polyglot text-to-speech (TTS)

Z Zhang, A Falai, A Sanchez, O Angelini… - arXiv preprint arXiv …, 2022 - arxiv.org
Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual
corpora has emerged as a popular way for building voice cloning based Polyglot NTTS …

Cross-lingual style transfer with conditional prior VAE and style loss

D Ratcliffe, Y Wang, A Mansbridge, P Karanasou… - 2022 - amazon.science
In this work we improve the style representation for crosslingual style transfer. Specifically,
we improve the Spanish representation across four styles, Newscaster, DJ, Excited, and …

Exploring timbre disentanglement in non-autoregressive cross-lingual text-to-speech

H Zhan, X Yu, H Zhang, Y Zhang, Y Lin - arXiv preprint arXiv:2110.07192, 2021 - arxiv.org
In this paper, we study the disentanglement of speaker and language representations in
non-autoregressive cross-lingual TTS models from various aspects. We propose a phoneme …

Speech generation for indigenous language education

A Pine, E Cooper, D Guzmán, E Joanis… - Computer Speech & …, 2025 - Elsevier
As the quality of contemporary speech synthesis improves, so too does the interest from
language communities in developing text-to-speech (TTS) systems for a variety of real-world …

Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis

C Tånnander, S Mehta, J Beskow… - Proc. Interspeech …, 2024 - isca-archive.org
We introduce continuous phonological features as input to TTS with the dual objective of
more precise control over phonological aspects and better potential for exploration of latent …

Few-shot cross-lingual TTS using transferable phoneme embedding

WP Huang, PC Chen, SF Huang, H Lee - arXiv preprint arXiv:2206.15427, 2022 - arxiv.org
This paper studies a transferable phoneme embedding framework that aims to deal with the
cross-lingual text-to-speech (TTS) problem under the few-shot setting. Transfer learning is a …

Self-supervised learning for robust voice cloning

K Klapsas, N Ellinas, K Nikitaras… - arXiv preprint arXiv …, 2022 - arxiv.org
Voice cloning is a difficult task which requires robust and informative features incorporated
in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our …