VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion

D Wang, L Deng, YT Yeung, X Chen, X Liu… - arXiv preprint arXiv …, 2021 - arxiv.org
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with
only a single target-speaker utterance for reference, can be effectively achieved by speech …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Y Lei, S Yang, X Wang, L Xie - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Expressive synthetic speech is essential for many human-computer interaction and audio
broadcast scenarios, and thus synthesizing expressive speech has attracted much attention …

Privacy-preserving voice analysis via disentangled representations

R Aloufi, H Haddadi, D Boyle - Proceedings of the 2020 ACM SIGSAC …, 2020 - dl.acm.org
Voice User Interfaces (VUIs) are increasingly popular and built into smartphones, home
assistants, and Internet of Things (IoT) devices. Despite offering an always-on convenient …

Synt++: Utilizing imperfect synthetic data to improve speech recognition

TY Hu, M Armandpour, A Shrivastava… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
With recent advances in speech synthesis, synthetic data is becoming a viable alternative to
real data for training speech recognition models. However, machine learning with synthetic …

Fine-grained style control in transformer-based text-to-speech synthesis

LW Chen, A Rudnicky - ICASSP 2022-2022 IEEE International …, 2022 - ieeexplore.ieee.org
In this paper, we present a novel architecture to realize fine-grained style control in
transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the …

Style equalization: Unsupervised learning of controllable generative sequence models

JHR Chang, A Shrivastava, H Koppula… - International …, 2022 - proceedings.mlr.press
Controllable generative sequence models with the capability to extract and replicate the
style of specific examples enable many applications, including narrating audiobooks in …

Self-supervised context-aware style representation for expressive speech synthesis

Y Wu, X Wang, S Zhang, L He, R Song… - arXiv preprint arXiv …, 2022 - arxiv.org
Expressive speech synthesis, such as audiobook synthesis, remains challenging for style
representation learning and prediction. Deriving style from reference audio or predicting style …

Fine-grained style modeling, transfer and prediction in text-to-speech synthesis via phone-level content-style disentanglement

D Tan, T Lee - arXiv preprint arXiv:2011.03943, 2020 - arxiv.org
This paper presents a novel design of neural network system for fine-grained style modeling,
transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling …

Speaker-independent emotional voice conversion via disentangled representations

X Chen, X Xu, J Chen, Z Zhang… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
Emotional Voice Conversion (EVC) technology aims to transfer emotional state in speech
while keeping the linguistic information and speaker identity unchanged. Prior studies on …