VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with
only a single target-speaker utterance for reference, can be effectively achieved by speech …
Transformers in speech processing: A survey
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …
MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis
Expressive synthetic speech is essential for many human-computer interaction and audio
broadcast scenarios, and thus synthesizing expressive speech has attracted much attention …
Privacy-preserving voice analysis via disentangled representations
Voice User Interfaces (VUIs) are increasingly popular and built into smartphones, home
assistants, and Internet of Things (IoT) devices. Despite offering an always-on convenient …
SYNT++: Utilizing imperfect synthetic data to improve speech recognition
With recent advances in speech synthesis, synthetic data is becoming a viable alternative to
real data for training speech recognition models. However, machine learning with synthetic …
Fine-grained style control in transformer-based text-to-speech synthesis
LW Chen, A Rudnicky - ICASSP 2022-2022 IEEE International …, 2022 - ieeexplore.ieee.org
In this paper, we present a novel architecture to realize fine-grained style control on the
transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the …
Style equalization: Unsupervised learning of controllable generative sequence models
Controllable generative sequence models with the capability to extract and replicate the
style of specific examples enable many applications, including narrating audiobooks in …
Self-supervised context-aware style representation for expressive speech synthesis
Expressive speech synthesis, like audiobook synthesis, is still challenging for style
representation learning and prediction. Deriving from reference audio or predicting style …
Fine-grained style modeling, transfer and prediction in text-to-speech synthesis via phone-level content-style disentanglement
D Tan, T Lee - arXiv preprint arXiv:2011.03943, 2020 - arxiv.org
This paper presents a novel design of neural network system for fine-grained style modeling,
transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling …
Speaker-independent emotional voice conversion via disentangled representations
X Chen, X Xu, J Chen, Z Zhang… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
Emotional Voice Conversion (EVC) technology aims to transfer emotional state in speech
while keeping the linguistic information and speaker identity unchanged. Prior studies on …