An overview of affective speech synthesis and conversion in the deep learning era
A Triantafyllopoulos, BW Schuller… - Proceedings of the …, 2023 - ieeexplore.ieee.org
Speech is the fundamental mode of human communication, and its synthesis has long been
a core priority in human–computer interaction research. In recent years, machines have …
EmoTalk: Speech-driven emotional disentanglement for 3D face animation
Speech-driven 3D face animation aims to generate realistic facial expressions that match
the speech content and emotion. However, existing methods often neglect emotional facial …
PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions
Style transfer TTS has shown impressive performance in recent years. However, style
control is often restricted to systems built on expressive speech recordings with discrete style …
DiCLET-TTS: Diffusion model based cross-lingual emotion transfer for text-to-speech—A study between English and Mandarin
While the performance of cross-lingual TTS based on monolingual corpora has been
significantly improved recently, generating cross-lingual speech still suffers from the foreign …
Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1), which is challenging as L2 is different from L1 in terms …
METTS: Multilingual emotional text-to-speech by cross-speaker and cross-lingual emotion transfer
Previous multilingual text-to-speech (TTS) approaches have considered leveraging
monolingual speaker data to enable cross-lingual speech synthesis. However, such data …
MM-TTS: A unified framework for multimodal, prompt-induced emotional text-to-speech synthesis
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years
due to its potential to enhance human-computer interaction. However, current E-TTS …
Controllable accented text-to-speech synthesis
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1). Accented TTS synthesis is challenging as L2 is different …
Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis
Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from
an arbitrary speech reference in the source language to the synthetic speech in the target …
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong
assumptions about the distributions of the target data space. Aiming to improve those …