A deep learning approaches in text-to-speech system: a systematic review and recent research perspective
Text-to-speech systems (TTS) have come a long way in the last decade and are now a
popular research topic for creating various human-computer interaction systems. Although, a …
popular research topic for creating various human-computer interaction systems. Although, a …
Neural codec language models are zero-shot text to speech synthesizers
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called Vall-E) using discrete codes derived from …
we train a neural codec language model (called Vall-E) using discrete codes derived from …
Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
important to capture the diversity in human speech such as speaker identities, prosodies …
Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …
diffusion and adversarial training with large speech language models (SLMs) to achieve …
Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt
Expressive text-to-speech (TTS) aims to synthesize speech with varying speaking styles to
better reflect human speech patterns. In this study, we attempt to use natural language as a …
better reflect human speech patterns. In this study, we attempt to use natural language as a …
Dae-talker: High fidelity speech-driven talking face generation with diffusion autoencoder
While recent research has made significant progress in speech-driven talking face
generation, the quality of the generated video still lags behind that of real recordings. One …
generation, the quality of the generated video still lags behind that of real recordings. One …
HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis
This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system
based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised …
based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised …
UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens,
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …
A vector quantized approach for text to speech synthesis on real-world spontaneous speech
Abstract Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have
achieved near human-level naturalness. The diversity of human speech, however, often …
achieved near human-level naturalness. The diversity of human speech, however, often …