A deep learning approaches in text-to-speech system: a systematic review and recent research perspective

Y Kumar, A Koul, C Singh - Multimedia Tools and Applications, 2023 - Springer
Text-to-speech systems (TTS) have come a long way in the last decade and are now a
popular research topic for creating various human-computer interaction systems. Although, a …

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

YA Li, C Han, V Raghavan… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …

InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt

D Yang, S Liu, R Huang, C Weng… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Expressive text-to-speech (TTS) aims to synthesize speech with varying speaking styles to
better reflect human speech patterns. In this study, we attempt to use natural language as a …

DAE-Talker: High fidelity speech-driven talking face generation with diffusion autoencoder

C Du, Q Chen, T He, X Tan, X Chen, K Yu… - Proceedings of the 31st …, 2023 - dl.acm.org
While recent research has made significant progress in speech-driven talking face
generation, the quality of the generated video still lags behind that of real recordings. One …

HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis

SH Lee, SB Kim, JH Lee, E Song… - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system
based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised …

UniCATS: A unified context-aware text-to-speech framework with contextual VQ-diffusion and vocoding

C Du, Y Guo, F Shen, Z Liu, Z Liang, X Chen… - Proceedings of the …, 2024 - ojs.aaai.org
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens,
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …

A vector quantized approach for text to speech synthesis on real-world spontaneous speech

LW Chen, S Watanabe, A Rudnicky - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have
achieved near human-level naturalness. The diversity of human speech, however, often …