A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Recent developments on espnet toolkit boosted by conformer

P Guo, F Boyer, X Chang, T Hayashi… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
In this study, we present recent developments on ESPnet: End-to-End Speech Processing
toolkit, which mainly involves a recently proposed architecture called Conformer …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

High-fidelity audio compression with improved rvqgan

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2024 - proceedings.neurips.cc
Abstract Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

E Kharitonov, D Vincent, Z Borsos… - Transactions of the …, 2023 - direct.mit.edu
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained
with minimal supervision. By combining two types of discrete speech representations, we …

Audiogpt: Understanding and generating speech, music, sound, and talking head

R Huang, M Li, D Yang, J Shi, X Chang, Z Ye… - Proceedings of the …, 2024 - ojs.aaai.org
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …

Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone

E Casanova, J Weber, CD Shulby… - International …, 2022 - proceedings.mlr.press
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker
TTS. Our method builds upon the VITS model and adds several novel modifications for zero …

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

J Kim, J Kong, J Son - International Conference on Machine …, 2021 - proceedings.mlr.press
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and
parallel sampling have been proposed, but their sample quality does not match that of two …

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …