A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
AudioLDM: Text-to-audio generation with latent diffusion models
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …
Diffsound: Discrete diffusion model for text-to-sound generation
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …
Deblurring via stochastic refinement
Image deblurring is an ill-posed problem with multiple plausible solutions for a given input
image. However, most existing methods produce a deterministic estimate of the clean image …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
Riemannian score-based generative modelling
Score-based generative models (SGMs) are a powerful class of generative models that
exhibit remarkable empirical performance. Score-based generative modelling (SGM) …
BigVGAN: A universal neural vocoder with large-scale training
Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …
AudioLDM 2: Learning holistic audio generation with self-supervised pretraining
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …
Generating visual scenes from touch
An emerging line of work has sought to generate plausible imagery from touch. Existing
approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and …
StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …