A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
learning. The use of multiple processing layers has enabled the creation of models capable …
Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models
Deep generative models are a class of techniques that train deep neural networks to model
the distribution of training samples. Research has fragmented into various interconnected …
the distribution of training samples. Research has fragmented into various interconnected …
Mamba: Linear-time sequence modeling with selective state spaces
Foundation models, now powering most of the exciting applications in deep learning, are
almost universally based on the Transformer architecture and its core attention module …
almost universally based on the Transformer architecture and its core attention module …
High-fidelity audio compression with improved rvqgan
Abstract Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …
images, speech, and music. A key component of these models is a high quality neural …
Soundstream: An end-to-end neural audio codec
We present SoundStream, a novel neural audio codec that can efficiently compress speech,
music and general audio at bitrates normally targeted by speech-tailored codecs …
music and general audio at bitrates normally targeted by speech-tailored codecs …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
speech given text, is a hot research topic in speech, language, and machine learning …
It's raw! audio generation with state-space models
Developing architectures suitable for modeling raw audio is a challenging problem due to
the high sampling rates of audio waveforms. Standard sequence modeling approaches like …
the high sampling rates of audio waveforms. Standard sequence modeling approaches like …
Diffwave: A versatile diffusion model for audio synthesis
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional
and unconditional waveform generation. The model is non-autoregressive, and converts the …
and unconditional waveform generation. The model is non-autoregressive, and converts the …
Wavegrad: Estimating gradients for waveform generation
This paper introduces WaveGrad, a conditional model for waveform generation which
estimates gradients of the data density. The model is built on prior work on score matching …
estimates gradients of the data density. The model is built on prior work on score matching …
Bigvgan: A universal neural vocoder with large-scale training
Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …
the model generates raw waveform conditioned on acoustic features, it is challenging to …