SampleRNN: An unconditional end-to-end neural audio generation model

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 112 Related articles All 6 versions

Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models

S Bond-Taylor, A Leach, Y Long… - IEEE transactions on …, 2021 - ieeexplore.ieee.org

Deep generative models are a class of techniques that train deep neural networks to model
the distribution of training samples. Research has fragmented into various interconnected …

Cited by 487 Related articles All 12 versions

Mamba: Linear-time sequence modeling with selective state spaces

A Gu, T Dao - arXiv preprint arXiv:2312.00752, 2023 - arxiv.org

Foundation models, now powering most of the exciting applications in deep learning, are
almost universally based on the Transformer architecture and its core attention module …

Cited by 687 Related articles All 7 versions

High-fidelity audio compression with improved RVQGAN

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2024 - proceedings.neurips.cc

Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …

Cited by 100 Related articles All 5 versions

SoundStream: An end-to-end neural audio codec

N Zeghidour, A Luebs, A Omran… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

We present SoundStream, a novel neural audio codec that can efficiently compress speech,
music and general audio at bitrates normally targeted by speech-tailored codecs …

Cited by 448 Related articles All 5 versions

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Cited by 379 Related articles All 2 versions

It's raw! audio generation with state-space models

K Goel, A Gu, C Donahue, C Ré - … Conference on Machine …, 2022 - proceedings.mlr.press

Developing architectures suitable for modeling raw audio is a challenging problem due to
the high sampling rates of audio waveforms. Standard sequence modeling approaches like …

Cited by 159 Related articles All 4 versions

DiffWave: A versatile diffusion model for audio synthesis

Z Kong, W Ping, J Huang, K Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional
and unconditional waveform generation. The model is non-autoregressive, and converts the …

Cited by 1133 Related articles All 3 versions

WaveGrad: Estimating gradients for waveform generation

N Chen, Y Zhang, H Zen, RJ Weiss, M Norouzi… - arXiv preprint arXiv …, 2020 - arxiv.org

This paper introduces WaveGrad, a conditional model for waveform generation which
estimates gradients of the data density. The model is built on prior work on score matching …

Cited by 702 Related articles All 6 versions

BigVGAN: A universal neural vocoder with large-scale training

S Lee, W Ping, B Ginsburg, B Catanzaro… - arXiv preprint arXiv …, 2022 - arxiv.org

Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …

Cited by 136 Related articles All 5 versions