High-fidelity audio compression with improved RVQGAN

J Copet, F Kreuk, I Gat, T Remez… - Advances in …, 2024 - proceedings.neurips.cc

We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …

Cited by 227 (9 versions)

Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org

Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

Cited by 5 (2 versions)

A review of differentiable digital signal processing for music and speech synthesis

B Hayes, J Shier, G Fazekas, A McPherson… - Frontiers in Signal …, 2024 - frontiersin.org

The term “differentiable digital signal processing” describes a family of techniques in which
loss function gradients are backpropagated through digital signal processors, facilitating …

Cited by 10 (6 versions)

UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org

Language models (LMs) have demonstrated the capability to handle a variety of generative
tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

Cited by 53 (3 versions)

VampNet: Music generation via masked acoustic token modeling

HF Garcia, P Seetharaman, R Kumar… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce VampNet, a masked acoustic token modeling approach to music synthesis,
compression, inpainting, and variation. We use a variable masking schedule during training …

Cited by 39 (5 versions)

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by 28 (4 versions)

FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

Z Du, S Zhang, K Hu, S Zheng - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org

This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an
extension of the open-source speech processing toolkit FunASR. FunCodec provides …

Cited by 18 (3 versions)

Adapting Frechet Audio Distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …

Cited by 22 (3 versions)

DITTO: Diffusion inference-time T-optimization for music generation

Z Novack, J McAuley, T Berg-Kirkpatrick… - arXiv preprint arXiv …, 2024 - arxiv.org

We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework
for controlling pre-trained text-to-music diffusion models at inference-time via …

Cited by 10 (3 versions)

Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …

Cited by 9 (8 versions)