Simple and controllable music generation

J Copet, F Kreuk, I Gat, T Remez… - Advances in …, 2024 - proceedings.neurips.cc
We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …

Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

A review of differentiable digital signal processing for music and speech synthesis

B Hayes, J Shier, G Fazekas, A McPherson… - Frontiers in Signal …, 2024 - frontiersin.org
The term “differentiable digital signal processing” describes a family of techniques in which
loss function gradients are backpropagated through digital signal processors, facilitating …

UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models (LMs) have demonstrated the capability to handle a variety of generative
tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

VampNet: Music generation via masked acoustic token modeling

HF Garcia, P Seetharaman, R Kumar… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce VampNet, a masked acoustic token modeling approach to music synthesis,
compression, inpainting, and variation. We use a variable masking schedule during training …

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

Z Du, S Zhang, K Hu, S Zheng - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an
extension of the open-source speech processing toolkit FunASR. FunCodec provides …

Adapting Frechet audio distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …

DITTO: Diffusion inference-time T-optimization for music generation

Z Novack, J McAuley, T Berg-Kirkpatrick… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework
for controlling pre-trained text-to-music diffusion models at inference-time via …

Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …