Simple and controllable music generation
We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …
Towards audio language modeling - an overview
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …
A review of differentiable digital signal processing for music and speech synthesis
The term “differentiable digital signal processing” describes a family of techniques in which
loss function gradients are backpropagated through digital signal processors, facilitating …
Uniaudio: An audio foundation model toward universal audio generation
Language models (LMs) have demonstrated the capability to handle a variety of generative
tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
Vampnet: Music generation via masked acoustic token modeling
We introduce VampNet, a masked acoustic token modeling approach to music synthesis,
compression, inpainting, and variation. We use a variable masking schedule during training …
Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an
extension of the open-source speech processing toolkit FunASR. FunCodec provides …
Adapting frechet audio distance for generative music evaluation
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …
Ditto: Diffusion inference-time t-optimization for music generation
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework
for controlling pre-trained text-to-music diffusion models at inference-time via …
Towards universal speech discrete tokens: A case study for asr and tts
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …