NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
Hybrid transformers for music source separation
A natural question arising in Music Source Separation (MSS) is whether long range
contextual information is useful, or whether local acoustic features are sufficient. In other …
Music source separation with band-split RNN
The performance of music source separation (MSS) models has been greatly improved in
recent years thanks to the development of novel neural network architectures and training …
Music demixing challenge 2021
Music source separation has been intensively studied in the last decade and tremendous
progress with the advent of deep learning could be observed. Evaluation campaigns such …
Multi-source diffusion models for simultaneous music generation and separation
In this work, we define a diffusion-based generative model capable of both music synthesis
and source separation by learning the score of the joint probability density of sources …
Towards low-distortion multi-channel speech enhancement: The ESPNet-SE submission to the L3DAS22 challenge
This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of
speech enhancement with 3D Ambisonic microphones. The core of our approach combines …
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …
Transfer learning of wav2vec 2.0 for automatic lyric transcription
Automatic speech recognition (ASR) has progressed significantly in recent years due to the
emergence of large-scale datasets and the self-supervised learning (SSL) paradigm …
The Sound Demixing Challenge 2023 – Music Demixing Track
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge
(SDX'23). We provide a summary of the challenge setup and introduce the task of robust …
AERO: Audio super resolution in the spectral domain
We present AERO, an audio super-resolution model that processes speech and music
signals in the spectral domain. AERO is based on an encoder-decoder architecture with …