NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Hybrid transformers for music source separation

S Rouard, F Massa, A Défossez - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
A natural question arising in Music Source Separation (MSS) is whether long range
contextual information is useful, or whether local acoustic features are sufficient. In other …

Music source separation with band-split RNN

Y Luo, J Yu - IEEE/ACM Transactions on Audio, Speech, and …, 2023 - ieeexplore.ieee.org
The performance of music source separation (MSS) models has been greatly improved in
recent years thanks to the development of novel neural network architectures and training …

Music demixing challenge 2021

Y Mitsufuji, G Fabbro, S Uhlich, FR Stöter… - Frontiers in Signal …, 2022 - frontiersin.org
Music source separation has been intensively studied in the last decade and tremendous
progress with the advent of deep learning could be observed. Evaluation campaigns such …

Multi-source diffusion models for simultaneous music generation and separation

G Mariani, I Tallini, E Postolache, M Mancusi… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we define a diffusion-based generative model capable of both music synthesis
and source separation by learning the score of the joint probability density of sources …

Towards low-distortion multi-channel speech enhancement: The ESPNet-SE submission to the L3DAS22 challenge

YJ Lu, S Cornell, X Chang, W Zhang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of
speech enhancement with 3D Ambisonic microphones. The core of our approach combines …

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

J Hwang, M Hira, C Chen, X Zhang, Z Ni… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …

Transfer learning of wav2vec 2.0 for automatic lyric transcription

L Ou, X Gu, Y Wang - arXiv preprint arXiv:2207.09747, 2022 - arxiv.org
Automatic speech recognition (ASR) has progressed significantly in recent years due to the
emergence of large-scale datasets and the self-supervised learning (SSL) paradigm …

The Sound Demixing Challenge 2023 – Music Demixing Track

G Fabbro, S Uhlich, CH Lai, W Choi… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge
(SDX'23). We provide a summary of the challenge setup and introduce the task of robust …

AERO: Audio super resolution in the spectral domain

M Mandel, O Tal, Y Adi - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
We present AERO, an audio super-resolution model that processes speech and music
signals in the spectral domain. AERO is based on an encoder-decoder architecture with …