An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
Neural target speech extraction: An overview
K Zmolikova, M Delcroix, T Ochiai… - IEEE Signal …, 2023 - ieeexplore.ieee.org
Humans can listen to a target speaker even in challenging acoustic conditions that have
noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail …
Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
ADL-MVDR: All deep learning MVDR beamformer for target speech separation
Speech separation algorithms are often used to separate the target speech from other
interfering sources. However, purely neural network based speech separation systems often …
Reading to listen at the cocktail party: Multi-modal speech separation
The goal of this paper is speech separation and enhancement in multi-speaker and noisy
environments using a combination of different modalities. Previous works have shown good …
DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement
Single-channel speech enhancement (SE) is an important task in speech processing. A
widely used framework combines an analysis/synthesis filterbank with a mask prediction …
Audio-visual end-to-end multi-channel speech separation, dereverberation and recognition
Accurate recognition of cocktail party speech containing overlapping speakers, noise and
reverberation remains a highly challenging task to date. Motivated by the invariance of …
Complex neural spatial filter: Enhancing multi-channel target speech separation in complex domain
To date, mainstream target speech separation (TSS) approaches are formulated to estimate
the complex ratio mask (cRM) of target speech in time-frequency domain under supervised …
X-tf-gridnet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion
F Hao, X Li, C Zheng - Information Fusion, 2024 - Elsevier
Target speaker extraction (TSE) which has the capability to directly extract desired speech
given enrollment utterances of the target speaker has attracted more and more attention for …
Generalized spatio-temporal RNN beamformer for target speech separation
Although the conventional mask-based minimum variance distortionless response (MVDR)
could reduce the non-linear distortion, the residual noise level of the MVDR separated …