An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

VoViT: Low latency graph-based audio-visual voice separation transformer

JF Montesinos, VS Kadandale, G Haro - European Conference on …, 2022 - Springer
This paper presents an audio-visual approach for voice separation which produces state-of-
the-art results at a low latency in two scenarios: speech and singing voice. The model is …

SRTNet: Time domain speech enhancement via stochastic refinement

Z Qiu, M Fu, Y Yu, LL Yin, F Sun… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Diffusion model, as a new generative model which is very popular in image generation and
audio synthesis, is rarely used in speech enhancement. In this paper, we use the diffusion …

SE-Bridge: Speech enhancement with consistent Brownian bridge

Z Qiu, M Fu, F Sun, G Altenbek, H Huang - arXiv preprint arXiv:2305.13796, 2023 - arxiv.org
We propose SE-Bridge, a novel method for speech enhancement (SE). After recently
applying the diffusion models to speech enhancement, we can achieve speech …

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

T Hu, X Xiang, J Qin, Y Tan - Multimedia Systems, 2023 - Springer
Existing research on audio–text retrieval is limited by the size of the dataset and the structure
of the network, making it difficult to learn the ideal features of audio and text resulting in low …

Audio-visual speech enhancement with a deep Kalman filter generative model

A Golmakani, M Sadeghi… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Deep latent variable generative models based on variational autoencoder (VAE) have
shown promising performance for audio-visual speech enhancement (AVSE). The …

Driver identification using deep generative model with limited data

H Hu, J Liu, G Chen, Y Zhao, Z Gao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The scarcity of driving data constrains the accuracy of deep learning (DL)-based driver
identification methods in practical application scenarios. To address this issue, this study …

Switching variational auto-encoders for noise-agnostic audio-visual speech enhancement

M Sadeghi, X Alameda-Pineda - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
Recently, audio-visual speech enhancement has been tackled in the unsupervised settings
based on variational auto-encoders (VAEs), where during training only clean data is used to …

Public-private Attributes-based Variational Adversarial Network for Audio-Visual Cross-Modal Matching

A Zheng, F Yuan, H Zhang, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Existing audio-visual cross-modal matching methods focus on mitigating cross-modal
heterogeneity but ignore the impact of intra-class discrepancy of the same identity in …

Deep variational generative models for audio-visual speech separation

VN Nguyen, M Sadeghi, E Ricci… - 2021 IEEE 31st …, 2021 - ieeexplore.ieee.org
In this paper, we are interested in audio-visual speech separation given a single-channel
audio recording as well as visual information (lips movements) associated with each …