An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

A better use of audio-visual cues: Dense video captioning with bi-modal transformer

V Iashin, E Rahtu - arXiv preprint arXiv:2005.08271, 2020 - arxiv.org
Dense video captioning aims to localize and describe important events in untrimmed videos.
Existing methods mainly tackle this task by exploiting only visual features, while completely …

Taming visually guided sound generation

V Iashin, E Rahtu - arXiv preprint arXiv:2110.08791, 2021 - arxiv.org
Recent advances in visually-induced audio generation are based on sampling short,
low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the …

Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds

E Tzinis, S Wisdom, A Jansen, S Hershey… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent progress in deep learning has enabled many advances in sound separation and
visual scene understanding. However, extracting sound sources which are apparent in …

iQuery: Instruments as queries for audio-visual sound separation

J Chen, R Zhang, D Lian, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Current audio-visual separation methods share a standard architecture design where an
audio encoder-decoder network is fused with visual encoding features at the encoder …

Recent advances and challenges in deep audio-visual correlation learning

L Vilaça, Y Yu, P Viana - arXiv preprint arXiv:2202.13673, 2022 - arxiv.org
Audio-visual correlation learning aims to capture essential correspondences and
understand natural phenomena between audio and video. With the rapid growth of deep …

LAVSS: Location-guided audio-visual spatial audio separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Existing machine learning research has achieved promising results in monaural
audio-visual separation (MAVS). However, most MAVS methods purely consider what the sound …

Visually guided sound source separation and localization using self-supervised motion representations

L Zhu, E Rahtu - Proceedings of the IEEE/CVF Winter …, 2022 - openaccess.thecvf.com
In this paper, we perform audio-visual sound source separation, i.e., to separate component
audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint …

V-SlowFast network for efficient visual sound separation

L Zhu, E Rahtu - Proceedings of the IEEE/CVF Winter …, 2022 - openaccess.thecvf.com
The objective of this paper is to perform visual sound separation: i) we study visual sound
separation on spectrograms of different temporal resolutions; ii) we propose a new light yet …

A cappella: Audio-visual singing voice separation

JF Montesinos, VS Kadandale, G Haro - arXiv preprint arXiv:2104.09946, 2021 - arxiv.org
The task of isolating a target singing voice in music videos has useful applications. In this
work, we explore the single-channel singing voice separation problem from a multimodal …