An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
A better use of audio-visual cues: Dense video captioning with bi-modal transformer
Dense video captioning aims to localize and describe important events in untrimmed videos.
Existing methods mainly tackle this task by exploiting only visual features, while completely …
Taming visually guided sound generation
Recent advances in visually-induced audio generation are based on sampling short, low-
fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the …
Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds
Recent progress in deep learning has enabled many advances in sound separation and
visual scene understanding. However, extracting sound sources which are apparent in …
iQuery: Instruments as queries for audio-visual sound separation
Current audio-visual separation methods share a standard architecture design where an
audio encoder-decoder network is fused with visual encoding features at the encoder …
Recent advances and challenges in deep audio-visual correlation learning
Audio-visual correlation learning aims to capture essential correspondences and
understand natural phenomena between audio and video. With the rapid growth of deep …
LAVSS: Location-guided audio-visual spatial audio separation
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …
Visually guided sound source separation and localization using self-supervised motion representations
In this paper, we perform audio-visual sound source separation, i.e., to separate component
audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint …
V-SlowFast network for efficient visual sound separation
The objective of this paper is to perform visual sound separation: i) we study visual sound
separation on spectrograms of different temporal resolutions; ii) we propose a new light yet …
A cappella: Audio-visual singing voice separation
JF Montesinos, VS Kadandale, G Haro - arXiv preprint arXiv:2104.09946, 2021 - arxiv.org
The task of isolating a target singing voice in music videos has useful applications. In this
work, we explore the single-channel singing voice separation problem from a multimodal …