An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
extract either one or more target speech signals, respectively, from a mixture of sounds …
Deep learning-based face super-resolution: A survey
Face super-resolution (FSR), also known as face hallucination, which is aimed at enhancing
the resolution of low-resolution (LR) face images to generate high-resolution face images, is …
the resolution of low-resolution (LR) face images to generate high-resolution face images, is …
Not only look, but also listen: Learning multimodal violence detection under weak supervision
Violence detection has been studied in computer vision for years. However, previous work
are either superficial, eg, classification of short-clips, and the single scenario, or …
are either superficial, eg, classification of short-clips, and the single scenario, or …
Visualvoice: Audio-visual speech separation with cross-modal consistency
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous back-ground sounds …
to extract the speech associated with a face in spite of simultaneous back-ground sounds …
Sound to visual scene generation by audio-to-visual latent alignment
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …
generating an image of a scene from sound. Our method addresses the challenges of …
Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
modalities, has drawn considerable attention since deep learning started to be used …
Audio-driven talking face video generation with learning-based personalized head pose
Real-world talking faces often accompany with natural head movement. However, most
existing talking face video generation methods only consider facial animation with fixed …
existing talking face video generation methods only consider facial animation with fixed …
Voice-face homogeneity tells deepfake
Detecting forgery videos is highly desirable due to the abuse of deepfake. Existing detection
approaches contribute to exploring the specific artifacts in deepfake videos and fit well on …
approaches contribute to exploring the specific artifacts in deepfake videos and fit well on …
Cross-modal relation-aware networks for audio-visual event localization
We address the challenging task of event localization, which requires the machine to
localize an event and recognize its category in unconstrained videos. Most existing methods …
localize an event and recognize its category in unconstrained videos. Most existing methods …
Sound-guided semantic image manipulation
The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …
embedding space can manipulate an image using text information. However, manipulating …