Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

A review on multimodal zero‐shot learning

W Cao, Y Wu, Y Sun, H Zhang, J Ren… - … : Data Mining and …, 2023 - Wiley Online Library
Multimodal learning provides a path to fully utilize all types of information related to the
modeling target to provide the model with a global vision. Zero‐shot learning (ZSL) is a …

GAN inversion: A survey

W Xia, Y Zhang, Y Yang, JH Xue… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN
model so that the image can be faithfully reconstructed from the inverted code by the …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Conditional generation of audio from video via foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …

GlueGen: Plug and play multi-modal encoders for X-to-image generation

C Qin, N Yu, C Xing, S Zhang, Z Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-to-image (T2I) models based on diffusion processes have achieved
remarkable success in controllable image generation using user-provided captions …

Chat with the environment: Interactive multimodal perception using large language models

X Zhao, M Li, C Weber, MB Hafez… - 2023 IEEE/RSJ …, 2023 - ieeexplore.ieee.org
Programming robot behavior in a complex world faces challenges on multiple levels, from
dextrous low-level skills to high-level planning and reasoning. Recent pre-trained Large …

Touch and go: Learning from human-collected vision and touch

F Yang, C Ma, J Zhang, J Zhu, W Yuan… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to associate touch with sight is essential for tasks that require physically
interacting with objects in the world. We propose a dataset with paired visual and tactile data …

GAN-based facial attribute manipulation

Y Liu, Q Li, Q Deng, Z Sun… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to
render desired attributes, which has received significant attention due to its broad practical …