Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

A review on multimodal zero‐shot learning

W Cao, Y Wu, Y Sun, H Zhang, J Ren… - … : Data Mining and …, 2023 - Wiley Online Library
Multimodal learning provides a path to fully utilize all types of information related to the
modeling target to provide the model with a global vision. Zero‐shot learning (ZSL) is a …

GAN inversion: A survey

W Xia, Y Zhang, Y Yang, JH Xue… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN
model so that the image can be faithfully reconstructed from the inverted code by the …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Conditional generation of audio from video via foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …

GlueGen: Plug and play multi-modal encoders for X-to-image generation

C Qin, N Yu, C Xing, S Zhang, Z Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-to-image (T2I) models based on diffusion processes have achieved
remarkable success in controllable image generation using user-provided captions …

Chat with the environment: Interactive multimodal perception using large language models

X Zhao, M Li, C Weber, MB Hafez… - 2023 IEEE/RSJ …, 2023 - ieeexplore.ieee.org
Programming robot behavior in a complex world faces challenges on multiple levels, from
dextrous low-level skills to high-level planning and reasoning. Recent pre-trained Large …

Touch and go: Learning from human-collected vision and touch

F Yang, C Ma, J Zhang, J Zhu, W Yuan… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to associate touch with sight is essential for tasks that require physically
interacting with objects in the world. We propose a dataset with paired visual and tactile data …

GAN-based facial attribute manipulation

Y Liu, Q Li, Q Deng, Z Sun… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to
render desired attributes, which has received significant attention due to its broad practical …