Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
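The core mechanic here is that every non-image modality is contrastively aligned to the image embedding space using only image-paired data, which makes the other modalities comparable to one another without ever being paired directly. A minimal sketch of that image-anchored alignment follows; the symmetric InfoNCE loss is standard, but the linear encoders and feature dimensions are illustrative stand-ins, not ImageBind's actual architecture.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(a.size(0))              # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical stand-in encoders: ImageBind uses modality-specific
# transformers; plain linear maps into a shared 512-d space suffice here.
img_enc = torch.nn.Linear(2048, 512)   # image features -> joint space
aud_enc = torch.nn.Linear(1024, 512)   # audio features -> joint space

imgs = torch.randn(8, 2048)            # a batch of paired (image, audio) features
auds = torch.randn(8, 1024)

# Training only ever sees image-anchored pairs; binding every modality to
# images is what makes, e.g., audio and depth embeddings comparable.
loss = info_nce(img_enc(imgs), aud_enc(auds))
loss.backward()
```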
GLIPv2: Unifying localization and vision-language understanding
We present GLIPv2, a grounded VL understanding model that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …
Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval
Text-to-image person retrieval aims to identify the target person based on a given textual
description query. The primary challenge is to learn the mapping of visual and textual …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
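One distinctive ingredient of this approach is how positive pairs are built: a caption is matched with a video clip that loosely overlaps its time span rather than one cut at the exact same timestamps. A rough sketch of that sampling step is below; the function name, fixed clip length, and boundary handling are assumptions for illustration only.

```python
import random

# Sketch of loosely-overlapping positive-pair sampling: pick a clip whose
# center falls somewhere inside the caption's time span, then clamp the
# clip so it stays within the video.
def sample_overlapping_clip(cap_start, cap_end, video_len, clip_len=10.0):
    center = random.uniform(cap_start, cap_end)          # center inside caption span
    start = min(max(center - clip_len / 2, 0.0), video_len - clip_len)
    return start, start + clip_len

print(sample_overlapping_clip(cap_start=32.0, cap_end=41.5, video_len=120.0))
```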
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval; in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
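The key architectural idea behind a joint video and image encoder is to treat a still image as a single-frame video, so one model handles both input types and both can supervise the same embedding space. The sketch below illustrates only that shape convention; the toy pooling encoder is a hypothetical stand-in for the actual space-time transformer.

```python
import torch

# One encoder consumes clips shaped (B, T, C, H, W); a still image is
# simply a clip with T = 1. Spatial-then-temporal mean pooling stands in
# for the real model here.
def encode(clips: torch.Tensor) -> torch.Tensor:
    b, t, c, h, w = clips.shape
    per_frame = clips.flatten(0, 1).mean(dim=(2, 3))   # (B*T, C) toy frame features
    return per_frame.view(b, t, c).mean(dim=1)         # temporal pooling -> (B, C)

video = torch.randn(4, 8, 3, 224, 224)                 # 8-frame clips
image = torch.randn(4, 3, 224, 224).unsqueeze(1)       # images as 1-frame clips
v_emb, i_emb = encode(video), encode(image)            # same encoder, same space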
Scaling up visual and vision-language representation learning with noisy text supervision
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …
CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
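The simplest way this line of work transfers CLIP to video is to encode each sampled frame with the image tower and mean-pool the frame embeddings over time (the paper's parameter-free similarity variant), then score against the text embedding by cosine similarity. A minimal sketch follows; the linear encoders and feature sizes are hypothetical stand-ins for the actual CLIP towers.

```python
import torch
import torch.nn.functional as F

# CLIP4Clip-style scoring (sketch): per-frame encoding, temporal mean
# pooling, cosine similarity against text embeddings.
frame_enc = torch.nn.Linear(768, 512)
text_enc = torch.nn.Linear(768, 512)

frames = torch.randn(4, 12, 768)   # (batch, sampled frames, frame features)
texts = torch.randn(4, 768)        # (batch, text features)

video_emb = F.normalize(frame_enc(frames).mean(dim=1), dim=-1)  # mean pool over time
text_emb = F.normalize(text_enc(texts), dim=-1)
scores = video_emb @ text_emb.t()  # (4, 4) text-to-video retrieval scores
```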
Learning transferable visual models from natural language supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …
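Because the model learns from free-form text rather than a fixed label set, it can classify images zero-shot by embedding class names as natural-language prompts and picking the most similar one. The sketch below shows that well-known usage pattern; the two encoders are random stand-ins for the actual image and text towers, included only so the snippet runs.

```python
import torch
import torch.nn.functional as F

# Zero-shot classification in the CLIP style (sketch): class names become
# prompts, the text tower embeds them, and an image is assigned to the
# most similar prompt embedding.
def text_encode(prompts):                 # stand-in: -> (num_prompts, D)
    return torch.randn(len(prompts), 512)

def image_encode(images):                 # stand-in: -> (batch, D)
    return torch.randn(images.size(0), 512)

classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]

t = F.normalize(text_encode(prompts), dim=-1)
v = F.normalize(image_encode(torch.randn(2, 3, 224, 224)), dim=-1)
probs = (100.0 * v @ t.t()).softmax(dim=-1)    # (2, 3) class probabilities
preds = [classes[int(i)] for i in probs.argmax(dim=-1)]
print(preds)
```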