Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Multimodal machine learning: A survey and taxonomy
T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
InstructDiffusion: A generalist modeling interface for vision tasks
We present InstructDiffusion, a unified and generic framework for aligning computer vision
tasks with human instructions. Unlike existing approaches that integrate prior knowledge …
Vision-language pre-training with triple contrastive learning
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to …
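The InfoNCE-style image-text alignment named in the snippet above can be sketched as a symmetric cross-entropy over a batch similarity matrix, where each image is pulled toward its paired caption and pushed away from the other captions in the batch. This is a minimal NumPy sketch under that assumption; the function name, temperature value, and toy embeddings are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of `image_emb` is assumed to pair with row i of `text_emb`;
    every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # matching pairs sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch: perfectly aligned pairs should score a much lower loss
# than randomly mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss_aligned = info_nce_loss(emb, emb)                        # identical embeddings
loss_random = info_nce_loss(emb, rng.normal(size=(4, 8)))     # unrelated embeddings
print(loss_aligned < loss_random)
```

In practice the same objective is computed on encoder outputs with learned (often log-parameterized) temperature, but the batch-diagonal structure is the core idea.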
Unified contrastive learning in image-text-label space
Visual recognition is recently learned via either supervised learning on human-annotated
image-label data or language-image contrastive learning with web-crawled image-text …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval: in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
Towards language-free training for text-to-image generation
One of the major challenges in training text-to-image generation models is the need for a
large number of high-quality text-image pairs. While image samples are often easily …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …
TransVG: End-to-end visual grounding with transformers
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …