Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
PaLI: A jointly-scaled multilingual language-image model
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …
MM-Vet: Evaluating large multimodal models for integrated capabilities
We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …
GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
Flamingo: a visual language model for few-shot learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …
Pix2Struct: Screenshot parsing as pretraining for visual language understanding
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …
Scaling up vision-language pre-training for image captioning
In recent years, we have witnessed a significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …
Learning transferable visual models from natural language supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …
Coarse-to-fine vision-language pre-training with fusion in the backbone
Vision-language (VL) pre-training has recently received considerable attention.
However, most existing end-to-end pre-training approaches either only aim to tackle VL …