Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Scalable 3D captioning with pretrained models

T Luo, C Rockwell, H Lee… - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects.
This approach utilizes pretrained models from image captioning, image-text alignment, and …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …

Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - Proceedings of the …, 2021 - openaccess.thecvf.com
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …

VinVL: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition

SC Huang, L Shen, MP Lungren… - Proceedings of the …, 2021 - openaccess.thecvf.com
In recent years, the growing number of medical imaging studies is placing an ever-increasing
burden on radiologists. Deep learning provides a promising solution for …