Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Contextual transformer networks for visual recognition

Y Li, T Yao, Y Pan, T Mei - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Transformer with self-attention has revolutionized the field of natural language processing,
and has recently inspired the emergence of Transformer-style architecture designs with …

Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
In recent years, we have witnessed a significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …

VinVL: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

A unified sequence interface for vision tasks

T Chen, S Saxena, L Li, TY Lin… - Advances in Neural …, 2022 - proceedings.neurips.cc
While language tasks are naturally expressed in a single, unified modeling framework, i.e.,
generating sequences of tokens, this has not been the case in computer vision. As a result …

A review of uncertainty quantification in deep learning: Techniques, applications and challenges

M Abdar, F Pourpanah, S Hussain, D Rezazadegan… - Information fusion, 2021 - Elsevier
Uncertainty quantification (UQ) methods play a pivotal role in reducing the impact of
uncertainties during both optimization and decision making processes. They have been …

Wave-ViT: Unifying wavelet and transformers for visual representation learning

T Yao, Y Pan, Y Li, CW Ngo, T Mei - European Conference on Computer …, 2022 - Springer
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for
computer vision tasks, while the self-attention computation in Transformer scales …