Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
The rise and potential of large language model based agents: A survey
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Contextual transformer networks for visual recognition
Transformer with self-attention has led to the revolutionizing of natural language processing
field, and recently inspires the emergence of Transformer-style architecture design with …
Scaling up vision-language pre-training for image captioning
In recent years, we have witnessed significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
A unified sequence interface for vision tasks
While language tasks are naturally expressed in a single, unified, modeling framework, i.e.,
generating sequences of tokens, this has not been the case in computer vision. As a result …
A review of uncertainty quantification in deep learning: Techniques, applications and challenges
Uncertainty quantification (UQ) methods play a pivotal role in reducing the impact of
uncertainties during both optimization and decision making processes. They have been …
Wave-ViT: Unifying wavelet and transformers for visual representation learning
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for
computer vision tasks, while the self-attention computation in Transformer scales …