X-linear attention networks for image captioning

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 186 Related articles All 7 versions

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Cited by 378 Related articles All 11 versions

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org

For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

Cited by 637 Related articles All 4 versions

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 564 Related articles All 9 versions

Contextual transformer networks for visual recognition

Y Li, T Yao, Y Pan, T Mei - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org

Transformer with self-attention has led to the revolutionizing of natural language processing
field, and recently inspires the emergence of Transformer-style architecture design with …

Cited by 561 Related articles All 7 versions

Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com

In recent years, we have witnessed significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …

Cited by 297 Related articles All 5 versions

Vinvl: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com

This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

Cited by 1087 Related articles All 8 versions

A unified sequence interface for vision tasks

T Chen, S Saxena, L Li, TY Lin… - Advances in Neural …, 2022 - proceedings.neurips.cc

While language tasks are naturally expressed in a single, unified, modeling framework, i.e.,
generating sequences of tokens, this has not been the case in computer vision. As a result …

Cited by 137 Related articles All 11 versions

A review of uncertainty quantification in deep learning: Techniques, applications and challenges

M Abdar, F Pourpanah, S Hussain, D Rezazadegan… - Information fusion, 2021 - Elsevier

Uncertainty quantification (UQ) methods play a pivotal role in reducing the impact of
uncertainties during both optimization and decision making processes. They have been …

Cited by 2325 Related articles All 12 versions

Wave-vit: Unifying wavelet and transformers for visual representation learning

T Yao, Y Pan, Y Li, CW Ngo, T Mei - European Conference on Computer …, 2022 - Springer

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for
computer vision tasks, while the self-attention computation in Transformer scales …

Cited by 149 Related articles All 7 versions