The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Graph neural networks: foundation, frontiers and applications

L Wu, P Cui, J Pei, L Zhao, X Guo - … of the 28th ACM SIGKDD Conference …, 2022 - dl.acm.org
The field of graph neural networks (GNNs) has seen rapid and incredible strides over the
recent years. Graph neural networks, also known as deep learning on graphs, graph …

Seeing out of the box: End-to-end pre-training for vision-language representation learning

Z Huang, Z Zeng, Y Huang, B Liu… - Proceedings of the …, 2021 - openaccess.thecvf.com
We study on joint learning of Convolutional Neural Network (CNN) and Transformer for
vision-language pre-training (VLPT) which aims to learn cross-modal alignments from …

Towards zero-shot learning: A brief review and an attention-based embedding network

GS Xie, Z Zhang, H Xiong, L Shao… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Zero-shot learning (ZSL), an emerging topic in recent years, targets at distinguishing unseen
class images by taking images from seen classes for training the classifier. Existing works …

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …

Cross attention network for few-shot classification

R Hou, H Chang, B Ma, S Shan… - Advances in neural …, 2019 - proceedings.neurips.cc
Few-shot classification aims to recognize unlabeled samples from unseen classes given
only few labeled samples. The unseen classes and low-data problem make few-shot …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

TransVPR: Transformer-based place recognition with multi-level attention aggregation

R Wang, Y Shen, W Zuo, S Zhou… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual place recognition is a challenging task for applications such as autonomous driving
navigation and mobile robot localization. Distracting elements presenting in complex scenes …

CAMP: Cross-modal adaptive message passing for text-image retrieval

Z Wang, X Liu, H Li, L Sheng, J Yan… - Proceedings of the …, 2019 - openaccess.thecvf.com
Text-image cross-modal retrieval is a challenging task in the field of language and vision.
Most previous approaches independently embed images and sentences into a joint …

Relation-aware graph attention network for visual question answering

L Li, Z Gan, Y Cheng, J Liu - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
In order to answer semantically-complicated questions about an image, a Visual Question
Answering (VQA) model needs to fully understand the visual scene in the image, especially …