The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

A co-attention based multi-modal fusion network for review helpfulness prediction

G Ren, L Diao, F Guo, T Hong - Information Processing & Management, 2024 - Elsevier
Current review helpfulness prediction (RHP) methods rely solely on textual features
and meta features to predict review helpfulness, overlooking the informational value of …

CAAN: Context-aware attention network for visual question answering

C Chen, D Han, CC Chang - Pattern Recognition, 2022 - Elsevier
Understanding multimodal information is the key to visual question answering (VQA) tasks.
Most existing approaches use attention mechanisms to acquire fine-grained information …

CLVIN: Complete language-vision interaction network for visual question answering

C Chen, D Han, X Shen - Knowledge-Based Systems, 2023 - Elsevier
The emergence of the Transformer optimizes the interactive modeling of multimodal
information in visual question answering (VQA) tasks, helping machines better understand …

Local self-attention in transformer for visual question answering

X Shen, D Han, Z Guo, C Chen, J Hua, G Luo - Applied Intelligence, 2023 - Springer
Visual Question Answering (VQA) is a multimodal task that requires models to
understand both textual and visual information. Various VQA models have applied the …

Change detection meets visual question answering

Z Yuan, L Mou, Z Xiong, XX Zhu - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
The Earth's surface is continually changing, and identifying changes plays an important role
in urban planning and sustainability. Although change detection techniques have been …

Learning visual question answering on controlled semantic noisy labels

H Zhang, P Zeng, Y Hu, J Qian, J Song, L Gao - Pattern Recognition, 2023 - Elsevier
Visual Question Answering (VQA) has made great progress recently due to the
increasing ability to understand and encode multi-modal inputs based on deep learning …

Co-attention fusion network for multimodal skin cancer diagnosis

X He, Y Wang, S Zhao, X Chen - Pattern Recognition, 2023 - Elsevier
Recently, multimodal image-based methods have shown great performance in skin cancer
diagnosis. These methods usually use convolutional neural networks (CNNs) to extract the …

VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification

S Bakkali, Z Ming, M Coustaty, M Rusiñol… - Pattern Recognition, 2023 - Elsevier
Multimodal learning from document data has achieved great success lately, as it allows
pre-training semantically meaningful features as a prior for a learnable downstream task. In …

Cascaded feature fusion with multi-level self-attention mechanism for object detection

C Wang, H Wang - Pattern Recognition, 2023 - Elsevier
Object detection has been a challenging task due to the complexity and diversity of objects.
The emergence of self-attention mechanism provides a new clue for feature fusion in object …