The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multi-task learning with deep neural networks: A survey

M Crawshaw - arXiv preprint arXiv:2009.09796, 2020 - arxiv.org
Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are
simultaneously learned by a shared model. Such approaches offer advantages like …
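The snippet describes the core idea of a shared model serving several tasks. A minimal sketch of hard parameter sharing is given below; this is an illustrative PyTorch example, not code from the survey, and the two-task setup and dimensions are assumptions.

```python
# Illustrative sketch of hard parameter sharing: one shared encoder feeds
# two hypothetical task-specific heads, and gradients from both task losses
# update the shared parameters.
import torch
import torch.nn as nn

class SharedMTLModel(nn.Module):
    def __init__(self, in_dim=128, hidden=256, n_classes_a=10, n_classes_b=5):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_classes_a)  # task A head
        self.head_b = nn.Linear(hidden, n_classes_b)  # task B head

    def forward(self, x):
        h = self.shared(x)          # representation shared across tasks
        return self.head_a(h), self.head_b(h)

model = SharedMTLModel()
x = torch.randn(4, 128)
logits_a, logits_b = model(x)
# A joint objective is typically a weighted sum of per-task losses.
loss = nn.functional.cross_entropy(logits_a, torch.randint(0, 10, (4,))) \
     + 0.5 * nn.functional.cross_entropy(logits_b, torch.randint(0, 5, (4,)))
loss.backward()
```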

Multimodal co-attention transformer for survival prediction in gigapixel whole slide images

RJ Chen, MY Lu, WH Weng, TY Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Survival outcome prediction is a challenging weakly-supervised and ordinal regression task
in computational pathology that involves modeling complex interactions within the tumor …

Deep modular co-attention networks for visual question answering

Z Yu, J Yu, Y Cui, D Tao, Q Tian - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …
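As a rough illustration of the co-attention idea in the title (a sketch only, not the paper's modular co-attention network), question tokens can attend over image regions while image regions attend over the question; the feature dimensions below are assumptions.

```python
# Minimal co-attention sketch over pre-extracted question and image features.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.q_over_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_over_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_feats, v_feats):
        # Question tokens attend over image regions...
        q_att, _ = self.q_over_v(q_feats, v_feats, v_feats)
        # ...and image regions attend over question tokens ("co-attention").
        v_att, _ = self.v_over_q(v_feats, q_feats, q_feats)
        return q_att, v_att

q = torch.randn(2, 14, 512)   # hypothetical question token features
v = torch.randn(2, 36, 512)   # hypothetical image region features
q_out, v_out = CoAttentionBlock()(q, v)
```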

Changer: Feature interaction is what you need for change detection

S Fang, K Li, Z Li - IEEE Transactions on Geoscience and …, 2023 - ieeexplore.ieee.org
Change detection is an important tool for long-term Earth observation missions. It takes bi-
temporal images as input and predicts “where” the change has occurred. Different from other …
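A hedged sketch of the bi-temporal setup the abstract describes: a shared (Siamese) encoder over both time steps followed by a simple feature interaction, here an absolute difference. The paper's own interaction layers are more elaborate; the architecture below is an assumption for illustration only.

```python
# Toy bi-temporal change detector: shared encoder, naive feature interaction,
# per-pixel change logits answering "where" the change occurred.
import torch
import torch.nn as nn

class SimpleChangeDetector(nn.Module):
    def __init__(self, ch=3, feat=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(ch, feat, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(feat, 1, 1)  # 1-channel change map

    def forward(self, img_t1, img_t2):
        f1, f2 = self.encoder(img_t1), self.encoder(img_t2)  # shared weights
        interaction = torch.abs(f1 - f2)                      # feature interaction
        return self.head(interaction)

t1 = torch.randn(1, 3, 64, 64)  # image at time 1
t2 = torch.randn(1, 3, 64, 64)  # image at time 2
change_logits = SimpleChangeDetector()(t1, t2)  # shape (1, 1, 64, 64)
```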

Multimodal fusion with co-attention networks for fake news detection

Y Wu, P Zhan, Y Zhang, L Wang… - Findings of the association …, 2021 - aclanthology.org
Fake news that combines textual and visual content tells a more convincing story than text-only
content and can spread quickly on social media. People can be easily deceived by …

Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition

W Liu, JL Qiu, WL Zheng, BL Lu - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Multimodal signals are powerful for emotion recognition since they can represent emotions
comprehensively. In this article, we compare the recognition performance and robustness of …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …
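As a concrete anchor for the survey's scope, the scaled dot-product attention that most neural attention models build on fits in a few lines; this is an illustrative sketch, and the tensor shapes are assumptions.

```python
# Scaled dot-product attention: compatibility scores between queries and keys
# are normalized into weights that select and modulate the values.
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # compatibility scores
    weights = torch.softmax(scores, dim=-1)        # attention distribution
    return weights @ v                             # weighted sum of values

q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 5, 64)
```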

Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering

Y Ding, J Yu, B Liu, Y Hu, M Cui… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Knowledge-based visual question answering requires the ability to associate
external knowledge for open-ended cross-modal scene understanding. One limitation of …