Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Graph neural networks in vision-language image understanding: A survey

H Senior, G Slabaugh, S Yuan, L Rossi - The Visual Computer, 2024 - Springer
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …

M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER

J Wang, Y Yang, K Liu, Z Zhu… - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-
only NER with visual information, has recently attracted considerable attention. Most current …

A survey of efficient fine-tuning methods for Vision-Language Models—Prompt and Adapter

J Xing, J Liu, J Wang, L Sun, X Chen, X Gu… - Computers & Graphics, 2024 - Elsevier
Vision Language Model (VLM) is a popular research field located at the fusion of
computer vision and natural language processing (NLP). With the emergence of transformer …

Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval

X Qin, L Li, F Hao, M Ge, G Pang - Information Processing & Management, 2024 - Elsevier
Image–text retrieval plays a considerable role in associating vision and language. Existing
mainstream approaches focus on fine-grained alignment while ignoring the influence of …

SelfGraphVQA: a self-supervised graph neural network for scene-based question answering

BC de Oliveira Souza, M Aasan… - Proceedings of the …, 2023 - openaccess.thecvf.com
The intersection of vision and language is of major interest due to the increased focus on
seamless integration between recognition and reasoning. Scene graphs (SGs) have …

VQA-GNN: Reasoning with multimodal knowledge via graph neural networks for visual question answering

Y Wang, M Yasunaga, H Ren… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual question answering (VQA) requires systems to perform concept-level reasoning by
unifying unstructured (e.g., the context in question and answer; "QA context") and structured …

Generalized unbiased scene graph generation

X Lyu, L Gao, J Xie, P Zeng, Y Tian, J Shao… - arXiv preprint arXiv …, 2023 - arxiv.org
Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the
predicate-level imbalance that high-frequency classes dominate predictions of rare ones …

Multi-modal adaptive gated mechanism for visual question answering

Y Xu, L Zhang, X Shen - PLOS ONE, 2023 - journals.plos.org
Visual Question Answering (VQA) is a multimodal task that uses natural language to ask and
answer questions based on image content. For multimodal tasks, obtaining accurate …

SceneGATE: Scene-graph based co-attention networks for text visual question answering

F Cao, S Luo, F Nunez, Z Wen, J Poon, SC Han - Robotics, 2023 - mdpi.com
Visual Question Answering (VQA) models fail catastrophically on questions related to the
reading of text-carrying images. However, TextVQA aims to answer questions by …