Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Graph neural networks in vision-language image understanding: A survey
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …
M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER
J Wang, Y Yang, K Liu, Z Zhu… - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-
only NER with visual information, has recently attracted considerable attention. Most current …
A survey of efficient fine-tuning methods for Vision-Language Models—Prompt and Adapter
J Xing, J Liu, J Wang, L Sun, X Chen, X Gu… - Computers & Graphics, 2024 - Elsevier
Vision Language Model (VLM) is a popular research field located at the fusion of
computer vision and natural language processing (NLP). With the emergence of transformer …
Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval
X Qin, L Li, F Hao, M Ge, G Pang - Information Processing & Management, 2024 - Elsevier
Image–text retrieval plays a considerable role in associating vision and language. Existing
mainstream approaches focus on fine-grained alignment while ignoring the influence of …
SelfGraphVQA: a self-supervised graph neural network for scene-based question answering
BC de Oliveira Souza, M Aasan… - Proceedings of the …, 2023 - openaccess.thecvf.com
The intersection of vision and language is of major interest due to the increased focus on
seamless integration between recognition and reasoning. Scene graphs (SGs) have …
VQA-GNN: Reasoning with multimodal knowledge via graph neural networks for visual question answering
Visual question answering (VQA) requires systems to perform concept-level reasoning by
unifying unstructured (e.g., the context in question and answer; "QA context") and structured …
Generalized unbiased scene graph generation
Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the
predicate-level imbalance that high-frequency classes dominate predictions of rare ones …
Multi-modal adaptive gated mechanism for visual question answering
Y Xu, L Zhang, X Shen - PLOS ONE, 2023 - journals.plos.org
Visual Question Answering (VQA) is a multimodal task that uses natural language to ask and
answer questions based on image content. For multimodal tasks, obtaining accurate …
SceneGATE: Scene-graph based co-attention networks for text visual question answering
Visual Question Answering (VQA) models fail catastrophically on questions related to the
reading of text-carrying images. However, TextVQA aims to answer questions by …