SGEITL: Scene graph enhanced image-text learning for visual commonsense reasoning

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 512 · Related articles · All 9 versions

Graph neural networks in vision-language image understanding: A survey

H Senior, G Slabaugh, S Yuan, L Rossi - The Visual Computer, 2024 - Springer

2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …

Cited by 13 · Related articles · All 7 versions

M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER

J Wang, Y Yang, K Liu, Z Zhu… - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org

Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-
only NER with visual information, has recently attracted considerable attention. Most current …

Cited by 25 · Related articles · All 2 versions

A survey of efficient fine-tuning methods for Vision-Language Models—Prompt and Adapter

J Xing, J Liu, J Wang, L Sun, X Chen, X Gu… - Computers & Graphics, 2024 - Elsevier

Vision Language Model (VLM) is a popular research field located at the fusion of
computer vision and natural language processing (NLP). With the emergence of transformer …

Cited by 9 · Related articles

Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval

X Qin, L Li, F Hao, M Ge, G Pang - Information Processing & Management, 2024 - Elsevier

Image–text retrieval plays a considerable role in associating vision and language. Existing
mainstream approaches focus on fine-grained alignment while ignoring the influence of …

Cited by 3 · Related articles · All 2 versions

SelfGraphVQA: a self-supervised graph neural network for scene-based question answering

BC de Oliveira Souza, M Aasan… - Proceedings of the …, 2023 - openaccess.thecvf.com

The intersection of vision and language is of major interest due to the increased focus on
seamless integration between recognition and reasoning. Scene graphs (SGs) have …

Cited by 4 · Related articles · All 7 versions

VQA-GNN: Reasoning with multimodal knowledge via graph neural networks for visual question answering

Y Wang, M Yasunaga, H Ren… - Proceedings of the …, 2023 - openaccess.thecvf.com

Visual question answering (VQA) requires systems to perform concept-level reasoning by
unifying unstructured (e.g., the context in question and answer; "QA context") and structured …

Cited by 23 · Related articles · All 4 versions

Generalized unbiased scene graph generation

X Lyu, L Gao, J Xie, P Zeng, Y Tian, J Shao… - arXiv preprint arXiv …, 2023 - arxiv.org

Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the
predicate-level imbalance that high-frequency classes dominate predictions of rare ones …

Cited by 6 · Related articles · All 2 versions

Multi-modal adaptive gated mechanism for visual question answering

Y Xu, L Zhang, X Shen - PLOS ONE, 2023 - journals.plos.org

Visual Question Answering (VQA) is a multimodal task that uses natural language to ask and
answer questions based on image content. For multimodal tasks, obtaining accurate …

Cited by 4 · Related articles · All 7 versions

SceneGate: Scene-graph based co-attention networks for text visual question answering

F Cao, S Luo, F Nunez, Z Wen, J Poon, SC Han - Robotics, 2023 - mdpi.com

Visual Question Answering (VQA) models fail catastrophically on questions related to the
reading of text-carrying images. However, TextVQA aims to answer questions by …

Cited by 8 · Related articles · All 7 versions