Graph neural networks in vision-language image understanding: A survey

H Senior, G Slabaugh, S Yuan, L Rossi - The Visual Computer, 2024 - Springer
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …

Generative multi-modal knowledge retrieval with large language models

X Long, J Zeng, F Meng, Z Ma, K Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Knowledge retrieval with multi-modal queries plays a crucial role in supporting
knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of …

LaKo: Knowledge-driven visual question answering via late knowledge-to-text injection

Z Chen, Y Huang, J Chen, Y Geng, Y Fang… - Proceedings of the 11th …, 2022 - dl.acm.org
Visual question answering (VQA) often requires an understanding of visual concepts and
language semantics, which relies on external knowledge. Most existing methods exploit pre …

Outside knowledge visual question answering version 2.0

BZ Reichman, A Sundar, C Richardson… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Visual question answering (VQA) lies at the intersection of language and vision research. It
functions as a building block for multimodal conversational AI and serves as a testbed for …

Towards reasoning-aware explainable VQA

R Vaideeswaran, F Gao, A Mathur, G Thattai - arXiv preprint arXiv …, 2022 - arxiv.org
The domain of joint vision-language understanding, especially in the context of reasoning in
Visual Question Answering (VQA) models, has garnered significant attention in the recent …

CRIC: A VQA dataset for compositional reasoning on vision and commonsense

D Gao, R Wang, S Shan, X Chen - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Alternatively inferring on the visual facts and commonsense is fundamental for an advanced
visual question answering (VQA) system. This ability requires models to go beyond the …

Multimodal Information Retrieval

M Luo, T Gokhale, N Varshney, Y Yang… - Advances in Multimodal …, 2024 - Springer
In today's rapidly evolving digital landscape, the wealth of available information has
expanded beyond the boundaries of traditional text-based content. With the proliferation of …

A retriever-reader framework with visual entity linking for knowledge-based visual question answering

J You, Z Yang, Q Li, W Liu - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
In this paper, we propose a Retriever-Reader framework with Visual Entity Linking (RR-VEL)
for knowledge-based visual question answering. Given images and original questions, the …

CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

D Go, T Whang, C Lee, H Kim, S Park, S Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language
Models (MLLMs) has expanded the scope of multimodal query resolution. However, current …

Breaking Boundaries Between Linguistics and Artificial Intelligence: Innovation in Vision-Language Matching for Multi-Modal Robots

J Wang, Y Tie, X Jiang, Y Xu - Journal of Organizational and End …, 2023 - igi-global.com
There is a wide connection between linguistics and artificial intelligence (AI), including the
multimodal language matching. Multi-modal robots possess the capability to process various …