Graph neural networks in vision-language image understanding: A survey
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …
Generative multi-modal knowledge retrieval with large language models
Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-
intensive multi-modal applications. However, existing methods face challenges in terms of …
LaKo: Knowledge-driven visual question answering via late knowledge-to-text injection
Visual question answering (VQA) often requires an understanding of visual concepts and
language semantics, which relies on external knowledge. Most existing methods exploit pre …
Outside knowledge visual question answering version 2.0
Visual question answering (VQA) lies at the intersection of language and vision research. It
functions as a building block for multimodal conversational AI and serves as a testbed for …
Towards reasoning-aware explainable VQA
The domain of joint vision-language understanding, especially in the context of reasoning in
Visual Question Answering (VQA) models, has garnered significant attention in the recent …
CRIC: A VQA dataset for compositional reasoning on vision and commonsense
Alternatively inferring on the visual facts and commonsense is fundamental for an advanced
visual question answering (VQA) system. This ability requires models to go beyond the …
Multimodal Information Retrieval
In today's rapidly evolving digital landscape, the wealth of available information has
expanded beyond the boundaries of traditional text-based content. With the proliferation of …
A retriever-reader framework with visual entity linking for knowledge-based visual question answering
In this paper, we propose a Retriever-Reader framework with Visual Entity Linking (RR-VEL)
for knowledge-based visual question answering. Given images and original questions, the …
CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model
The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language
Models (MLLMs) has expanded the scope of multimodal query resolution. However, current …
Breaking Boundaries Between Linguistics and Artificial Intelligence: Innovation in Vision-Language Matching for Multi-Modal Robots
J Wang, Y Tie, X Jiang, Y Xu - Journal of Organizational and End …, 2023 - igi-global.com
There is a wide connection between linguistics and artificial intelligence (AI), including the
multimodal language matching. Multi-modal robots possess the capability to process various …