Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
ViperGPT: Visual inference via Python execution for reasoning
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
Learn to explain: Multimodal reasoning via thought chains for science question answering
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …
Prompting large language models with answer heuristics for knowledge-based visual question answering
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
PromptCap: Prompt-guided task-aware image captioning
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …
BenchLMM: Benchmarking cross-style visual capability of large multimodal models
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown
remarkable capabilities in visual reasoning on data in common image styles. However, their …
SQA3D: Situated question answering in 3D scenes
We propose a new task to benchmark scene understanding of embodied agents: Situated
Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D …
CoTDet: Affordance knowledge prompting for task-driven object detection
Task-driven object detection aims to detect object instances suitable for affording a task in an
image. Its challenge lies in object categories available for the task being too diverse to be …