Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
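
The core idea, per the title, is to have an LLM write a short Python program against a provided vision API and then execute it to answer the query. Below is a minimal sketch of that pattern; `Patch`, `find`, and `llm_generate_code` are illustrative stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of the ViperGPT pattern: an LLM writes a short Python
# program against a small, documented vision API, and executing that program
# yields the answer. `Patch`, `find`, and `llm_generate_code` are stand-ins,
# not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Patch:
    label: str  # stand-in for a detected image region

def find(image: list[Patch], name: str) -> list[Patch]:
    """Stand-in for an open-vocabulary detector returning matching regions."""
    return [p for p in image if p.label == name]

def llm_generate_code(query: str) -> str:
    """Stand-in for the code-generation step; a real system would prompt an
    LLM with the API documentation and the user's query."""
    return "result = len(find(image, 'muffin'))"

image = [Patch("muffin"), Patch("muffin"), Patch("plate")]
scope = {"find": find, "image": image}
exec(llm_generate_code("How many muffins are in the image?"), scope)
print(scope["result"])  # -> 2
```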

Learn to explain: Multimodal reasoning via thought chains for science question answering

P Lu, S Mishra, T Xia, L Qiu… - Advances in …, 2022 - proceedings.neurips.cc
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …
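
The two-stage pattern the snippet alludes to, generate a rationale first and answer second, can be sketched as follows. This is a prompting-flavored illustration under stated assumptions: `call_llm` is a canned stand-in for any text-generation backend, and the paper itself fine-tunes a model for the two stages rather than prompting one.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; returns a canned string."""
    return "canned model output"

def answer_with_cot(question: str, options: list[str], image_caption: str) -> str:
    # Stage 1: rationale generation from the question plus visual context.
    rationale = call_llm(
        f"Context: {image_caption}\nQuestion: {question}\n"
        f"Options: {options}\nExplain step by step.\nRationale:"
    )
    # Stage 2: answer inference conditioned on the generated rationale.
    return call_llm(
        f"Context: {image_caption}\nQuestion: {question}\n"
        f"Options: {options}\nRationale: {rationale}\nAnswer:"
    )

print(answer_with_cot("Which object conducts electricity?",
                      ["wooden spoon", "copper wire"],
                      "a copper wire next to a wooden spoon"))
```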

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
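
The "answer heuristics" idea is to surface a vanilla VQA model's top candidate answers, with their confidences, inside the LLM prompt as hints. A hedged sketch of such prompt construction follows; the template is an assumption, not the paper's exact format.

```python
# Illustrative sketch of answer-heuristics prompting: candidate answers
# (with confidences) from a vanilla VQA model are written into the LLM
# prompt as hints. The template below is an assumption.
def build_prompt(caption: str, question: str,
                 candidates: list[tuple[str, float]]) -> str:
    hints = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (
        "Answer the question using the context and candidate answers.\n"
        f"Context: {caption}\n"
        f"Question: {question}\n"
        f"Candidates: {hints}\n"
        "Answer:"
    )

print(build_prompt(
    "a man riding a wave on a surfboard",
    "What sport is this?",
    [("surfing", 0.92), ("skateboarding", 0.05)],
))
```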

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
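
"Late interaction" here refers to ColBERT-style token-level scoring: each query token is matched against its best document token, and the per-token maxima are summed (MaxSim). A minimal, runnable NumPy sketch follows; the embedding dimension and the concatenation of text and visual tokens are illustrative assumptions.

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: each query token keeps its best-matching
    document token; the per-token maxima are summed into one score."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens) cosine sims
    return float(sim.max(axis=1).sum())

# Toy multi-modal query: text-token embeddings concatenated with a few
# visual-region embeddings (the 128-dim size is illustrative).
rng = np.random.default_rng(0)
query = np.vstack([rng.normal(size=(4, 128)),   # text tokens
                   rng.normal(size=(2, 128))])  # visual tokens
doc = rng.normal(size=(16, 128))                # document token embeddings
print(late_interaction_score(query, doc))
```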

PromptCap: Prompt-guided image captioning for VQA with GPT-3

Y Hu, H Hua, Z Yang, W Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …
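
The pipeline suggested by the title: generate a caption tailored to the question, then let a text-only LLM answer from that caption. A minimal sketch under stated assumptions; both model calls below are canned placeholders for the trained question-aware captioner and the GPT-3-style LLM.

```python
# Minimal sketch of the PromptCap-style pipeline: a question-aware caption
# stands in for the image, and a text-only LLM answers from it. Both calls
# are placeholders; the real system trains a captioner to produce captions
# tailored to the question.
def question_aware_caption(image_path: str, question: str) -> str:
    # Placeholder for a trained captioner conditioned on the question.
    return "a red stop sign mounted on a pole at an intersection"

def text_llm(prompt: str) -> str:
    # Placeholder for a GPT-3-style text completion call.
    return "stop sign"

def vqa(image_path: str, question: str) -> str:
    caption = question_aware_caption(image_path, question)
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return text_llm(prompt)

print(vqa("street.jpg", "What does the sign say to do?"))
```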

BenchLMM: Benchmarking cross-style visual capability of large multimodal models

R Cai, Z Song, D Guan, Z Chen, Y Li, X Luo… - … on Computer Vision, 2025 - Springer
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown
remarkable capabilities in visual reasoning on data in common image styles. However, their …

SQA3D: Situated question answering in 3D scenes

X Ma, S Yong, Z Zheng, Q Li, Y Liang, SC Zhu… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose a new task to benchmark scene understanding of embodied agents: Situated
Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D …

CoTDet: Affordance knowledge prompting for task driven object detection

J Tang, G Zheng, J Yu, S Yang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Task driven object detection aims to detect object instances suitable for affording a task in an
image. The challenge is that the object categories available for the task are too diverse to be …
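
One way to read "affordance knowledge prompting": ask an LLM which objects (and which visual attributes) afford the task, then use that knowledge to score or filter detections. The sketch below is a loose, hypothetical rendering of that idea; `ask_affordances` returns canned output where a real system would query an LLM.

```python
# Hedged sketch of affordance-knowledge prompting for task-driven detection:
# an LLM is asked which objects can afford a task, and detections are kept
# only if their category appears in that knowledge. The prompt wording and
# the filtering step are illustrative assumptions.
def ask_affordances(task: str) -> set[str]:
    # Placeholder for an LLM call such as: "List objects that could be used
    # to {task}, with the visual attributes that make them suitable."
    return {"hammer", "rock", "wrench"}  # canned example output

def filter_detections(detections: list[dict], task: str) -> list[dict]:
    afforded = ask_affordances(task)
    return [d for d in detections if d["category"] in afforded]

dets = [{"category": "rock", "box": (10, 10, 50, 50)},
        {"category": "cup", "box": (60, 20, 90, 70)}]
print(filter_detections(dets, "pound a nail"))  # keeps only the rock
```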