Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
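
The core idea, per the title, is to have an LLM write a short Python program against a provided vision API and then execute it to answer the query. Below is a minimal sketch of that pattern; `Patch`, `find`, and `llm_generate_code` are illustrative stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of the ViperGPT pattern: an LLM writes a short Python
# program against a small, documented vision API, and executing that program
# yields the answer. `Patch`, `find`, and `llm_generate_code` are stand-ins,
# not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Patch:
    label: str  # stand-in for a detected image region

def find(image: list[Patch], name: str) -> list[Patch]:
    """Stand-in for an open-vocabulary detector returning matching regions."""
    return [p for p in image if p.label == name]

def llm_generate_code(query: str) -> str:
    """Stand-in for the code-generation step; a real system would prompt an
    LLM with the API documentation and the user's query."""
    return "result = len(find(image, 'muffin'))"

image = [Patch("muffin"), Patch("muffin"), Patch("plate")]
scope = {"find": find, "image": image}
exec(llm_generate_code("How many muffins are in the image?"), scope)
print(scope["result"])  # -> 2
```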

Learn to explain: Multimodal reasoning via thought chains for science question answering

P Lu, S Mishra, T Xia, L Qiu… - Advances in …, 2022 - proceedings.neurips.cc
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …
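
The two-stage pattern the snippet alludes to, generate a rationale first and answer second, can be sketched as follows. This is a prompting-flavored illustration under stated assumptions: `call_llm` is a canned stand-in for any text-generation backend, and the paper itself fine-tunes a model for the two stages rather than prompting one.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; returns a canned string."""
    return "canned model output"

def answer_with_cot(question: str, options: list[str], image_caption: str) -> str:
    # Stage 1: rationale generation from the question plus visual context.
    rationale = call_llm(
        f"Context: {image_caption}\nQuestion: {question}\n"
        f"Options: {options}\nExplain step by step.\nRationale:"
    )
    # Stage 2: answer inference conditioned on the generated rationale.
    return call_llm(
        f"Context: {image_caption}\nQuestion: {question}\n"
        f"Options: {options}\nRationale: {rationale}\nAnswer:"
    )

print(answer_with_cot("Which object conducts electricity?",
                      ["wooden spoon", "copper wire"],
                      "a copper wire next to a wooden spoon"))
```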

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
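
The "answer heuristics" idea is to surface a vanilla VQA model's top candidate answers, with their confidences, inside the LLM prompt as hints. A hedged sketch of such prompt construction follows; the template is an assumption, not the paper's exact format.

```python
# Illustrative sketch of answer-heuristics prompting: candidate answers
# (with confidences) from a vanilla VQA model are written into the LLM
# prompt as hints. The template below is an assumption.
def build_prompt(caption: str, question: str,
                 candidates: list[tuple[str, float]]) -> str:
    hints = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (
        "Answer the question using the context and candidate answers.\n"
        f"Context: {caption}\n"
        f"Question: {question}\n"
        f"Candidates: {hints}\n"
        "Answer:"
    )

print(build_prompt(
    "a man riding a wave on a surfboard",
    "What sport is this?",
    [("surfing", 0.92), ("skateboarding", 0.05)],
))
```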

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
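
"Late interaction" here refers to ColBERT-style token-level scoring: each query token is matched against its best document token, and the per-token maxima are summed (MaxSim). A minimal, runnable NumPy sketch follows; the embedding dimension and the concatenation of text and visual tokens are illustrative assumptions.

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: each query token keeps its best-matching
    document token; the per-token maxima are summed into one score."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens) cosine sims
    return float(sim.max(axis=1).sum())

# Toy multi-modal query: text-token embeddings concatenated with a few
# visual-region embeddings (the 128-dim size is illustrative).
rng = np.random.default_rng(0)
query = np.vstack([rng.normal(size=(4, 128)),   # text tokens
                   rng.normal(size=(2, 128))])  # visual tokens
doc = rng.normal(size=(16, 128))                # document token embeddings
print(late_interaction_score(query, doc))
```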

PromptCap: Prompt-guided image captioning for VQA with GPT-3

Y Hu, H Hua, Z Yang, W Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) involves questions that require
world knowledge beyond the image to yield the correct answer. Large language models …
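
The pipeline suggested by the title: generate a caption tailored to the question, then let a text-only LLM answer from that caption. A minimal sketch under stated assumptions; both model calls below are canned placeholders for the trained question-aware captioner and the GPT-3-style LLM.

```python
# Minimal sketch of the PromptCap-style pipeline: a question-aware caption
# stands in for the image, and a text-only LLM answers from it. Both calls
# are placeholders; the real system trains a captioner to produce captions
# tailored to the question.
def question_aware_caption(image_path: str, question: str) -> str:
    # Placeholder for a trained captioner conditioned on the question.
    return "a red stop sign mounted on a pole at an intersection"

def text_llm(prompt: str) -> str:
    # Placeholder for a GPT-3-style text completion call.
    return "stop sign"

def vqa(image_path: str, question: str) -> str:
    caption = question_aware_caption(image_path, question)
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return text_llm(prompt)

print(vqa("street.jpg", "What does the sign say to do?"))
```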

BenchLMM: Benchmarking cross-style visual capability of large multimodal models

R Cai, Z Song, D Guan, Z Chen, Y Li, X Luo… - … on Computer Vision, 2025 - Springer
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown
remarkable capabilities in visual reasoning on data in common image styles. However, their …

SQA3D: Situated question answering in 3D scenes

X Ma, S Yong, Z Zheng, Q Li, Y Liang, SC Zhu… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose a new task to benchmark scene understanding of embodied agents: Situated
Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D …

CoTDet: Affordance knowledge prompting for task driven object detection

J Tang, G Zheng, J Yu, S Yang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Task driven object detection aims to detect object instances suitable for affording a task in an
image. The challenge is that the object categories available for the task are too diverse to be …
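
One way to read "affordance knowledge prompting": ask an LLM which objects (and which visual attributes) afford the task, then use that knowledge to score or filter detections. The sketch below is a loose, hypothetical rendering of that idea; `ask_affordances` returns canned output where a real system would query an LLM.

```python
# Hedged sketch of affordance-knowledge prompting for task-driven detection:
# an LLM is asked which objects can afford a task, and detections are kept
# only if their category appears in that knowledge. The prompt wording and
# the filtering step are illustrative assumptions.
def ask_affordances(task: str) -> set[str]:
    # Placeholder for an LLM call such as: "List objects that could be used
    # to {task}, with the visual attributes that make them suitable."
    return {"hammer", "rock", "wrench"}  # canned example output

def filter_detections(detections: list[dict], task: str) -> list[dict]:
    afforded = ask_affordances(task)
    return [d for d in detections if d["category"] in afforded]

dets = [{"category": "rock", "box": (10, 10, 50, 50)},
        {"category": "cup", "box": (60, 20, 90, 70)}]
print(filter_detections(dets, "pound a nail"))  # keeps only the rock
```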