DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Y Wu, Y Wang, S Tang, W Wu, T He, W Ouyang… - … on Computer Vision, 2025 - Springer
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object
detection ability of multimodal large language models (MLLMs), such as GPT-4V and …

FreeVA: Offline MLLM as Training-Free Video Assistant

W Wu - arXiv preprint arXiv:2405.07798, 2024 - arxiv.org
This paper undertakes an empirical study to revisit the latest advancements in Multimodal
Large Language Models (MLLMs): Video Assistant. This study, namely FreeVA, aims to …

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

G Dai, J Zhao, Y Chen, Y Qin, H Zhao, G Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a
target destination, has recently seen significant advancements. In contrast to navigation in …

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

S Bansal, M Wray, D Damen - arXiv preprint arXiv:2404.09933, 2024 - arxiv.org
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of
tasks including visual question answering, recognising objects, and spatial referral. In this …