DetToolChain: A new prompting paradigm to unleash detection ability of MLLM
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object
detection ability of multimodal large language models (MLLMs), such as GPT-4V and …
FreeVA: Offline MLLM as training-free video assistant
W Wu - arXiv preprint arXiv:2405.07798, 2024 - arxiv.org
This paper undertakes an empirical study to revisit the latest advancements in Multimodal
Large Language Models (MLLMs): Video Assistant. This study, namely FreeVA, aims to …
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation
Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a
target destination, has recently seen significant advancements. In contrast to navigation in …
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of
tasks including visual question answering, recognising objects, and spatial referral. In this …