VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Citing articles

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

C Cheng, J Guan, W Wu, R Yan - arXiv preprint arXiv:2406.19934, 2024 - arxiv.org

We explore multi-step reasoning in vision-language models (VLMs). The problem is
challenging, as reasoning data consisting of multiple steps of visual and language …