From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

C Cheng, J Guan, W Wu, R Yan - arXiv preprint arXiv:2406.19934, 2024
We explore multi-step reasoning in vision-language models (VLMs). The problem is
challenging, as reasoning data consisting of multiple steps of visual and language …