How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Z Liu, T Chu, Y Zang, X Wei, X Dong, P Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating natural and meaningful responses to communicate with multi-modal human
inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current …

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

F Zhang, L Wu, H Bai, G Lin, X Li, X Yu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they
demand the comprehension of high-level instructions, complex reasoning, and the …

Towards Flexible Evaluation for Generative Visual Question Answering

H Ji, Q Si, Z Lin, W Wang - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
Throughout the rapid development of multimodal large language models, a crucial ingredient is
a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual …

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

H Zhang, Y Ma, W Shao, P Luo, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba,
have demonstrated significant potential in computer vision tasks due to their linear …

MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

EL Epstein, K Yao, J Li, X Bai, H Palangi - arXiv preprint arXiv:2409.18216, 2024 - arxiv.org
Evaluating instruction-following capabilities for multimodal, multi-turn dialogue is
challenging. With potentially multiple instructions in the input model context, the task is time …

SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

A Li, R Thapa, R Chalamala, Q Wu, K Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Vision-Language Models (VLMs) have shown strong performance in understanding single
images, aided by numerous high-quality instruction datasets. However, multi-image …