How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Generating natural and meaningful responses to communicate with multi-modal human
inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current …
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they
demand the comprehension of high-level instructions, complex reasoning, and the …
Towards Flexible Evaluation for Generative Visual Question Answering
Throughout the rapid development of multimodal large language models, a crucial ingredient is
a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual …
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction
Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba,
have demonstrated significant potential in computer vision tasks due to their linear …
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Evaluating instruction following capabilities for multimodal, multi-turn dialogue is
challenging. With potentially multiple instructions in the input model context, the task is time …
SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
Vision-Language Models (VLMs) have shown strong performance in understanding single
images, aided by numerous high-quality instruction datasets. However, multi-image …