A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - National Science …, 2024 - academic.oup.com
Abstract Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have emerged as a rising research hotspot; they use powerful Large Language Models (LLMs) …

Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Mm-vet: Evaluating large multimodal models for integrated capabilities

W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

P Xu, W Shao, K Zhang, P Gao, S Liu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite this great success, the field lacks a holistic evaluation …

A multimodal generative AI copilot for human pathology

MY Lu, B Chen, DFK Williamson, RJ Chen, M Zhao… - Nature, 2024 - nature.com
Computational pathology has witnessed considerable progress in the development of both
task-specific predictive models and task-agnostic self-supervised vision encoders …

Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

P Zhang, X Dong, B Wang, Y Cao, C Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose InternLM-XComposer, a vision-language large model that enables advanced
image-text comprehension and composition. The innovative nature of our model is …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

Drivevlm: The convergence of autonomous driving and large vision-language models

X Tian, J Gu, B Li, Y Liu, Y Wang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
A primary hurdle of autonomous driving in urban environments is understanding complex
and long-tail scenarios, such as challenging road conditions and delicate human behaviors …

Octopack: Instruction tuning code large language models

N Muennighoff, Q Liu, A Zebaze, Q Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …

HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench", a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …