Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images

R Xu, Y Yao, Z Guo, J Cui, Z Ni, C Ge, TS Chua… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding
the visual world. Conventional LMMs process images in fixed sizes and limited resolutions …

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

J Huang, J Zhang, K Jiang, H Qiu, S Lu - arXiv preprint arXiv:2312.16602, 2023 - arxiv.org
Traditional computer vision generally solves each task independently via a dedicated
model, with the task instruction implicitly designed into the model architecture, giving rise to two …

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

R Huang, X Ding, C Wang, J Han, Y Liu, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer
visual details, enhancing their comprehension capabilities. To reduce the training and …

How Far Are We From AGI

T Feng, C Jin, J Liu, K Zhu, H Tu, Z Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolution of artificial intelligence (AI) has profoundly impacted human society, driving
significant advancements in multiple sectors. Yet, the escalating demands on AI have …

A Survey of Multimodal Large Language Model from a Data-centric Perspective

T Bai, H Liang, B Wan, L Yang, B Li, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Human beings perceive the world through diverse senses such as sight, smell, hearing, and
touch. Similarly, multimodal large language models (MLLMs) enhance the capabilities of …

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

BK Lee, CW Kim, B Park, YM Ro - arXiv preprint arXiv:2405.15574, 2024 - arxiv.org
The rapid development of large language and vision models (LLVMs) has been driven by
advances in visual instruction tuning. Recently, open-source LLVMs have curated high …

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

C Ge, S Cheng, Z Wang, J Yuan, Y Gao, J Song… - arXiv preprint arXiv …, 2024 - arxiv.org
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive
visual tokens and quadratic visual complexity. Current high-resolution LMMs address the …