How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

S Hao, Y Gu, H Luo, T Liu, X Shao, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating accurate step-by-step reasoning is essential for Large Language Models (LLMs)
to address complex problems and enhance robustness and interpretability. Despite the flux …

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

K Huang, F Mo, H Li, Y Li, Y Zhang, W Yi, Y Mao… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapidly developing Large Language Models (LLMs) demonstrate remarkable
multilingual capabilities in natural language processing, attracting global attention in both …

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

J Tang, C Lin, Z Zhao, S Wei, B Wu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-centric visual question answering (VQA) has made great strides with the development
of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of …

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

C Wang, H Duan, S Zhang, D Lin, K Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the large language model (LLM) community has shown increasing interest in
enhancing LLMs' capability to handle extremely long documents. As various long-text …

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Y Song, H Xie, Z Zhang, B Wen, L Ma, Z Mi… - arXiv preprint arXiv …, 2024 - arxiv.org
Exploiting activation sparsity is a promising approach to significantly accelerating the
inference process of large language models (LLMs) without compromising performance …

Needle In A Multimodal Haystack

W Wang, S Zhang, Y Ren, Y Duan, T Li, S Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid advancement of multimodal large language models (MLLMs), their evaluation
has become increasingly comprehensive. However, understanding long multimodal content …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM)
that unifies visual perception, understanding, and generation within a single framework. Unlike …

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

H Que, J Liu, G Zhang, C Zhang, X Qu, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to
expand the model's fundamental understanding of specific downstream domains (e.g., math …