Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …

Math-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Quality assessment in the era of large models: A survey

Z Zhang, Y Zhou, C Li, B Zhao, X Liu, G Zhai - arXiv preprint arXiv …, 2024 - arxiv.org
Quality assessment, which evaluates the visual quality level of multimedia experiences, has
garnered significant attention from researchers and has evolved substantially through …

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

W Wang, Z Chen, W Wang, Y Cao, Y Liu, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …

Improve vision language model chain-of-thought reasoning

R Zhang, B Zhang, Y Li, H Zhang, Z Sun, Z Gan… - arXiv preprint arXiv …, 2024 - arxiv.org
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving
interpretability and trustworthiness. However, current training recipes lack robust CoT …

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

C Zou, X Guo, R Yang, J Zhang, B Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …

MathScape: Evaluating MLLMs in multimodal math scenarios through a hierarchical benchmark

M Zhou, H Liang, T Li, Z Wu, M Lin, L Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of Multimodal Large Language Models (MLLMs), the evaluation of
multimodal models in the context of mathematical problems has become a valuable …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

From introspection to best practices: Principled analysis of demonstrations in multimodal in-context learning

N Xu, F Wang, S Zhang, H Poon, M Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities …