A survey on hallucination in large vision-language models

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …

Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

MM1: Methods, analysis & insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

SciBench: Evaluating college-level scientific problem-solving abilities of large language models

X Wang, Z Hu, P Lu, Y Zhu, J Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in large language models (LLMs) have demonstrated notable progress on
many mathematical benchmarks. However, most of these benchmarks only feature problems …

DeepSeek-VL: Towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …