A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

S Liu, W Pu, C Xu, Z Huang, Q Li, H Wang, C Lin… - 2024 - researchsquare.com
Recent advancements in MLLMs, exemplified by GPT-4o,
have positioned them as a significant focus within the research community. MLLMs leverage …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Improving Long-Text Alignment for Text-to-Image Diffusion Models

L Liu, C Du, T Pang, Z Wang, C Li, D Xu - arXiv preprint arXiv:2410.11817, 2024 - arxiv.org
The rapid advancement of text-to-image (T2I) diffusion models has enabled them to
generate unprecedented results from given texts. However, as text inputs become longer …

Natural Language Inference Improves Compositionality in Vision-Language Models

P Cascante-Bonilla, Y Hou, YT Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these
models often struggle to relate objects, attributes, and spatial relationships. Recent methods …

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

B Ma, Z Zong, G Song, H Li, Y Liu - arXiv preprint arXiv:2406.11831, 2024 - arxiv.org
Large language models (LLMs) based on decoder-only transformers have demonstrated
superior text understanding capabilities compared to CLIP and T5-series models. However …

MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

S Mukhopadhyay, A Rajgaria, P Khatiwada… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and
linguistic information. A particularly promising yet under-explored application for these …