MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Elysium: Exploring object-level perception in videos via MLLM

H Wang, Y Ye, Y Wang, Y Nie, C Huang - European Conference on …, 2025 - Springer
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to
perceive objects in still images, but their application in video-related tasks, such as object …

Training-free video temporal grounding using large-scale pre-trained models

M Zheng, X Cai, Q Chen, Y Peng, Y Liu - European Conference on …, 2025 - Springer
Video temporal grounding aims to identify video segments within untrimmed videos that are
most relevant to a given natural language query. Existing video temporal localization models …
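
For orientation, the task this snippet defines can be sketched in a training-free way: score sampled frames against the query with an off-the-shelf CLIP checkpoint (here via Hugging Face transformers) and return the contiguous window with the highest mean similarity. The model name, window size, and windowing heuristic are illustrative assumptions, not the paper's pipeline.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Off-the-shelf checkpoint; an assumption, not the paper's choice of model.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def ground_query(frames: list[Image.Image], query: str, window: int = 8) -> tuple[int, int]:
        """Return (start, end) indices of the frame span best matching the query.

        Assumes len(frames) >= window; frames are pre-sampled from the video.
        """
        inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # CLIP returns L2-normalized projected embeddings, so cosine similarity
        # between each frame and the query reduces to a dot product.
        sims = (out.image_embeds @ out.text_embeds.T).squeeze(-1)  # (num_frames,)
        # Slide a fixed-size window; keep the span with the highest mean score.
        scores = sims.unfold(0, window, 1).mean(dim=-1)  # (num_frames - window + 1,)
        start = int(scores.argmax())
        return start, start + window - 1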

Language-driven visual consensus for zero-shot semantic segmentation

Z Zhang, W Ke, Y Zhu, X Liang, J Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …
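
The alignment this snippet describes can be sketched in a few lines: class names are embedded with CLIP's text encoder, and visual features are labeled by similarity to those class embeddings. Real zero-shot segmentation scores dense per-pixel features; for brevity this hypothetical helper scores whole image regions, and the prompt template and model name are assumptions.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def label_regions(regions: list[Image.Image], class_names: list[str]) -> list[str]:
        """Assign each region the class whose text embedding it aligns with best."""
        prompts = [f"a photo of a {c}" for c in class_names]  # assumed template
        inputs = processor(text=prompts, images=regions, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # logits_per_image is a (num_regions, num_classes) similarity matrix.
        return [class_names[i] for i in out.logits_per_image.argmax(dim=-1)]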

The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio

S Leng, Y Xing, Z Cheng, Y Zhou, H Zhang, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in large multimodal models (LMMs) have significantly enhanced
performance across diverse tasks, with ongoing efforts to further integrate additional …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed the rapid development of large language models (LLMs). Building
on these powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model

Z Li, W Wang, YQ Cai, X Qi, P Wang, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Significant advancements have recently been achieved in the field of multi-modal large
language models (MLLMs), demonstrating their remarkable capabilities in understanding …

V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

H Hua, Y Tang, C Xu, J Luo - arXiv preprint arXiv:2404.12353, 2024 - arxiv.org
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …