MM-LLMs: Recent advances in multimodal large language models
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
A comprehensive review of multimodal large language models: Performance and challenges across different tasks
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Elysium: Exploring object-level perception in videos via MLLM
H Wang, Y Ye, Y Wang, Y Nie, C Huang - European Conference on …, 2025 - Springer
Abstract Multi-modal Large Language Models (MLLMs) have demonstrated their ability to
perceive objects in still images, but their application in video-related tasks, such as object …
Training-free video temporal grounding using large-scale pre-trained models
Video temporal grounding aims to identify video segments within untrimmed videos that are
most relevant to a given natural language query. Existing video temporal localization models …
Language-driven visual consensus for zero-shot semantic segmentation
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …
The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio
Recent advancements in large multimodal models (LMMs) have significantly enhanced
performance across diverse tasks, with ongoing efforts to further integrate additional …
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …
UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model
Significant advancements have recently been achieved in the field of multi-modal large
language models (MLLMs), demonstrating their remarkable capabilities in understanding …
V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …