MuChoMusic: Evaluating music understanding in multimodal audio-language models
Multimodal models that jointly process audio and language hold great promise in audio
understanding and are increasingly being adopted in the music domain. By allowing users …
Multimodal task vectors enable many-shot multimodal in-context learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning
suggests that in-context learning (ICL) with many examples can be promising for learning …
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved
notable success, which is capable of achieving superior performance across various tasks …
Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey
Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …
GLOV: Guided large language models as implicit optimizers for vision language models
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs)
to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream …
Enhancing explainability in multimodal large language models using ontological context
Recently, there has been a growing interest in Multimodal Large Language Models (MLLMs)
due to their remarkable potential in various tasks integrating different modalities, such as …
SketchAgent: Language-Driven Sequential Sketch Generation
Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and
visual communication that spans various disciplines. While artificial systems have driven …
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and
word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a …
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
As language models continue to scale, Large Language Models (LLMs) have exhibited
emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks …