MuChoMusic: Evaluating music understanding in multimodal audio-language models

B Weck, I Manco, E Benetos, E Quinton… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal models that jointly process audio and language hold great promise for audio
understanding and are increasingly being adopted in the music domain. By allowing users …

Multimodal task vectors enable many-shot multimodal in-context learning

B Huang, C Mitra, A Arbelle, L Karlinsky… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning
suggests that in-context learning (ICL) with many examples can be promising for learning …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed the rapid development of large language models (LLMs).
Building on these powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

L Qin, Q Chen, H Fei, Z Chen, M Li, W Che - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved
notable success, enabling superior performance across various tasks …

Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

GLOV: Guided large language models as implicit optimizers for vision language models

MJ Mirza, M Zhao, Z Mao, S Doveh, W Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs)
to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream …

Enhancing explainability in multimodal large language models using ontological context

J Amara, B König-Ries, S Samuel - arXiv preprint arXiv:2409.18753, 2024 - arxiv.org
Recently, there has been a growing interest in Multimodal Large Language Models (MLLMs)
due to their remarkable potential in various tasks integrating different modalities, such as …

SketchAgent: Language-Driven Sequential Sketch Generation

Y Vinker, TR Shaham, K Zheng, A Zhao, JE Fan… - arXiv preprint arXiv …, 2024 - arxiv.org
Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and
visual communication that spans various disciplines. While artificial systems have driven …

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

I Huang, W Lin, MJ Mirza, JA Hansen, S Doveh… - arXiv preprint arXiv …, 2024 - arxiv.org
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and
word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a …

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

H Jia, C Jiang, H Xu, W Ye, M Dong, M Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
As language models continue to scale, Large Language Models (LLMs) have exhibited
emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks …