What Makes Multimodal In-Context Learning Work?

FB Baldassini, M Shukor, M Cord… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context …

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have developed rapidly in recent years. Building on these powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

BT Corradini, M Shukor, P Couairon… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models have exhibited unprecedented capabilities in tackling many domains
and tasks. Models such as CLIP are currently widely used to bridge cross-modal …

A Concept-Based Explainability Framework for Large Multimodal Models

J Parekh, P Khayatan, M Shukor, A Newson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large multimodal models (LMMs) combine unimodal encoders and large language models
(LLMs) to perform multimodal tasks. Despite recent advancements towards the …