Foundational models defining a new era in vision: A survey and outlook

M Awais, M Naseer, S Khan, RM Anwer… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Foundational models in medical imaging: A comprehensive survey and future vision

B Azad, R Azad, S Eskandari, A Bozorgpour… - arXiv preprint arXiv …, 2023 - arxiv.org
Foundation models, large-scale, pre-trained deep-learning models adapted to a wide range
of downstream tasks, have gained significant interest lately in various deep-learning …

Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …

SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2024 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure compositional understanding
of vision-language models has permeated the machine learning ecosystem …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph surveys the evolution of multimodal foundation models with vision and
vision-language capabilities, tracing the transition from specialist models to general-purpose …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

The all-seeing project: Towards panoptic visual recognition and understanding of the open world

W Wang, M Shi, Q Li, W Wang, Z Huang, L Xing… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and
understanding everything in the open world. Using a scalable data engine that incorporates …

GIVT: Generative infinite-vocabulary transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2025 - Springer
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

SILC: Improving vision language pretraining with self-distillation

MF Naeem, Y Xian, X Zhai, L Hoyer, L Van Gool… - … on Computer Vision, 2025 - Springer
Image-Text pretraining on web-scale image caption datasets has become the default recipe
for open vocabulary classification and retrieval models thanks to the success of CLIP and its …