Foundational models defining a new era in vision: A survey and outlook
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their …
Foundational models in medical imaging: A comprehensive survey and future vision
Foundation models, large-scale, pre-trained deep-learning models adapted to a wide range of downstream tasks, have gained significant interest lately in various deep-learning …
Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the …
SugarCrepe: Fixing hackable benchmarks for vision-language compositionality
In the last year alone, a surge of new benchmarks to measure $\textit{compositional}$ understanding of vision-language models has permeated the machine learning ecosystem …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
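The composition pattern this entry describes, a contrastively pretrained vision encoder feeding image tokens through a linear projection into a language model as a prefix, can be sketched in a few lines of PyTorch. Everything below is an illustrative stand-in, not the released PaliGemma code: the module sizes, layer counts, and toy vocabulary are assumptions, and the real Gemma decoder uses prefix-LM attention rather than the plain bidirectional encoder used here.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for the encoder + LM composition; not the real models."""
    def __init__(self, vision_dim=1152, lm_dim=2048, vocab=1_000):
        super().__init__()
        enc = lambda d: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True),
            num_layers=2)
        self.vision_encoder = enc(vision_dim)           # SigLIP-So400m stand-in
        self.projector = nn.Linear(vision_dim, lm_dim)  # map into the LM's token space
        self.text_embed = nn.Embedding(vocab, lm_dim)
        self.lm = enc(lm_dim)                           # Gemma-2B stand-in (no causal mask here)
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, patches, text_ids):
        # encode the image, project its tokens, and prepend them to the text tokens
        img_tok = self.projector(self.vision_encoder(patches))
        seq = torch.cat([img_tok, self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.lm(seq))

vlm = ToyVLM()
logits = vlm(torch.randn(1, 196, 1152), torch.randint(0, 1_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 212, 1000]): 196 image tokens + 16 text tokens
```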
The all-seeing project: Towards panoptic visual recognition and understanding of the open world
We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates …
GIVT: Generative infinite-vocabulary transformers
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate vector sequences with real-valued entries, instead of discrete tokens from a finite …
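The core departure GIVT makes from standard autoregressive transformers is the output head: instead of a softmax over a finite vocabulary, the hidden state parameterizes a continuous distribution over the next real-valued vector, trained by maximum likelihood. A minimal sketch of such a head, assuming a single diagonal Gaussian (the paper uses richer mixture models) and illustrative dimensions:

```python
import torch
import torch.nn as nn

class ContinuousHead(nn.Module):
    """Replaces the finite-vocabulary softmax with a Gaussian over the next vector."""
    def __init__(self, hidden=512, latent=32):
        super().__init__()
        self.mean = nn.Linear(hidden, latent)
        self.log_std = nn.Linear(hidden, latent)

    def nll(self, h, target):
        # negative log-likelihood of the next latent vector under N(mu, sigma^2)
        dist = torch.distributions.Normal(self.mean(h), self.log_std(h).exp())
        return -dist.log_prob(target).sum(-1).mean()

    def sample(self, h):
        # one autoregressive decoding step: draw the next real-valued vector
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp()).sample()

head = ContinuousHead()
h = torch.randn(4, 512)        # transformer hidden states
target = torch.randn(4, 32)    # ground-truth next latent vectors
loss = head.nll(h, target)     # train by maximum likelihood
next_vec = head.sample(h)      # sample during generation
```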
A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common …
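The design axis named here, video-first versus image-first encoding, comes down to where attention is allowed to span. A toy PyTorch sketch of the contrast, with shapes and layer counts chosen purely for illustration:

```python
import torch
import torch.nn as nn

dim, frames, tokens_per_frame = 256, 32, 49
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

x = torch.randn(2, frames, tokens_per_frame, dim)  # (batch, T, N, dim) patch tokens

# image-first: attention stays within each frame, then temporal mean-pooling
per_frame = encoder(x.flatten(0, 1))               # (B*T, N, dim)
image_first = per_frame.view(2, frames, tokens_per_frame, dim).mean(dim=(1, 2))

# video-first: one long sequence of T*N tokens, attention spans all frames
video_first = encoder(x.flatten(1, 2)).mean(dim=1)  # (B, T*N, dim) -> (B, dim)
```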
SILC: Improving vision-language pretraining with self-distillation
Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its …
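The recipe this entry builds on pairs the standard CLIP-style contrastive image-text loss with a self-distillation objective. Below is a compact sketch of both ingredients, assuming the usual symmetric InfoNCE formulation and an EMA teacher for the distillation half; the function names, temperature, and momentum value are illustrative, not SILC's exact settings.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # symmetric InfoNCE over cosine similarities; matched pairs lie on the diagonal
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # self-distillation half: the teacher is an exponential moving average of the
    # student; the student is separately trained to match the teacher's outputs
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```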