No" zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …
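Several entries in this list evaluate CLIP's "zero-shot" classification protocol. For reference, a minimal sketch of that protocol follows; the model checkpoint, prompt template, label set, and image path are illustrative assumptions and are not drawn from any of the cited papers.

```python
# Minimal sketch of CLIP-style zero-shot classification (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "aardvark"]              # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]
image = Image.open("example.jpg")                     # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image         # image-text similarity scores
probs = logits.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

The paper's "concept frequency" finding concerns how often each class concept appears in the pretraining data, not anything in this inference code.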
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless
interaction with arbitrary text prompts. However, its lack of spatial awareness makes it …
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
D Kurzendörfer, OB Mercea… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual zero-shot learning methods commonly build on features extracted from pre-
trained models, e.g., video or audio classification models. However, existing benchmarks …
Active data curation effectively distills large-scale multimodal models
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …
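As background for this entry, the following is a minimal sketch of a standard knowledge-distillation objective (KL divergence between temperature-softened teacher and student logits, mixed with a hard-label term); the temperature and mixing weight are illustrative defaults, not values from the cited work, which studies data curation rather than the loss itself.

```python
# Minimal sketch of a standard knowledge-distillation loss (illustrative only).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits standing in for real model outputs.
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```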
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning
Contrastive vision-language models like CLIP have gained popularity for their versatile
learned representations, applicable to various downstream tasks. Despite their successes in …
Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize
Learning representations that generalize under distribution shifts is critical for building
robust machine learning models. However, despite significant efforts in recent years …
In search of forgotten domain generalization
Out-of-Domain (OOD) generalization is the ability of a model trained on one or more
domains to generalize to unseen domains. In the ImageNet era of computer vision …
Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights
Severe data imbalance naturally exists among web-scale vision-language datasets. Despite
this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance …
On the Comparison between Multi-modal and Single-modal Contrastive Learning
Multi-modal contrastive learning with language supervision has presented a paradigm shift
in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive …
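The contrastive-learning entries above build on the symmetric image-text contrastive (InfoNCE) objective popularized by CLIP. A minimal sketch is given below, assuming embeddings from separate image and text encoders; the variable names and temperature value are illustrative, not taken from the cited papers.

```python
# Minimal sketch of the symmetric contrastive (InfoNCE) loss used by CLIP-style models.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```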
Using drawings and deep neural networks to characterize the building blocks of human visual similarity
K Mukherjee, TT Rogers - Memory & Cognition, 2024 - Springer
Early in life and without special training, human beings discern resemblance between
abstract visual stimuli, such as drawings, and the real-world objects they represent. We used …