No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - The Thirty-eighth …, 2024 - openreview.net
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

M Wysoczańska, O Siméoni, M Ramamonjisoa… - … on Computer Vision, 2025 - Springer
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless
interaction with arbitrary text prompts. However, its lack of spatial awareness makes it …

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

D Kurzendörfer, OB Mercea… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual zero-shot learning methods commonly build on features extracted from pre-
trained models, e.g., video or audio classification models. However, existing benchmarks …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

S Schrodi, DT Hoffmann, M Argus, V Fischer… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive vision-language models like CLIP have gained popularity for their versatile
learned representations, applicable to various downstream tasks. Despite their successes in …

Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize

T Zhang, C Zhao, G Chen, Y Jiang, F Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Learning representations that generalize under distribution shifts is critical for building
robust machine learning models. However, despite significant efforts in recent years …

In search of forgotten domain generalization

P Mayilvahanan, RS Zimmermann, T Wiedemer… - arXiv preprint arXiv …, 2024 - arxiv.org
Out-of-Domain (OOD) generalization is the ability of a model trained on one or more
domains to generalize to unseen domains. In the ImageNet era of computer vision …

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

X Wen, B Zhao, Y Chen, J Pang, X Qi - arXiv preprint arXiv:2405.21070, 2024 - arxiv.org
Severe data imbalance naturally exists among web-scale vision-language datasets. Despite
this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance …

On the Comparison between Multi-modal and Single-modal Contrastive Learning

W Huang, A Han, Y Chen, Y Cao, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal contrastive learning with language supervision has presented a paradigm shift
in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive …

Using drawings and deep neural networks to characterize the building blocks of human visual similarity

K Mukherjee, TT Rogers - Memory & Cognition, 2024 - Springer
Early in life and without special training, human beings discern resemblance between
abstract visual stimuli, such as drawings, and the real-world objects they represent. We used …