“This is my unicorn, Fluffy”: Personalizing frozen vision-language representations
Abstract Large Vision & Language models pretrained on web-scale data provide
representations that are invaluable for numerous V&L problems. However, it is unclear how …
Exploring vision-language models for imbalanced learning
Vision-language models (VLMs) that use contrastive language-image pre-training have
shown promising zero-shot classification performance. However, their performance on …
In defense of lazy visual grounding for open-vocabulary semantic segmentation
Abstract We present Lazy Visual Grounding for open-vocabulary semantic segmentation,
which decouples unsupervised object mask discovery from object grounding. Plenty of the …
Generalized logit adjustment: Calibrating fine-tuned models by removing label bias in foundation models
Foundation models like CLIP allow zero-shot transfer on various tasks without additional
training data. Yet, the zero-shot performance is less competitive than a fully supervised one …
Local and global logit adjustments for long-tailed learning
Multi-expert ensemble models for long-tailed learning typically either learn diverse
generalists from the whole dataset or aggregate specialists on different subsets. However …
Exploring visual interpretability for contrastive language-image pre-training
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily
available supervision of natural language. It improves the performance of downstream vision …
Parameter-efficient long-tailed recognition
The" pre-training and fine-tuning" paradigm in addressing long-tailed recognition tasks has
sparked significant interest since the emergence of large vision-language models like the …
Improving zero-shot models with label distribution priors
Labeling large image datasets with attributes such as facial age or object type is tedious and
sometimes infeasible. Supervised machine learning methods provide a highly accurate …
ImaginaryNet: Learning object detectors without real images and annotations
Without the demand of training in reality, humans can easily detect a known concept simply
based on its language description. Empowering deep learning with this ability undoubtedly …