Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2024 - dl.acm.org
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the surprising development of pre-trained visual foundation …

“This is my unicorn, Fluffy”: Personalizing frozen vision-language representations

N Cohen, R Gal, EA Meirom, G Chechik… - European conference on …, 2022 - Springer
Large Vision & Language models pretrained on web-scale data provide
representations that are invaluable for numerous V&L problems. However, it is unclear how …

Exploring vision-language models for imbalanced learning

Y Wang, Z Yu, J Wang, Q Heng, H Chen, W Ye… - International Journal of …, 2024 - Springer
Vision-language models (VLMs) that use contrastive language-image pre-training have
shown promising zero-shot classification performance. However, their performance on …
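
Several of the entries in this list build on the same baseline recipe for CLIP-style zero-shot classification: embed the image and one text prompt per class, then rank classes by image-text similarity. The sketch below illustrates only that generic baseline, not any individual paper's method; it assumes the Hugging Face transformers CLIP classes, the openai/clip-vit-base-patch32 checkpoint, a hypothetical local file photo.jpg, and an arbitrary label set.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (assumed: openai/clip-vit-base-patch32).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image file and label set, for illustration only.
image = Image.open("photo.jpg")
labels = ["cat", "dog", "horse"]
prompts = [f"a photo of a {c}" for c in labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds image-text similarity scores, one column per prompt.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```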

In defense of lazy visual grounding for open-vocabulary semantic segmentation

D Kang, M Cho - European Conference on Computer Vision, 2025 - Springer
We present Lazy Visual Grounding for open-vocabulary semantic segmentation,
which decouples unsupervised object mask discovery from object grounding. Plenty of the …

Generalized logit adjustment: Calibrating fine-tuned models by removing label bias in foundation models

B Zhu, K Tang, Q Sun, H Zhang - Advances in Neural …, 2024 - proceedings.neurips.cc
Foundation models like CLIP allow zero-shot transfer on various tasks without additional
training data. Yet, the zero-shot performance is less competitive than a fully supervised one …

Local and global logit adjustments for long-tailed learning

Y Tao, J Sun, H Yang, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-expert ensemble models for long-tailed learning typically either learn diverse
generalists from the whole dataset or aggregate specialists on different subsets. However …
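
The two logit-adjustment entries above share a common post-hoc idea: a skewed class prior shows up as an additive log-prior term in the logits, so subtracting a (scaled) log prior at inference rebalances head and tail classes. The sketch below shows only that generic adjustment, not the specific calibration of either paper; the prior estimates and the scaling factor tau are assumed inputs.

```python
import numpy as np

def adjust_logits(logits, class_priors, tau=1.0):
    # Generic post-hoc logit adjustment: subtract tau * log(prior) per class.
    #   logits:       (N, C) raw scores from any classifier (e.g. zero-shot CLIP).
    #   class_priors: (C,) estimated label frequencies, summing to 1.
    #   tau:          scaling factor; tau = 0 recovers the unadjusted classifier.
    return logits - tau * np.log(class_priors + 1e-12)

# Toy example: a head-heavy prior favors class 0; after adjustment,
# the tail class wins when the raw scores are close.
logits = np.array([[2.0, 1.9]])
priors = np.array([0.9, 0.1])
print(np.argmax(logits, axis=1))                         # [0]
print(np.argmax(adjust_logits(logits, priors), axis=1))  # [1]
```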

Exploring visual interpretability for contrastive language-image pre-training

Y Li, H Wang, Y Duan, H Xu, X Li - arXiv preprint arXiv:2209.07046, 2022 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily
available supervision of natural language. It improves the performance of downstream vision …

Parameter-efficient long-tailed recognition

JX Shi, T Wei, Z Zhou, XY Han, JJ Shao… - arXiv preprint arXiv …, 2023 - arxiv.org
The" pre-training and fine-tuning" paradigm in addressing long-tailed recognition tasks has
sparked significant interest since the emergence of large vision-language models like the …

Improving zero-shot models with label distribution priors

J Kahana, N Cohen, Y Hoshen - arXiv preprint arXiv:2212.00784, 2022 - arxiv.org
Labeling large image datasets with attributes such as facial age or object type is tedious and
sometimes infeasible. Supervised machine learning methods provide a highly accurate …

ImaginaryNet: Learning object detectors without real images and annotations

M Ni, Z Huang, K Feng, W Zuo - arXiv preprint arXiv:2210.06886, 2022 - arxiv.org
Without requiring any training in the real world, humans can easily detect a known concept
based solely on its language description. Empowering deep learning with this ability undoubtedly …