Finetune like you pretrain: Improved finetuning of zero-shot vision models

S Goyal, A Kumar, S Garg, Z Kolter… - Proceedings of the …, 2023 - openaccess.thecvf.com
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety
of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have …
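For context on this line of work: the CLIP-style symmetric contrastive objective that such finetuning builds on can be sketched in a few lines of numpy. This is an illustrative sketch of the generic loss, not the paper's exact implementation, and all names are hypothetical:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    averages the image-to-text and text-to-image cross-entropy terms.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        # numerically stable log-softmax; target for row i is column i
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# toy batch: 4 paired embeddings of dimension 8
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)        # perfectly paired batch
shuffled = clip_contrastive_loss(emb, emb[::-1]) # mismatched pairing
```

With identical pairs the diagonal dominates each softmax and the loss is near its minimum; shuffling the pairing moves the strong matches off the diagonal and the loss grows, which is the signal the contrastive objective trains on.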

CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not

A Sain, AK Bhunia, PN Chowdhury… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We
are largely inspired by recent advances in foundation models and the unparalleled …

DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models

G Zheng, B Yang, J Tang, HY Zhou… - Advances in Neural …, 2023 - proceedings.neurips.cc
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans.
Recently, large language models (LLMs) have made remarkable strides in such multi-step …

What makes good examples for visual in-context learning?

Y Zhang, K Zhou, Z Liu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Large vision models with billions of parameters and trained on broad data have great
potential in numerous downstream applications. However, these models are typically difficult …

Cheap and quick: Efficient vision-language instruction tuning for large language models

G Luo, Y Zhou, T Ren, S Chen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recently, interest has grown in extending the multimodal capability of large language
models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next …

Pic2Word: Mapping pictures to words for zero-shot composed image retrieval

K Saito, K Sohn, X Zhang, CL Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …
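The underlying idea in this line of work, treating the query image as if it were a word inside the text query, can be sketched generically. The linear projection below is a hypothetical stand-in for a learned image-to-word mapping, and all names are illustrative:

```python
import numpy as np

def compose_query(image_emb, text_token_embs, placeholder_idx, mapping_W):
    """Map an image embedding into word-embedding space and splice it
    into a tokenized text query at a placeholder position (e.g. "*")."""
    pseudo_word = image_emb @ mapping_W  # hypothetical learned projection
    tokens = text_token_embs.copy()
    tokens[placeholder_idx] = pseudo_word
    return tokens

d_img, d_word, seq_len = 6, 4, 5
rng = np.random.default_rng(2)
W = rng.normal(size=(d_img, d_word))    # stands in for trained weights
img = rng.normal(size=d_img)            # query image embedding
text = np.zeros((seq_len, d_word))      # toy token embeddings for the query
composed = compose_query(img, text, placeholder_idx=2, mapping_W=W)
```

The composed token sequence can then be fed to a frozen text encoder, so retrieval needs no supervised (image, text, target) triplets.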

ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models

C Li, H Liu, L Li, P Zhang, J Aneja… - Advances in …, 2022 - proceedings.neurips.cc
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …

CLIPN for zero-shot OOD detection: Teaching CLIP to say no

H Wang, Y Li, H Yao, X Li - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID)
dataset so that it can identify whether input images come from unknown classes. Considerable efforts …
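Setting aside CLIPN's specific mechanism, a common zero-shot baseline in this setting scores OOD-ness by the maximum softmax probability over image-text similarities. A minimal numpy sketch with deterministic toy embeddings (illustrative only; names are hypothetical):

```python
import numpy as np

def msp_ood_score(image_emb, class_text_embs, temperature=0.07):
    """Maximum-softmax-probability score for zero-shot OOD detection.

    Higher score means more confidently in-distribution; thresholding
    the score separates ID from OOD inputs.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img / temperature
    sims = sims - sims.max()                 # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return probs.max()

classes = np.eye(5, 16)                      # 5 orthogonal class text embeddings
id_image = classes[2]                        # aligns exactly with class 2
ood_image = np.ones(16)                      # equally similar to every class
id_score = msp_ood_score(id_image, classes)
ood_score = msp_ood_score(ood_image, classes)
```

The ID image yields a peaked softmax (score near 1), while the ambiguous image yields a flat one (score 1/5 here), so a single threshold on the score acts as the detector.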

Iterative prompt learning for unsupervised backlit image enhancement

Z Liang, C Li, S Zhou, R Feng… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-
LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel …

Unified vision and language prompt learning

Y Zang, W Li, K Zhou, C Huang, CC Loy - arXiv preprint arXiv:2210.07225, 2022 - arxiv.org
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a
small number of parameters in a model's input space, has become a trend in the vision …
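The mechanism this family of methods shares, prepending a small set of learnable "soft prompt" vectors to a frozen model's input embeddings and optimizing only those, can be sketched as follows. A minimal illustrative numpy sketch, with all names hypothetical:

```python
import numpy as np

def prepend_soft_prompt(token_embs, prompt):
    """Prepend learnable prompt vectors to a sequence of token embeddings.

    During tuning, only `prompt` would receive gradient updates; the token
    embeddings and the model behind them stay frozen.
    """
    return np.concatenate([prompt, token_embs], axis=0)

d_model, n_prompt, seq_len = 8, 4, 10
rng = np.random.default_rng(0)
prompt = 0.02 * rng.normal(size=(n_prompt, d_model))  # the only trainable params
tokens = rng.normal(size=(seq_len, d_model))          # frozen input embeddings
extended = prepend_soft_prompt(tokens, prompt)        # shape (14, 8)
```

Here only 4 x 8 = 32 parameters are trainable, versus the full model's millions, which is where the paradigm's parameter and data efficiency comes from.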