Finetune like you pretrain: Improved finetuning of zero-shot vision models
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety
of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have …
CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We
are largely inspired by recent advances in foundation models and the unparalleled …
DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models
G Zheng, B Yang, J Tang, HY Zhou… - Advances in Neural …, 2023 - proceedings.neurips.cc
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans.
Recently, large language models (LLMs) have made remarkable strides in such multi-step …
What makes good examples for visual in-context learning?
Large vision models with billions of parameters and trained on broad data have great
potential in numerous downstream applications. However, these models are typically difficult …
Cheap and quick: Efficient vision-language instruction tuning for large language models
Recently, there has been growing interest in extending the multimodal capabilities of large
language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next …
Pic2Word: Mapping pictures to words for zero-shot composed image retrieval
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …
ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …
CLIPN for zero-shot OOD detection: Teaching CLIP to say no
Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID)
dataset to classify whether input images come from unknown classes. Considerable efforts …
Iterative prompt learning for unsupervised backlit image enhancement
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-
LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel …
Unified vision and language prompt learning
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a
small number of parameters in a model's input space, has become a trend in the vision …