Finetune like you pretrain: Improved finetuning of zero-shot vision models
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety
of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have …
CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We
are largely inspired by recent advances in foundation models and the unparalleled …
DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models
G Zheng, B Yang, J Tang, HY Zhou… - Advances in Neural …, 2023 - proceedings.neurips.cc
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans.
Recently, large language models (LLMs) have made remarkable strides in such multi-step …
What makes good examples for visual in-context learning?
Large vision models with billions of parameters and trained on broad data have great
potential in numerous downstream applications. However, these models are typically difficult …
Cheap and quick: Efficient vision-language instruction tuning for large language models
Recently, there has been growing interest in extending the multimodal capabilities of large
language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next …
Pic2Word: Mapping pictures to words for zero-shot composed image retrieval
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …
ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …
CLIPN for zero-shot OOD detection: Teaching CLIP to say no
Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID)
dataset to classify whether input images come from unknown classes. Considerable efforts …
Iterative prompt learning for unsupervised backlit image enhancement
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-
LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel …
Unified vision and language prompt learning
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a
small number of parameters in a model's input space, has become a trend in the vision …