Learning mask-aware CLIP representations for zero-shot segmentation

S Jiao, Y Wei, Y Wang, Y Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recently, pre-trained vision-language models have been increasingly used to tackle the
challenging zero-shot segmentation task. Typical solutions follow the paradigm of first …

Preventing zero-shot transfer degradation in continual learning of vision-language models

Z Zheng, M Ma, K Wang, Z Qin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Continual learning (CL) can help pre-trained vision-language models efficiently adapt to
new or under-trained data distributions without re-training. Nevertheless, during the …

CLIPood: Generalizing CLIP to out-of-distributions

Y Shu, X Guo, J Wu, X Wang… - … on Machine Learning, 2023 - proceedings.mlr.press
Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning. Contrastive …

PromptRestorer: A prompting image restoration method with degradation perception

C Wang, J Pan, W Wang, J Dong… - Advances in …, 2023 - proceedings.neurips.cc
We show that raw degradation features can effectively guide deep restoration models,
providing accurate degradation priors to facilitate better restoration. While networks that do …

Waffling around for performance: Visual classification with random words and broad concepts

K Roth, JM Kim, A Koepke, O Vinyals… - Proceedings of the …, 2023 - openaccess.thecvf.com
The visual classification performance of vision-language models such as CLIP has been
shown to benefit from additional semantic knowledge from large language models (LLMs) …

What can human sketches do for object detection?

PN Chowdhury, AK Bhunia, A Sain… - Proceedings of the …, 2023 - openaccess.thecvf.com
Sketches are highly expressive, inherently capturing subjective and fine-grained visual
cues. The exploration of such innate properties of human sketches has, however, been …

Octopus: Embodied vision-language programmer from environmental feedback

J Yang, Y Dong, S Liu, B Li, Z Wang, C Jiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large vision-language models (VLMs) have achieved substantial progress in multimodal
perception and reasoning. Furthermore, when seamlessly integrated into an embodied …

FLIP: Cross-domain face anti-spoofing with language guidance

K Srivatsan, M Naseer… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face
recognition systems deployed in security-critical applications. Existing FAS methods have …

SwapPrompt: Test-time prompt adaptation for vision-language models

X Ma, J Zhang, S Guo, W Xu - Advances in Neural …, 2024 - proceedings.neurips.cc
Test-time adaptation (TTA) is a special and practical setting in unsupervised domain
adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test …

ViewRefer: Grasp the multi-view knowledge for 3D visual grounding

Z Guo, Y Tang, R Zhang, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …