Learning mask-aware CLIP representations for zero-shot segmentation
Recently, pre-trained vision-language models have been increasingly used to tackle the
challenging zero-shot segmentation task. Typical solutions follow the paradigm of first …
Preventing zero-shot transfer degradation in continual learning of vision-language models
Continual learning (CL) can help pre-trained vision-language models efficiently adapt to
new or under-trained data distributions without re-training. Nevertheless, during the …
CLIPood: Generalizing CLIP to out-of-distributions
Abstract Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning. Contrastive …
PromptRestorer: A prompting image restoration method with degradation perception
We show that raw degradation features can effectively guide deep restoration models,
providing accurate degradation priors to facilitate better restoration. While networks that do …
Waffling around for performance: Visual classification with random words and broad concepts
The visual classification performance of vision-language models such as CLIP has been
shown to benefit from additional semantic knowledge from large language models (LLMs) …
What can human sketches do for object detection?
Sketches are highly expressive, inherently capturing subjective and fine-grained visual
cues. The exploration of such innate properties of human sketches has, however, been …
Octopus: Embodied vision-language programmer from environmental feedback
Large vision-language models (VLMs) have achieved substantial progress in multimodal
perception and reasoning. Furthermore, when seamlessly integrated into an embodied …
FLIP: Cross-domain face anti-spoofing with language guidance
K Srivatsan, M Naseer… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face
recognition systems deployed in security-critical applications. Existing FAS methods have …
SwapPrompt: Test-time prompt adaptation for vision-language models
Test-time adaptation (TTA) is a special and practical setting in unsupervised domain
adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test …
ViewRefer: Grasp the multi-view knowledge for 3D visual grounding
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …