Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural
networks (DNNs), and they usually train a separate DNN for each visual recognition task …
MomentDiff: Generative video moment retrieval from random to real
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …
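The snippet above only defines the moment-retrieval task. As an illustrative sketch of that task interface (not MomentDiff's actual generative method, which denoises random spans toward the real one), the following scores candidate temporal spans against a query embedding; the encoder features, feature dimension, and pooling are placeholder assumptions.

```python
import numpy as np

def retrieve_moment(frame_feats, query_feat, min_len=8, stride=4):
    """Score candidate temporal spans of an untrimmed video against a
    language-query embedding and return the best-matching (start, end).

    frame_feats: (T, D) per-frame features from some visual encoder (assumed).
    query_feat:  (D,) sentence embedding of the query (assumed).
    """
    T = frame_feats.shape[0]
    q = query_feat / np.linalg.norm(query_feat)
    best, best_score = (0, min(min_len, T)), -np.inf
    for start in range(0, T - min_len + 1, stride):
        for end in range(start + min_len, T + 1, stride):
            span = frame_feats[start:end].mean(axis=0)   # pool frames in the span
            span = span / np.linalg.norm(span)
            score = float(span @ q)                      # cosine similarity to query
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score

# Toy usage with random features (80 frames, D = 64).
rng = np.random.default_rng(0)
print(retrieve_moment(rng.normal(size=(80, 64)), rng.normal(size=64)))
```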
BadCLIP: Dual-embedding guided backdoor attack on multimodal contrastive learning
While existing backdoor attacks have successfully infected multimodal contrastive learning
models such as CLIP, they can be easily countered by specialized backdoor defenses for …
Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description without training on the triplet …
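A common training-free baseline for the ZS-CIR setup described above is to fuse the reference-image embedding with the modification-text embedding and rank the gallery by similarity. The sketch below assumes a shared CLIP-like embedding space and a simple convex combination; it is illustrative, not this paper's dual-stream method.

```python
import numpy as np

def compose_and_retrieve(ref_img_emb, text_emb, gallery_embs, alpha=0.5):
    """Training-free ZS-CIR sketch: fuse the reference-image and
    modification-text embeddings by a convex combination, then rank
    gallery images by cosine similarity. `alpha` balances the two
    modalities (an assumed hyperparameter)."""
    def l2(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    query = l2(alpha * l2(ref_img_emb) + (1.0 - alpha) * l2(text_emb))
    scores = l2(gallery_embs) @ query        # cosine similarity to each gallery image
    return np.argsort(-scores)               # gallery indices, best match first

# Toy usage: one composed query against a gallery of 100 images (512-d space).
rng = np.random.default_rng(1)
ranking = compose_and_retrieve(rng.normal(size=512),
                               rng.normal(size=512),
                               rng.normal(size=(100, 512)))
print(ranking[:5])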
Few-shot adaptation of multi-modal foundation models: A survey
F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer
Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …
CARZero: Cross-attention alignment for radiology zero-shot classification
The advancement of zero-shot learning in the medical domain has been driven forward by
pre-trained models on large-scale image-text pairs, focusing on image-text …
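To make "cross-attention alignment" concrete, here is a minimal sketch in which text tokens attend over image patch tokens to produce one image-text alignment score, then zero-shot classification picks the best-scoring class prompt. This is illustrative only, not CARZero's exact architecture; all shapes and encoders are assumptions.

```python
import numpy as np

def cross_attention_score(patch_tokens, text_tokens):
    """Text tokens act as queries over image patch tokens; the pooled
    attended feature is compared with the pooled text feature to yield
    a single alignment score (a sketch, not the paper's method)."""
    d = text_tokens.shape[-1]
    attn = text_tokens @ patch_tokens.T / np.sqrt(d)       # (n_text, n_patch) logits
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)               # softmax over patches
    attended = attn @ patch_tokens                         # text-conditioned image feature
    a = attended.mean(axis=0)                              # mean-pool attended features
    t = text_tokens.mean(axis=0)                           # mean-pool text tokens
    return float(a @ t / (np.linalg.norm(a) * np.linalg.norm(t)))

# Zero-shot classification: score the image against one prompt per class.
rng = np.random.default_rng(2)
patches = rng.normal(size=(49, 256))                       # 7x7 patch grid (assumed)
prompts = [rng.normal(size=(8, 256)) for _ in range(3)]    # 3 class prompts (assumed)
print(np.argmax([cross_attention_score(patches, p) for p in prompts]))
```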
Retrieval-enhanced contrastive vision-text models
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art
systems. While they excel at recognizing common generic concepts, they still struggle on …
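The retrieval-enhancement idea above can be sketched as follows: look up nearest neighbors in an external embedding memory and mix them into the query embedding, which can help on the fine-grained concepts the base model struggles with. The fusion scheme and weights here are assumptions, not the paper's design.

```python
import numpy as np

def retrieval_enhanced_embed(img_emb, memory_embs, k=5, fusion=0.3):
    """Sketch of retrieval enhancement: fetch the k nearest entries in
    an external embedding memory and blend their mean into the query
    embedding (assumed convex-combination fusion)."""
    def l2(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    sims = l2(memory_embs) @ l2(img_emb)            # cosine similarity to memory
    nn = np.argsort(-sims)[:k]                      # indices of k nearest neighbors
    retrieved = l2(memory_embs[nn]).mean(axis=0)    # pooled neighbor embedding
    return l2((1.0 - fusion) * l2(img_emb) + fusion * retrieved)

# Toy usage: enhance one 512-d embedding against a 10,000-entry memory.
rng = np.random.default_rng(3)
enhanced = retrieval_enhanced_embed(rng.normal(size=512),
                                    rng.normal(size=(10_000, 512)))
print(enhanced.shape)
```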
MoDE: CLIP Data Experts via Clustering
The success of contrastive language-image pretraining (CLIP) relies on the supervision from
the pairing between images and captions, which tends to be noisy in web-crawled data. We …
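The "data experts via clustering" idea can be sketched by partitioning caption embeddings with k-means and training one expert per shard, so each expert sees a more coherent (and presumably less noisy) slice of the web data. This is a sketch of the partitioning step only, not the paper's training or ensembling recipe.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means over caption embeddings; each cluster defines the
    training shard for one 'data expert'."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (Euclidean).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = np.argmin(dists, axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

# Toy usage: split 1,000 caption embeddings into 4 expert shards.
rng = np.random.default_rng(4)
centroids, assign = kmeans(rng.normal(size=(1_000, 128)), k=4)
print(np.bincount(assign))    # size of each expert's shard
```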
Heterogeneous Contrastive Learning for Foundation Models and Beyond
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …
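Since the snippet centers on contrastive self-supervised learning over heterogeneous data, here is the standard CLIP-style symmetric InfoNCE objective over a batch of paired embeddings from two modalities. This is the generic loss, stated as a minimal NumPy sketch rather than any one paper's formulation.

```python
import numpy as np

def symmetric_contrastive_loss(z_a, z_b, temperature=0.07):
    """CLIP-style symmetric InfoNCE: matched cross-modal pairs sit on
    the diagonal of the similarity matrix and act as positives; all
    other in-batch pairs act as negatives."""
    def l2(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    logits = l2(z_a) @ l2(z_b).T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(z_a))                        # positives on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=-1, keepdims=True)        # stabilize the softmax
        logp = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the two retrieval directions (a->b and b->a).
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage: a batch of 16 paired embeddings from two modalities (256-d).
rng = np.random.default_rng(5)
print(symmetric_contrastive_loss(rng.normal(size=(16, 256)),
                                 rng.normal(size=(16, 256))))
```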
Anchor-based Robust Finetuning of Vision-Language Models
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD)
generalization. We address two types of OOD generalization, i.e., i) domain shift such as …
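One common way to finetune without hurting OOD generalization, in the spirit of the anchor idea above, is to penalize drift of the finetuned model's features away from the frozen pretrained (anchor) features. The distance and weighting below are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def anchored_finetune_loss(task_loss, feats_finetuned, feats_anchor, lam=0.5):
    """Sketch of anchor-regularized finetuning: add a penalty keeping
    the finetuned features close to the frozen pretrained features, so
    in-distribution task learning does not erase the OOD behavior of
    the original embedding space."""
    drift = np.mean(np.sum((feats_finetuned - feats_anchor) ** 2, axis=-1))
    return task_loss + lam * drift

# Toy usage: a batch of 32 features drifts slightly during finetuning.
rng = np.random.default_rng(6)
anchor = rng.normal(size=(32, 512))
tuned = anchor + 0.1 * rng.normal(size=(32, 512))
print(anchored_finetune_loss(task_loss=0.8,
                             feats_finetuned=tuned,
                             feats_anchor=anchor))
```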