Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task …

Momentdiff: Generative video moment retrieval from random to real

P Li, CW Xie, H Xie, L Zhao, L Zhang… - Advances in neural …, 2024 - proceedings.neurips.cc
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …

Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning

S Liang, M Zhu, A Liu, B Wu, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for …

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description, without training on the triplet …

Few-shot adaptation of multi-modal foundation models: A survey

F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …

Carzero: Cross-attention alignment for radiology zero-shot classification

H Lai, Q Yao, Z Jiang, R Wang, Z He… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advancement of Zero-Shot Learning in the medical domain has been driven forward by using models pre-trained on large-scale image-text pairs, focusing on image-text …

Retrieval-enhanced contrastive vision-text models

A Iscen, M Caron, A Fathi, C Schmid - arXiv preprint arXiv:2306.07196, 2023 - arxiv.org
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art
systems. While they excel at recognizing common generic concepts, they still struggle on …

MoDE: CLIP Data Experts via Clustering

J Ma, PY Huang, S Xie, SW Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We …

Heterogeneous Contrastive Learning for Foundation Models and Beyond

L Zheng, B Jing, Z Li, H Tong, J He - Proceedings of the 30th ACM …, 2024 - dl.acm.org
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …

Anchor-based Robust Finetuning of Vision-Language Models

J Han, Z Lin, Z Sun, Y Gao, K Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift, such as …