Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task …

Momentdiff: Generative video moment retrieval from random to real

P Li, CW Xie, H Xie, L Zhao, L Zhang… - Advances in neural …, 2024 - proceedings.neurips.cc
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …

Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning

S Liang, M Zhu, A Liu, B Wu, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for …

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description, without training on the triplet …

Few-shot adaptation of multi-modal foundation models: A survey

F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …

Carzero: Cross-attention alignment for radiology zero-shot classification

H Lai, Q Yao, Z Jiang, R Wang, Z He… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advancement of Zero-Shot Learning in the medical domain has been driven forward by using models pre-trained on large-scale image-text pairs, focusing on image-text …

Retrieval-enhanced contrastive vision-text models

A Iscen, M Caron, A Fathi, C Schmid - arXiv preprint arXiv:2306.07196, 2023 - arxiv.org
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art
systems. While they excel at recognizing common generic concepts, they still struggle on …

MoDE: CLIP Data Experts via Clustering

J Ma, PY Huang, S Xie, SW Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We …

Heterogeneous Contrastive Learning for Foundation Models and Beyond

L Zheng, B Jing, Z Li, H Tong, J He - Proceedings of the 30th ACM …, 2024 - dl.acm.org
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …

Anchor-based Robust Finetuning of Vision-Language Models

J Han, Z Lin, Z Sun, Y Gao, K Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift, such as …