Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description without training on the triplet …

Relative representations enable zero-shot latent space communication

L Moschella, V Maiorca, M Fumero, A Norelli… - arXiv preprint arXiv …, 2022 - arxiv.org
Neural networks embed the geometric structure of a data manifold lying in a high-
dimensional space into latent representations. Ideally, the distribution of the data points in …

RA-CLIP: Retrieval augmented contrastive language-image pre-training

CW Xie, S Sun, X Xiong, Y Zheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention
for its impressive zero-shot recognition performance on different downstream tasks …

Composed image retrieval with text feedback via multi-grained uncertainty regularization

Y Chen, Z Zheng, W Ji, L Qu, TS Chua - arXiv preprint arXiv:2211.07394, 2022 - arxiv.org
We investigate composed image retrieval with text feedback. Users gradually look for the
target of interest by moving from coarse to fine-grained feedback. However, existing …

Latent space translation via semantic alignment

V Maiorca, L Moschella, A Norelli… - Advances in …, 2024 - proceedings.neurips.cc
While different neural models often exhibit latent spaces that are alike when exposed to
semantically related data, this intrinsic similarity is not always immediately discernible …

Understanding shared speech-text representations

G Wang, K Kastner, A Bapna, Z Chen… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Recently, a number of approaches to train speech models by incorporating text into end-to-
end models have been developed, with Maestro advancing state-of-the-art automatic …

Boosting visual-language models by exploiting hard samples

H Wang, M Huang, R Huang, L Hong, H Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has become the standard for learning
cross-modal representations between images and text. Efforts to improve its capabilities …

Do Vision and Language Encoders Represent the World Similarly?

M Maniparambil, R Akshulakov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Aligned text-image encoders such as CLIP have become the de facto model for vision-
language tasks. Furthermore, modality-specific encoders achieve impressive performance …

Zero-Shot Continuous Prompt Transfer: Generalizing Task Semantics Across Language Models

Z Wu, Y Wu, L Mou - arXiv preprint arXiv:2310.01691, 2023 - arxiv.org
Prompt tuning in natural language processing (NLP) has become an increasingly popular
method for adapting large language models to specific tasks. However, the transferability of …

From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication

I Cannistraci, L Moschella, M Fumero… - arXiv preprint arXiv …, 2023 - arxiv.org
It has been observed that representations learned by distinct neural networks conceal
structural similarities when the models are trained under similar inductive biases. From a …