Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description, without training on the triplet …
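A common training-free baseline for ZS-CIR composes pre-trained image and text embeddings and ranks the gallery by cosine similarity. A minimal sketch, assuming CLIP-style encoder outputs as inputs and a simple additive composition (an illustrative baseline, not the cited paper's method):

```python
import torch
import torch.nn.functional as F

def zs_cir_retrieve(ref_img_feat, text_feat, gallery_feats, alpha=0.5):
    """Rank gallery images for a (reference image, modification text) query.
    Features are assumed to come from CLIP-style encoders sharing one
    embedding space; `alpha` balances the two modalities."""
    query = F.normalize(alpha * F.normalize(ref_img_feat, dim=-1)
                        + (1 - alpha) * F.normalize(text_feat, dim=-1), dim=-1)
    gallery = F.normalize(gallery_feats, dim=-1)   # (N, d)
    scores = gallery @ query                       # cosine similarity per image
    return scores.argsort(descending=True)         # ranked gallery indices

# toy usage with random features standing in for encoder outputs
d = 512
ranking = zs_cir_retrieve(torch.randn(d), torch.randn(d), torch.randn(1000, d))
```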
Relative representations enable zero-shot latent space communication
Neural networks embed the geometric structure of a data manifold lying in a high-
dimensional space into latent representations. Ideally, the distribution of the data points in …
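The core construction is simple: re-express each sample by its similarities to a fixed set of anchor samples, so that encoders whose latent spaces differ by an angle-preserving transformation produce comparable coordinates. A minimal sketch using cosine similarity:

```python
import torch
import torch.nn.functional as F

def relative_representation(embeddings, anchors):
    """Map absolute embeddings (N, d) to relative coordinates (N, k):
    each sample is described by its cosine similarity to k anchors.
    Cosine makes the coordinates invariant to rotations and rescalings
    of the latent space."""
    e = F.normalize(embeddings, dim=-1)
    a = F.normalize(anchors, dim=-1)
    return e @ a.T

# two "encoders" differing by a random rotation yield matching relative
# representations when corresponding anchors are used
x = torch.randn(8, 64)
q, _ = torch.linalg.qr(torch.randn(64, 64))   # random orthogonal map
rel_a = relative_representation(x, x[:3])
rel_b = relative_representation(x @ q, x[:3] @ q)
assert torch.allclose(rel_a, rel_b, atol=1e-4)
```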
RA-CLIP: Retrieval augmented contrastive language-image pre-training
Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention
for its impressive zero-shot recognition performance on different downstream tasks …
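For context, the CLIP objective that retrieval-augmented variants build on is a symmetric InfoNCE loss over a batch of matched image-text pairs; the retrieval-augmentation itself is not reproduced in this sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over matched image-text pairs: the
    i-th image should score highest with the i-th caption, and vice
    versa; all other in-batch pairs act as negatives."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits))         # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))
```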
Composed image retrieval with text feedback via multi-grained uncertainty regularization
We investigate composed image retrieval with text feedback. Users gradually look for the
target of interest by moving from coarse to fine-grained feedback. However, existing …
Latent space translation via semantic alignment
While different neural models often exhibit latent spaces that are alike when exposed to
semantically related data, this intrinsic similarity is not always immediately discernible …
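When two latent spaces are related by a near-orthogonal map, a handful of paired "anchor" samples suffices to estimate a translation between them; orthogonal Procrustes is the standard closed-form tool. A minimal sketch (the exact estimator in the paper may differ):

```python
import torch

def procrustes_align(src_anchors, tgt_anchors):
    """Closed-form orthogonal map R minimizing ||src @ R - tgt||_F,
    estimated from paired anchor embeddings (k, d) with k >= d."""
    u, _, vt = torch.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

# recover a hidden rotation between two spaces from anchor pairs,
# then translate held-out points across
q, _ = torch.linalg.qr(torch.randn(64, 64, dtype=torch.float64))
src = torch.randn(200, 64, dtype=torch.float64)
tgt = src @ q
r = procrustes_align(src[:128], tgt[:128])
assert torch.allclose(src[128:] @ r, tgt[128:], atol=1e-6)
```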
Understanding shared speech-text representations
Recently, a number of approaches to train speech models by incorporating text into end-to-
end models have been developed, with Maestro advancing state-of-the-art automatic …
Boosting visual-language models by exploiting hard samples
Contrastive Language-Image Pre-training (CLIP) has become the standard for learning
cross-modal representations between images and text. Efforts to improve its capabilities …
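One generic way to exploit hard samples in a CLIP-style objective is to keep, per query, only the most similar in-batch negatives in the contrastive denominator. This sketch illustrates that idea only and is not the cited paper's method:

```python
import torch
import torch.nn.functional as F

def hard_negative_clip_loss(img_feats, txt_feats, k=8, temperature=0.07):
    """CLIP-style loss where each image competes only against its matched
    caption and its k most similar (hardest) in-batch negative captions."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.T / temperature                       # (B, B)
    b = sim.size(0)
    pos = sim.diagonal()                                  # matched pairs
    neg = sim.masked_fill(torch.eye(b, dtype=torch.bool), float('-inf'))
    hard = neg.topk(k, dim=1).values                      # hardest negatives
    logits = torch.cat([pos.unsqueeze(1), hard], dim=1)   # (B, 1+k)
    targets = torch.zeros(b, dtype=torch.long)            # positive at index 0
    return F.cross_entropy(logits, targets)

loss = hard_negative_clip_loss(torch.randn(32, 512), torch.randn(32, 512))
```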
Do Vision and Language Encoders Represent the World Similarly?
Aligned text-image encoders such as CLIP have become the de-facto model for vision-
language tasks. Furthermore, modality-specific encoders achieve impressive performances …
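Questions like this are typically probed with representation-similarity measures; linear Centered Kernel Alignment (CKA) is a common choice. A minimal sketch comparing two encoders' features on the same inputs (random stand-ins below):

```python
import torch

def linear_cka(x, y):
    """Linear CKA between feature matrices (N, d1) and (N, d2) computed
    on the same N inputs. The score is invariant to orthogonal
    transformations and isotropic scaling of either representation;
    1.0 indicates a perfect match up to such transformations."""
    x = x - x.mean(dim=0)
    y = y - y.mean(dim=0)
    hsic = ((x.T @ y) ** 2).sum()
    norm_x = ((x.T @ x) ** 2).sum().sqrt()
    norm_y = ((y.T @ y) ** 2).sum().sqrt()
    return hsic / (norm_x * norm_y)

# e.g. compare a vision encoder's and a text encoder's features
# for N matched image-caption pairs
score = linear_cka(torch.randn(256, 768), torch.randn(256, 512))
```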
Zero-Shot Continuous Prompt Transfer: Generalizing Task Semantics Across Language Models
Prompt tuning in natural language processing (NLP) has become an increasingly popular
method for adapting large language models to specific tasks. However, the transferability of …
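Continuous (soft) prompt tuning learns a small matrix of "virtual token" embeddings prepended to the input embeddings while the language model stays frozen; transfer then asks whether those vectors carry over to another model. A minimal sketch of the tuning side, assuming a generic embedding-level interface rather than any specific model API:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual-token embeddings prepended to the input; only
    these (prompt_len x d_model) parameters are trained, while the
    backbone language model's weights stay frozen."""
    def __init__(self, prompt_len, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds):                    # (B, T, d_model)
        b = token_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # (B, P+T, d_model)

soft = SoftPrompt(prompt_len=20, d_model=768)
out = soft(torch.randn(4, 16, 768))                     # (4, 36, 768)
```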
From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication
It has been observed that representations learned by distinct neural networks conceal
structural similarities when the models are trained under similar inductive biases. From a …
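Building on relative representations, one reading of a "product of invariances" is to compute anchor similarities under several functions, each invariant to a different transformation class, and aggregate them; this is an illustrative construction under that assumption, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def multi_invariance_representation(embeddings, anchors):
    """Concatenate relative representations computed with different
    similarity functions, each conferring invariance to a different
    family of latent-space transformations (cosine: rotations and
    rescalings; negative Euclidean distance: rotations and
    translations). Illustrative aggregation only."""
    cos = F.normalize(embeddings, dim=-1) @ F.normalize(anchors, dim=-1).T
    dist = -torch.cdist(embeddings, anchors)
    return torch.cat([cos, dist], dim=-1)     # (N, 2k) combined coordinates

rep = multi_invariance_representation(torch.randn(8, 64), torch.randn(5, 64))
```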