Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description, without training on the triplet …
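A common training-free baseline for ZS-CIR composes pre-trained image and text embeddings and ranks the gallery by cosine similarity. A minimal sketch, assuming CLIP-style encoder outputs as inputs and a simple additive composition (an illustrative baseline, not the cited paper's method):

```python
import torch
import torch.nn.functional as F

def zs_cir_retrieve(ref_img_feat, text_feat, gallery_feats, alpha=0.5):
    """Rank gallery images for a (reference image, modification text) query.
    Features are assumed to come from CLIP-style encoders sharing one
    embedding space; `alpha` balances the two modalities."""
    query = F.normalize(alpha * F.normalize(ref_img_feat, dim=-1)
                        + (1 - alpha) * F.normalize(text_feat, dim=-1), dim=-1)
    gallery = F.normalize(gallery_feats, dim=-1)   # (N, d)
    scores = gallery @ query                       # cosine similarity per image
    return scores.argsort(descending=True)         # ranked gallery indices

# toy usage with random features standing in for encoder outputs
d = 512
ranking = zs_cir_retrieve(torch.randn(d), torch.randn(d), torch.randn(1000, d))
```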
Relative representations enable zero-shot latent space communication
Neural networks embed the geometric structure of a data manifold lying in a high-
dimensional space into latent representations. Ideally, the distribution of the data points in …
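The core construction is simple: re-express each sample by its similarities to a fixed set of anchor samples, so that encoders whose latent spaces differ by an angle-preserving transformation produce comparable coordinates. A minimal sketch using cosine similarity:

```python
import torch
import torch.nn.functional as F

def relative_representation(embeddings, anchors):
    """Map absolute embeddings (N, d) to relative coordinates (N, k):
    each sample is described by its cosine similarity to k anchors.
    Cosine makes the coordinates invariant to rotations and rescalings
    of the latent space."""
    e = F.normalize(embeddings, dim=-1)
    a = F.normalize(anchors, dim=-1)
    return e @ a.T

# two "encoders" differing by a random rotation yield matching relative
# representations when corresponding anchors are used
x = torch.randn(8, 64)
q, _ = torch.linalg.qr(torch.randn(64, 64))   # random orthogonal map
rel_a = relative_representation(x, x[:3])
rel_b = relative_representation(x @ q, x[:3] @ q)
assert torch.allclose(rel_a, rel_b, atol=1e-4)
```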
RA-CLIP: Retrieval augmented contrastive language-image pre-training
Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention
for its impressive zero-shot recognition performance on different downstream tasks …
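For context, the CLIP objective that retrieval-augmented variants build on is a symmetric InfoNCE loss over a batch of matched image-text pairs; the retrieval-augmentation itself is not reproduced in this sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over matched image-text pairs: the
    i-th image should score highest with the i-th caption, and vice
    versa; all other in-batch pairs act as negatives."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits))         # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))
```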
Composed image retrieval with text feedback via multi-grained uncertainty regularization
We investigate composed image retrieval with text feedback. Users gradually look for the
target of interest by moving from coarse to fine-grained feedback. However, existing …
Latent space translation via semantic alignment
While different neural models often exhibit latent spaces that are alike when exposed to
semantically related data, this intrinsic similarity is not always immediately discernible …
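When two latent spaces are related by a near-orthogonal map, a handful of paired "anchor" samples suffices to estimate a translation between them; orthogonal Procrustes is the standard closed-form tool. A minimal sketch (the exact estimator in the paper may differ):

```python
import torch

def procrustes_align(src_anchors, tgt_anchors):
    """Closed-form orthogonal map R minimizing ||src @ R - tgt||_F,
    estimated from paired anchor embeddings (k, d) with k >= d."""
    u, _, vt = torch.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

# recover a hidden rotation between two spaces from anchor pairs,
# then translate held-out points across
q, _ = torch.linalg.qr(torch.randn(64, 64, dtype=torch.float64))
src = torch.randn(200, 64, dtype=torch.float64)
tgt = src @ q
r = procrustes_align(src[:128], tgt[:128])
assert torch.allclose(src[128:] @ r, tgt[128:], atol=1e-6)
```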
Understanding shared speech-text representations
Recently, a number of approaches to train speech models by incorporating text into end-to-
end models have been developed, with Maestro advancing state-of-the-art automatic …
Boosting visual-language models by exploiting hard samples
Contrastive Language-Image Pre-training (CLIP) has become the standard for learning
cross-modal representations between images and text. Efforts to improve its capabilities …
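One generic way to exploit hard samples in a CLIP-style objective is to keep, per query, only the most similar in-batch negatives in the contrastive denominator. This sketch illustrates that idea only and is not the cited paper's method:

```python
import torch
import torch.nn.functional as F

def hard_negative_clip_loss(img_feats, txt_feats, k=8, temperature=0.07):
    """CLIP-style loss where each image competes only against its matched
    caption and its k most similar (hardest) in-batch negative captions."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.T / temperature                       # (B, B)
    b = sim.size(0)
    pos = sim.diagonal()                                  # matched pairs
    neg = sim.masked_fill(torch.eye(b, dtype=torch.bool), float('-inf'))
    hard = neg.topk(k, dim=1).values                      # hardest negatives
    logits = torch.cat([pos.unsqueeze(1), hard], dim=1)   # (B, 1+k)
    targets = torch.zeros(b, dtype=torch.long)            # positive at index 0
    return F.cross_entropy(logits, targets)

loss = hard_negative_clip_loss(torch.randn(32, 512), torch.randn(32, 512))
```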
Do Vision and Language Encoders Represent the World Similarly?
Aligned text-image encoders such as CLIP have become the de-facto model for vision-
language tasks. Furthermore, modality-specific encoders achieve impressive performances …
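Questions like this are typically probed with representation-similarity measures; linear Centered Kernel Alignment (CKA) is a common choice. A minimal sketch comparing two encoders' features on the same inputs (random stand-ins below):

```python
import torch

def linear_cka(x, y):
    """Linear CKA between feature matrices (N, d1) and (N, d2) computed
    on the same N inputs. The score is invariant to orthogonal
    transformations and isotropic scaling of either representation;
    1.0 indicates a perfect match up to such transformations."""
    x = x - x.mean(dim=0)
    y = y - y.mean(dim=0)
    hsic = ((x.T @ y) ** 2).sum()
    norm_x = ((x.T @ x) ** 2).sum().sqrt()
    norm_y = ((y.T @ y) ** 2).sum().sqrt()
    return hsic / (norm_x * norm_y)

# e.g. compare a vision encoder's and a text encoder's features
# for N matched image-caption pairs
score = linear_cka(torch.randn(256, 768), torch.randn(256, 512))
```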
Zero-Shot Continuous Prompt Transfer: Generalizing Task Semantics Across Language Models
Prompt tuning in natural language processing (NLP) has become an increasingly popular
method for adapting large language models to specific tasks. However, the transferability of …
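Continuous (soft) prompt tuning learns a small matrix of "virtual token" embeddings prepended to the input embeddings while the language model stays frozen; transfer then asks whether those vectors carry over to another model. A minimal sketch of the tuning side, assuming a generic embedding-level interface rather than any specific model API:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual-token embeddings prepended to the input; only
    these (prompt_len x d_model) parameters are trained, while the
    backbone language model's weights stay frozen."""
    def __init__(self, prompt_len, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds):                    # (B, T, d_model)
        b = token_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # (B, P+T, d_model)

soft = SoftPrompt(prompt_len=20, d_model=768)
out = soft(torch.randn(4, 16, 768))                     # (4, 36, 768)
```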
From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication
It has been observed that representations learned by distinct neural networks conceal
structural similarities when the models are trained under similar inductive biases. From a …
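Building on relative representations, one reading of a "product of invariances" is to compute anchor similarities under several functions, each invariant to a different transformation class, and aggregate them; this is an illustrative construction under that assumption, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def multi_invariance_representation(embeddings, anchors):
    """Concatenate relative representations computed with different
    similarity functions, each conferring invariance to a different
    family of latent-space transformations (cosine: rotations and
    rescalings; negative Euclidean distance: rotations and
    translations). Illustrative aggregation only."""
    cos = F.normalize(embeddings, dim=-1) @ F.normalize(anchors, dim=-1).T
    dist = -torch.cdist(embeddings, anchors)
    return torch.cat([cos, dist], dim=-1)     # (N, 2k) combined coordinates

rep = multi_invariance_representation(torch.randn(8, 64), torch.randn(5, 64))
```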