Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural
networks (DNNs), and they usually train a separate DNN for each visual recognition task …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Large-scale domain-specific pretraining for biomedical vision-language processing

S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …
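(For context: CLIP-style contrastive pretraining optimizes a symmetric InfoNCE objective over matched image-text pairs in a batch. A minimal PyTorch sketch of that objective follows; the tensor names and temperature value are illustrative assumptions, not code from the paper.)

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings from separate encoders.
    The i-th image and i-th text form a positive pair; every other
    pairing in the batch serves as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```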

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Cross-modal retrieval: a systematic review of methods and future directions

F Li, L Zhu, T Wang, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users demanding access to data from various …

A simple framework for text-supervised semantic segmentation

M Yi, Q Cui, H Wu, C Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-supervised semantic segmentation is a novel research topic that allows semantic
segments to emerge with image-text contrasting. However, pioneering methods could be …

Domain prompt learning with quaternion networks

Q Cao, Z Xu, Y Chen, C Ma… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Prompt learning has emerged as an effective and data-efficient technique in large Vision-
Language Models (VLMs). However, when adapting VLMs to specialized domains such as …
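(For reference: prompt learning in this sense typically replaces a hand-written prompt such as "a photo of a {class}" with learnable context vectors prepended to frozen class-name embeddings, and only those vectors are trained. A minimal CoOp-style sketch follows; the class names, shapes, and defaults are illustrative assumptions, not this paper's quaternion-network method.)

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt learner: a few trainable context vectors are
    prepended to each frozen class-name embedding. Only the context
    vectors receive gradients; the VLM's encoders stay frozen."""

    def __init__(self, class_name_embeds, n_ctx=16, embed_dim=512):
        super().__init__()
        # Shared learnable context, initialized with small random values.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # class_name_embeds: (n_cls, n_name_tokens, embed_dim), taken
        # from the frozen token-embedding table; registered as a buffer
        # so it is saved with the module but never trained.
        self.register_buffer("class_embeds", class_name_embeds)

    def forward(self):
        n_cls = self.class_embeds.size(0)
        # (n_cls, n_ctx, dim): the same context repeated for each class.
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Concatenate context before each class-name embedding, giving
        # the token sequence fed to the frozen text encoder.
        return torch.cat([ctx, self.class_embeds], dim=1)
```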

Dynamic contrastive distillation for image-text retrieval

J Rao, L Ding, S Qi, M Fang, Y Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The recent advancement in vision-and-language pretraining (VLP) has significantly
improved the performance of cross-modal image-text retrieval (ITR) systems. However, the …
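(As background: contrastive distillation for retrieval usually trains a small student to match the image-text similarity distribution of a large VLP teacher. The generic sketch below uses fixed temperatures, whereas the paper's "dynamic" variant adapts the objective during training; all names and values here are illustrative assumptions.)

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_img, student_txt,
                                  teacher_img, teacher_txt,
                                  tau_s=0.05, tau_t=0.05):
    """Match the student's image-to-text similarity distribution over a
    batch to the teacher's, via KL divergence on softened similarities."""
    def sim_logits(img, txt, tau):
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        return img @ txt.t() / tau

    # Teacher probabilities are soft targets; no gradient flows to them.
    with torch.no_grad():
        p_teacher = F.softmax(sim_logits(teacher_img, teacher_txt, tau_t), dim=-1)
    log_p_student = F.log_softmax(sim_logits(student_img, student_txt, tau_s), dim=-1)

    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```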

Cross-modal concept learning and inference for vision-language models

Y Zhang, C Zhang, Y Tang, Z He - Neurocomputing, 2024 - Elsevier
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the
correlation between texts and images, achieving remarkable success on various …

Bdc-adapter: Brownian distance covariance for better vision-language reasoning

Y Zhang, C Zhang, Z Liao, Y Tang, Z He - arXiv preprint arXiv:2309.01256, 2023 - arxiv.org
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have
introduced a new paradigm for learning transferable visual representations. Recently, there …
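(As background: Brownian distance covariance is a classical statistic that detects arbitrary, not merely linear, dependence between two random vectors, which is what motivates its use for vision-language reasoning here. A minimal sketch of the standard empirical estimator follows; the function and variable names are illustrative, and this is not the paper's adapter code.)

```python
import torch

def distance_covariance(x, y):
    """Empirical (squared) Brownian distance covariance between paired
    samples x: (n, p) and y: (n, q). It equals zero if and only if the
    underlying variables are independent."""
    def centered_distances(z):
        # Pairwise Euclidean distance matrix, then double-centering:
        # subtract row means and column means, add back the grand mean.
        d = torch.cdist(z, z)
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()

    a = centered_distances(x)
    b = centered_distances(y)
    # Average of the elementwise product of the centered matrices.
    return (a * b).mean()
```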