Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural
networks (DNNs), and they usually train a separate DNN for each visual recognition task …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Large-scale domain-specific pretraining for biomedical vision-language processing

S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …
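(For context: CLIP-style contrastive pretraining optimizes a symmetric InfoNCE objective over matched image-text pairs in a batch. A minimal PyTorch sketch of that objective follows; the tensor names and temperature value are illustrative assumptions, not code from the paper.)

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings from separate encoders.
    The i-th image and i-th text form a positive pair; every other
    pairing in the batch serves as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```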

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Cross-modal retrieval: a systematic review of methods and future directions

F Li, L Zhu, T Wang, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users demanding access to data from various …

A simple framework for text-supervised semantic segmentation

M Yi, Q Cui, H Wu, C Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-supervised semantic segmentation is a novel research topic that allows semantic
segments to emerge with image-text contrasting. However, pioneering methods could be …

Domain prompt learning with quaternion networks

Q Cao, Z Xu, Y Chen, C Ma… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Prompt learning has emerged as an effective and data-efficient technique in large Vision-
Language Models (VLMs). However, when adapting VLMs to specialized domains such as …
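(For reference: prompt learning in this sense typically replaces a hand-written prompt such as "a photo of a {class}" with learnable context vectors prepended to frozen class-name embeddings, and only those vectors are trained. A minimal CoOp-style sketch follows; the class names, shapes, and defaults are illustrative assumptions, not this paper's quaternion-network method.)

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt learner: a few trainable context vectors are
    prepended to each frozen class-name embedding. Only the context
    vectors receive gradients; the VLM's encoders stay frozen."""

    def __init__(self, class_name_embeds, n_ctx=16, embed_dim=512):
        super().__init__()
        # Shared learnable context, initialized with small random values.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # class_name_embeds: (n_cls, n_name_tokens, embed_dim), taken
        # from the frozen token-embedding table; registered as a buffer
        # so it is saved with the module but never trained.
        self.register_buffer("class_embeds", class_name_embeds)

    def forward(self):
        n_cls = self.class_embeds.size(0)
        # (n_cls, n_ctx, dim): the same context repeated for each class.
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Concatenate context before each class-name embedding, giving
        # the token sequence fed to the frozen text encoder.
        return torch.cat([ctx, self.class_embeds], dim=1)
```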

Dynamic contrastive distillation for image-text retrieval

J Rao, L Ding, S Qi, M Fang, Y Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The recent advancement in vision-and-language pretraining (VLP) has significantly
improved the performance of cross-modal image-text retrieval (ITR) systems. However, the …
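(As background: contrastive distillation for retrieval usually trains a small student to match the image-text similarity distribution of a large VLP teacher. The generic sketch below uses fixed temperatures, whereas the paper's "dynamic" variant adapts the objective during training; all names and values here are illustrative assumptions.)

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_img, student_txt,
                                  teacher_img, teacher_txt,
                                  tau_s=0.05, tau_t=0.05):
    """Match the student's image-to-text similarity distribution over a
    batch to the teacher's, via KL divergence on softened similarities."""
    def sim_logits(img, txt, tau):
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        return img @ txt.t() / tau

    # Teacher probabilities are soft targets; no gradient flows to them.
    with torch.no_grad():
        p_teacher = F.softmax(sim_logits(teacher_img, teacher_txt, tau_t), dim=-1)
    log_p_student = F.log_softmax(sim_logits(student_img, student_txt, tau_s), dim=-1)

    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```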

Cross-modal concept learning and inference for vision-language models

Y Zhang, C Zhang, Y Tang, Z He - Neurocomputing, 2024 - Elsevier
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the
correlation between texts and images, achieving remarkable success on various …

Bdc-adapter: Brownian distance covariance for better vision-language reasoning

Y Zhang, C Zhang, Z Liao, Y Tang, Z He - arXiv preprint arXiv:2309.01256, 2023 - arxiv.org
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have
introduced a new paradigm for learning transferable visual representations. Recently, there …
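(As background: Brownian distance covariance is a classical statistic that detects arbitrary, not merely linear, dependence between two random vectors, which is what motivates its use for vision-language reasoning here. A minimal sketch of the standard empirical estimator follows; the function and variable names are illustrative, and this is not the paper's adapter code.)

```python
import torch

def distance_covariance(x, y):
    """Empirical (squared) Brownian distance covariance between paired
    samples x: (n, p) and y: (n, q). It equals zero if and only if the
    underlying variables are independent."""
    def centered_distances(z):
        # Pairwise Euclidean distance matrix, then double-centering:
        # subtract row means and column means, add back the grand mean.
        d = torch.cdist(z, z)
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()

    a = centered_distances(x)
    b = centered_distances(y)
    # Average of the elementwise product of the centered matrices.
    return (a * b).mean()
```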