Cross-modal retrieval: a systematic review of methods and future directions
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users seeking access to data across various …
Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …
Self-supervised multimodal learning: A survey
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …
Large-scale domain-specific pretraining for biomedical vision-language processing
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …
Binding touch to everything: Learning unified multimodal tactile representations
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …
A simple framework for text-supervised semantic segmentation
Text-supervised semantic segmentation is a novel research topic that allows semantic
segments to emerge with image-text contrasting. However, pioneering methods could be …
TF-FAS: twofold-element fine-grained semantic guidance for generalizable face anti-spoofing
Generalizable face anti-spoofing (FAS) approaches have recently garnered considerable
attention due to their robustness in unseen scenarios. Some recent methods incorporate …
Domain prompt learning with quaternion networks
Prompt learning has emerged as an effective and data-efficient technique in large
Vision-Language Models (VLMs). However, when adapting VLMs to specialized domains such as …
Dynamic contrastive distillation for image-text retrieval
The recent advancement in vision-and-language pretraining (VLP) has significantly
improved the performance of cross-modal image-text retrieval (ITR) systems. However, the …
Cross-modal concept learning and inference for vision-language models
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the
correlation between texts and images, achieving remarkable success on various …