Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Progressive spatio-temporal prototype matching for text-video retrieval

P Li, CW Xie, L Zhao, H Xie, J Ge… - Proceedings of the …, 2023 - openaccess.thecvf.com
The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …

Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models

D Lu, Z Wang, T Wang, W Guan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language pre-training (VLP) models have shown vulnerability to adversarial
examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately …

Re-mine, learn and reason: Exploring the cross-modal semantic correlations for language-guided HOI detection

Y Cao, Q Tang, F Yang, X Su, S You… - Proceedings of the …, 2023 - openaccess.thecvf.com
Human-Object Interaction (HOI) detection is a challenging computer vision task that
requires visual models to address the complex interactive relationship between humans and …

LexLIP: Lexicon-bottlenecked language-image pre-training for large-scale image-text sparse retrieval

Z Luo, P Zhao, C Xu, X Geng, T Shen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from
the other modality. The conventional dense retrieval paradigm relies on encoding images …

Detecting any human-object interaction relationship: Universal HOI detector with spatial prompt learning on foundation models

Y Cao, Q Tang, X Su, S Chen, S You… - Advances in Neural …, 2023 - proceedings.neurips.cc
Human-object interaction (HOI) detection aims to comprehend the intricate relationships
between humans and objects, predicting triplets, and serving as the foundation for …

Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory

S Gao, X Jia, X Ren, I Tsang, Q Guo - European Conference on Computer …, 2025 - Springer
Vision-language pre-training (VLP) models exhibit remarkable capabilities in
comprehending both images and text, yet they remain susceptible to multimodal adversarial …

A large cross-modal video retrieval dataset with reading comprehension

W Wu, Y Zhao, Z Li, J Li, H Zhou, MZ Shou, X Bai - Pattern Recognition, 2025 - Elsevier
Most existing cross-modal language-to-video retrieval (VR) research focuses on
single-modal input from video, i.e., visual representation, while the text is omnipresent in human …

Neuron-based spiking transmission and reasoning network for robust image-text retrieval

W Li, Z Ma, LJ Deng, X Fan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Most of the image-text retrieval methods carry out accurate results using fine-grained
features for feature alignment. However, extracting the robustness features while …

Efficient token-guided image-text retrieval with consistent multimodal contrastive training

C Liu, Y Zhang, H Wang, W Chen… - … on Image Processing, 2023 - ieeexplore.ieee.org
Image-text retrieval is a central problem for understanding the semantic relationship
between vision and language, and serves as the basis for various visual and language …