Cross-modal retrieval: a systematic review of methods and future directions
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users seeking access to data across various …
Progressive spatio-temporal prototype matching for text-video retrieval
The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …
Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models
Vision-language pre-training (VLP) models have shown vulnerability to adversarial
examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately …
Re-mine, learn and reason: Exploring the cross-modal semantic correlations for language-guided hoi detection
Human-Object Interaction (HOI) detection is a challenging computer vision task that
requires visual models to address the complex interactive relationship between humans and …
Lexlip: Lexicon-bottlenecked language-image pre-training for large-scale image-text sparse retrieval
Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from
the other modality. The conventional dense retrieval paradigm relies on encoding images …
Detecting any human-object interaction relationship: Universal hoi detector with spatial prompt learning on foundation models
Y Cao, Q Tang, X Su, S Chen, S You… - Advances in Neural …, 2023 - proceedings.neurips.cc
Human-object interaction (HOI) detection aims to comprehend the intricate relationships
between humans and objects, predicting triplets, and serving as the foundation for …
Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory
Vision-language pre-training (VLP) models exhibit remarkable capabilities in
comprehending both images and text, yet they remain susceptible to multimodal adversarial …
A large cross-modal video retrieval dataset with reading comprehension
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal
input from video, i.e., visual representation, while the text is omnipresent in human …
Neuron-based spiking transmission and reasoning network for robust image-text retrieval
Most of the image-text retrieval methods carry out accurate results using fine-grained
features for feature alignment. However, extracting the robustness features while …
Efficient token-guided image-text retrieval with consistent multimodal contrastive training
Image-text retrieval is a central problem for understanding the semantic relationship
between vision and language, and serves as the basis for various visual and language …