Ts2-net: Token shift and selection transformer for text-video retrieval

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org

The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

Cited by 315 · Related articles · All 2 versions

[PDF] neurips.cc

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc

Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Cited by 136 · Related articles · All 7 versions

[PDF] thecvf.com

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Cited by 63 · Related articles · All 6 versions

[PDF] thecvf.com

Cap4video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …

Cited by 84 · Related articles · All 6 versions

[PDF] arxiv.org

Valor: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

Cited by 94 · Related articles · All 4 versions

[PDF] arxiv.org

Clip-driven fine-grained text-image person re-identification

S Yan, N Dong, L Zhang, J Tang - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to
the given text query from a pool of candidate images. Existing methods employ prior …

Cited by 138 · Related articles · All 7 versions

[PDF] thecvf.com

Diffusionret: Generative text-video retrieval with diffusion model

P Jin, H Li, Z Cheng, K Li, X Ji, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …

Cited by 53 · Related articles · All 5 versions

[PDF] thecvf.com

Progressive spatio-temporal prototype matching for text-video retrieval

P Li, CW Xie, L Zhao, H Xie, J Ge… - Proceedings of the …, 2023 - openaccess.thecvf.com

The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …

Cited by 32 · Related articles · All 3 versions

[PDF] aaai.org

Covr: Learning composed video retrieval from web video captions

L Ventura, A Yang, C Schmid, G Varol - Proceedings of the AAAI …, 2024 - ojs.aaai.org

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers
both text and image queries together, to search for relevant images in a database. Most …

Cited by 31 · Related articles · All 14 versions

[PDF] thecvf.com

Unified coarse-to-fine alignment for video-text retrieval

Z Wang, YL Sung, F Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com

The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …

Cited by 43 · Related articles · All 6 versions