InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Cap4Video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …

VALOR: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

CLIP-driven fine-grained text-image person re-identification

S Yan, N Dong, L Zhang, J Tang - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to
the given text query from a pool of candidate images. Existing methods employ prior …

DiffusionRet: Generative text-video retrieval with diffusion model

P Jin, H Li, Z Cheng, K Li, X Ji, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …

Progressive spatio-temporal prototype matching for text-video retrieval

P Li, CW Xie, L Zhao, H Xie, J Ge… - Proceedings of the …, 2023 - openaccess.thecvf.com
The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …

CoVR: Learning composed video retrieval from web video captions

L Ventura, A Yang, C Schmid, G Varol - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers
both text and image queries together, to search for relevant images in a database. Most …

Unified coarse-to-fine alignment for video-text retrieval

Z Wang, YL Sung, F Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …