InternVideo: General video foundation models via generative and discriminative learning
The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
Cap4Video: What can auxiliary captions do for text-video retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
VALOR: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
CLIP-driven fine-grained text-image person re-identification
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to
the given text query from a pool of candidate images. Existing methods employ prior …
DiffusionRet: Generative text-video retrieval with diffusion model
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
Progressive spatio-temporal prototype matching for text-video retrieval
The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …
CoVR: Learning composed video retrieval from web video captions
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers
both text and image queries together, to search for relevant images in a database. Most …
Unified coarse-to-fine alignment for video-text retrieval
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …