Socratic models: Composing zero-shot multimodal reasoning with language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
Language models with image descriptors are strong few-shot video-language learners
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …
Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance; these methods pursue semantic interaction upon pre …
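For orientation, the objective such CLIP-style methods build on is the standard symmetric contrastive (InfoNCE) loss; the notation below is a generic sketch, not drawn from this paper. For a batch of N video-text pairs with encoders f and g and temperature \tau, the video-to-text direction is

\mathcal{L}_{v \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(f(v_i), g(t_i)) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(f(v_i), g(t_j)) / \tau)}

and the full loss averages this with the symmetric text-to-video term.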
Cap4Video: What can auxiliary captions do for text-video retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
VALOR: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
DiffusionRet: Generative text-video retrieval with diffusion model
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
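To make the discriminative/generative contrast concrete (a textbook factorization; that DiffusionRet targets the joint distribution follows the paper's own framing, but the equation below is our paraphrase):

p(\text{candidates}, \text{query}) = p(\text{candidates} \mid \text{query}) \, p(\text{query})

A discriminant retriever fits only the conditional factor, whereas a generative one models the joint and therefore also captures the query distribution p(\text{query}).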
CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-
language representation learned from a large scale of web-collected image-text data. In light …
Deep learning for video-text retrieval: a review
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the
semantics in a given sentence, and vice versa. In general, this retrieval task is composed of …
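As a concrete illustration of the matching stage of VTR (a minimal sketch under assumed conventions: precomputed, L2-normalized embeddings and numpy; rank_videos is a hypothetical helper, not from the review):

import numpy as np

def rank_videos(query_emb, video_embs):
    # With L2-normalized embeddings, the dot product equals
    # cosine similarity, the usual VTR matching score.
    scores = video_embs @ query_emb   # shape: (num_videos,)
    return np.argsort(-scores)        # indices, best match first

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=128); q /= np.linalg.norm(q)
V = rng.normal(size=(5, 128)); V /= np.linalg.norm(V, axis=1, keepdims=True)
print(rank_videos(q, V))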
Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …