Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Language models with image descriptors are strong few-shot video-language learners

Z Wang, M Li, R Xu, L Zhou, J Lei… - Advances in …, 2022 - proceedings.neurips.cc
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
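
The contrastive alignment named in this snippet (as in CLIP) is typically trained with a symmetric InfoNCE objective over paired embeddings. The sketch below shows only that generic objective; the function name, the temperature value, and the use of PyTorch are illustrative assumptions, and the paper's hierarchical Banzhaf-interaction modelling is not reproduced here.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (batch, dim) tensors; the i-th video is paired with
    the i-th text. Names and the temperature value are illustrative assumptions.
    """
    # L2-normalize so the dot product equals cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2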

Cap4Video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …

VALOR: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

DiffusionRet: Generative text-video retrieval with diffusion model

P Jin, H Li, Z Cheng, K Li, X Ji, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
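
Written out, the conditional-likelihood objective in the snippet, and the generative alternative it is contrasted with, look roughly as follows; the joint-probability factorization is a hedged reading of the generative framing, not the paper's exact formulation:

\text{discriminative:}\quad \max_{\theta} \sum_{(q,\,c^{+})} \log p_{\theta}(c^{+} \mid q)
\qquad
\text{generative:}\quad \max_{\theta} \sum_{(q,\,c^{+})} \log p_{\theta}(c^{+}, q)
  = \sum_{(q,\,c^{+})} \bigl[ \log p_{\theta}(c^{+} \mid q) + \log p_{\theta}(q) \bigr],

where q is a text query and c^{+} its matching video candidate.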

CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment

H Xue, Y Sun, B Liu, J Fu, R Song, H Li… - arXiv preprint arXiv …, 2022 - arxiv.org
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-
language representations learned from large-scale web-collected image-text data. In light …

Deep learning for video-text retrieval: a review

C Zhu, Q Jia, W Chen, Y Guo, Y Liu - International Journal of Multimedia …, 2023 - Springer
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the
semantics in a given sentence, and vice versa. In general, this retrieval task is composed of …
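
Once video and sentence embeddings are available, the matching stage of such a retrieval pipeline reduces to similarity ranking. The sketch below assumes pre-computed embeddings; the function and variable names are illustrative and not taken from the review.

import numpy as np

def rank_videos_for_query(query_emb, video_embs, top_k=5):
    """Rank videos by cosine similarity to a text query embedding.

    query_emb: (dim,) text embedding; video_embs: (num_videos, dim).
    Returns indices of the top_k most similar videos (illustrative sketch).
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per video
    return np.argsort(-sims)[:top_k]  # highest similarity first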

Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring

R Liu, J Huang, G Li, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …