CoCa: Contrastive captioners are image-text foundation models

J Yu, Z Wang, V Vasudevan, L Yeung… - arXiv preprint arXiv …, 2022 - arxiv.org
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
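
The title names the recipe: one model trained jointly on a contrastive image-text loss and a captioning (autoregressive language-modeling) loss. A minimal PyTorch sketch of that combined objective; the loss weights and input shapes are illustrative assumptions, not the paper's exact settings:

    import torch
    import torch.nn.functional as F

    def contrastive_captioner_loss(img_emb, txt_emb, cap_logits, cap_targets,
                                   temperature=0.07, w_con=1.0, w_cap=2.0):
        """Joint contrastive + captioning objective (sketch; weights assumed)."""
        # Symmetric InfoNCE over the in-batch image-text pairs.
        img_emb = F.normalize(img_emb, dim=-1)          # (B, d)
        txt_emb = F.normalize(txt_emb, dim=-1)          # (B, d)
        logits = img_emb @ txt_emb.t() / temperature    # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        con = (F.cross_entropy(logits, targets) +
               F.cross_entropy(logits.t(), targets)) / 2
        # Autoregressive captioning loss over the decoder's token logits.
        cap = F.cross_entropy(cap_logits.flatten(0, 1),  # (B*T, vocab)
                              cap_targets.flatten())     # (B*T,)
        return w_con * con + w_cap * cap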

Socratic Models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
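
The composition happens purely through language: one pretrained model writes text that becomes another model's prompt. A hedged sketch of the pattern; vlm_score and lm_complete are hypothetical placeholder callables standing in for any CLIP-like scorer and any text-completion LM, not APIs from the paper:

    # Hypothetical sketch of language-based model composition.
    def best_caption(image, candidates, vlm_score):
        """Pick the candidate caption the vision-language scorer rates highest."""
        return max(candidates, key=lambda c: vlm_score(image, c))

    def socratic_answer(image, question, candidates, vlm_score, lm_complete):
        """Compose VLM perception with LM reasoning via a text prompt."""
        caption = best_caption(image, candidates, vlm_score)
        prompt = (f"Scene description: {caption}\n"
                  f"Question: {question}\n"
                  f"Answer:")
        return lm_complete(prompt)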

VideoCLIP: Contrastive pre-training for zero-shot video-text understanding

H Xu, G Ghosh, PY Huang, D Okhonko… - arXiv preprint arXiv …, 2021 - arxiv.org
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
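
A distinctive part of the contrastive setup is how positive pairs are built: text segments are matched with temporally overlapping, variable-length video clips rather than exactly aligned ones. A simplified sketch of that sampling step; the clip-length range is an assumption:

    import random

    def sample_overlapping_clip(video_len, text_span, min_len=3.0, max_len=16.0):
        """Sample a video clip that temporally overlaps a text segment (sketch)."""
        t_start, t_end = text_span
        center = random.uniform(t_start, t_end)      # anchor inside the text span
        duration = random.uniform(min_len, max_len)  # assumed length range (seconds)
        v_start = max(0.0, center - duration / 2)
        v_end = min(video_len, center + duration / 2)
        return (v_start, v_end)                      # loosely aligned positive clip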

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan, T Li - Neurocomputing, 2022 - Elsevier
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
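
The simplest transfer strategy examined, and a common baseline in later work, is parameter-free: encode sampled frames with CLIP's image encoder, mean-pool them into one video vector, and score it against the CLIP text embedding by cosine similarity. A minimal sketch; the encoder outputs are placeholder inputs:

    import torch
    import torch.nn.functional as F

    def video_text_similarity(frame_embs, text_emb):
        """Parameter-free video-text score via temporal mean pooling (sketch).

        frame_embs: (num_frames, d) per-frame CLIP image features (placeholders)
        text_emb:   (d,) CLIP text feature
        """
        video_emb = F.normalize(frame_embs, dim=-1).mean(dim=0)  # pool over time
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        return video_emb @ text_emb                              # cosine similarity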

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
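
"Multi-grained" refers to contrasting the two modalities at several granularities at once: video-sentence, frame-sentence, video-word, and frame-word. A simplified sketch that averages the four scores; the paper aggregates them with learned attention over the similarity matrices, so the plain means here are an assumption:

    import torch

    def multi_grained_score(frames, words, video, sentence):
        """Cross-grained similarity for one video-caption pair (sketch).

        frames: (num_frames, d)  video:    (d,) pooled video feature
        words:  (num_words, d)   sentence: (d,) pooled sentence feature
        All inputs assumed L2-normalized.
        """
        s_vs = video @ sentence                 # video-sentence (coarsest)
        s_fs = (frames @ sentence).mean()       # frame-sentence
        s_vw = (words @ video).mean()           # video-word
        s_fw = (frames @ words.t()).mean()      # frame-word (finest)
        return (s_vs + s_fs + s_vw + s_fw) / 4  # paper: attention, not a mean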

Simple but effective: CLIP embeddings for embodied AI

A Khandelwal, L Weihs, R Mottaghi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) encoders have been shown to be beneficial
for a range of visual tasks from classification and detection to captioning and image …
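
The "simple but effective" recipe is to reuse a frozen CLIP visual encoder as the agent's perception backbone and train only a small head on top. A hedged sketch; the head architecture and dimensions below are assumptions, not the paper's exact model:

    import torch
    import torch.nn as nn

    class CLIPBackbonePolicy(nn.Module):
        """Embodied-AI policy over frozen CLIP features (illustrative sketch)."""

        def __init__(self, clip_visual, feat_dim=512, hidden=256, num_actions=6):
            super().__init__()
            self.clip_visual = clip_visual.eval()  # frozen pretrained encoder
            for p in self.clip_visual.parameters():
                p.requires_grad = False
            self.policy = nn.Sequential(           # trainable head (assumed shape)
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_actions),
            )

        def forward(self, obs):
            with torch.no_grad():
                feats = self.clip_visual(obs)      # (B, feat_dim) visual features
            return self.policy(feats)              # (B, num_actions) action logits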

X-Pool: Cross-modal language-video attention for text-video retrieval

SK Gorti, N Vouitsis, J Ma, K Golestan… - Proceedings of the …, 2022 - openaccess.thecvf.com
In text-video retrieval, the objective is to learn a cross-modal similarity function between a
text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However …
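
The mechanism the title describes lets the text decide which frames matter: the text embedding acts as an attention query over the video's frame embeddings, so pooling is query-conditioned rather than fixed. A simplified, projection-free single-head sketch (the paper uses learned projections):

    import math
    import torch.nn.functional as F

    def text_conditioned_pool(text_emb, frame_embs):
        """Pool frames with the text as attention query (simplified sketch).

        text_emb:   (d,)            query text feature
        frame_embs: (num_frames, d) per-frame video features
        Returns a text-conditioned video embedding of shape (d,).
        """
        d = text_emb.shape[-1]
        weights = F.softmax(frame_embs @ text_emb / math.sqrt(d), dim=0)
        return weights @ frame_embs  # attention-weighted sum over frames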

Bridging video-text retrieval with multiple choice questions

Y Ge, Y Ge, X Liu, D Li, Y Shan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Pre-training a model to learn transferable video-text representations for retrieval has attracted
a lot of attention in recent years. Previous dominant works mainly adopt two separate …

CLIP2Video: Mastering video-text retrieval via image CLIP

H Fang, P Xiong, L Xu, Y Chen - arXiv preprint arXiv:2106.11097, 2021 - arxiv.org
We present the CLIP2Video network to transfer the image-language pre-training model to video-
text retrieval in an end-to-end manner. Leading approaches in the domain of video-and …
