CoCa: Contrastive captioners are image-text foundation models
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
Socratic models: Composing zero-shot multimodal reasoning with language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
Simple but effective: CLIP embeddings for embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial
for a range of visual tasks from classification and detection to captioning and image …
X-Pool: Cross-modal language-video attention for text-video retrieval
In text-video retrieval, the objective is to learn a cross-modal similarity function between a
text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However …
Bridging video-text retrieval with multiple choice questions
Pre-training a model to learn transferable video-text representation for retrieval has attracted
a lot of attention in recent years. Previous dominant works mainly adopt two separate …
CLIP2Video: Mastering video-text retrieval via image CLIP
We present the CLIP2Video network to transfer the image-language pre-training model to video-
text retrieval in an end-to-end manner. Leading approaches in the domain of video-and …
CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval
Video-text retrieval plays an essential role in multi-modal research and has been widely
used in many real-world web applications. The CLIP (Contrastive Language-Image Pre …