CoCa: Contrastive captioners are image-text foundation models
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
Socratic models: Composing zero-shot multimodal reasoning with language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
Simple but effective: CLIP embeddings for embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial
for a range of visual tasks from classification and detection to captioning and image …
X-Pool: Cross-modal language-video attention for text-video retrieval
In text-video retrieval, the objective is to learn a cross-modal similarity function between a
text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However …
Bridging video-text retrieval with multiple choice questions
Pre-training a model to learn transferable video-text representation for retrieval has attracted
a lot of attention in recent years. Previous dominant works mainly adopt two separate …
CLIP2Video: Mastering video-text retrieval via image CLIP
We present the CLIP2Video network to transfer the image-language pre-training model to video-
text retrieval in an end-to-end manner. Leading approaches in the domain of video-and …
CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval
Video-text retrieval plays an essential role in multi-modal research and has been widely
used in many real-world web applications. The CLIP (Contrastive Language-Image Pre …