Transformers in medical image analysis
Transformers have dominated the field of natural language processing and have recently
made an impact in the area of computer vision. In the field of medical image analysis …
Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
OmniVL: One foundation model for image-language and video-language tasks
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
Egocentric video-language pretraining
Video-Language Pretraining (VLP), which aims to learn transferable representations
to advance a wide range of video-text downstream tasks, has recently received increasing …
SwinBERT: End-to-end transformers with sparse attention for video captioning
The canonical approach to video captioning requires a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …
CvT: Introducing convolutions to vision transformers
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …