Transformers in medical image analysis

K He, C Gan, Z Li, I Rekik, Z Yin, W Ji, Y Gao, Q Wang… - Intelligent …, 2023 - Elsevier
Transformers have dominated the field of natural language processing and have recently
made an impact in the area of computer vision. In the field of medical image analysis …

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner that has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
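
Since this entry describes single-stage dense event captioning, a brief sketch of the underlying time-token idea may help: timestamps are quantized into a small vocabulary of special tokens so that event boundaries and caption words can share one output sequence. The bin count and token naming below are illustrative assumptions, not Vid2Seq's exact configuration.

```python
# Hedged sketch of timestamp quantization into discrete "time tokens",
# in the spirit of dense event captioning models such as Vid2Seq.
# num_bins and the <time_i> naming are illustrative assumptions.
def timestamp_to_token(t_seconds, duration_seconds, num_bins=100):
    """Map an absolute timestamp to a discrete time-token string."""
    frac = min(max(t_seconds / duration_seconds, 0.0), 1.0)
    bin_id = min(int(frac * num_bins), num_bins - 1)
    return f"<time_{bin_id}>"

# An event from 12.0s to 25.5s in a 60s video could then be emitted as:
#   <time_20> <time_42> a person opens the door
print(timestamp_to_token(12.0, 60.0), timestamp_to_token(25.5, 60.0))
```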

OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …
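
As a rough illustration of how one visual backbone can serve both image-language and video-language tasks, the sketch below treats an image as a one-frame clip so both inputs share a (B, T, C, H, W) interface. This unification strategy is an assumption about the general design pattern, not a claim about OmniVL's exact recipe.

```python
# Hedged sketch: promote images to single-frame clips so one encoder
# handles both modalities. Shapes are illustrative assumptions.
import torch

def as_clip(x):
    """Promote an image batch (B, C, H, W) to a clip batch (B, 1, C, H, W)."""
    return x.unsqueeze(1) if x.dim() == 4 else x

images = torch.randn(2, 3, 224, 224)       # image batch
videos = torch.randn(2, 8, 3, 224, 224)    # 8-frame video batch
assert as_clip(images).shape == (2, 1, 3, 224, 224)
assert as_clip(videos).shape == (2, 8, 3, 224, 224)
```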

VideoCLIP: Contrastive pre-training for zero-shot video-text understanding

H Xu, G Ghosh, PY Huang, D Okhonko… - arXiv preprint arXiv …, 2021 - arxiv.org
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
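
The contrastive objective named in this entry lends itself to a short sketch. Below is a minimal symmetric InfoNCE-style loss over a batch of paired video and text embeddings, assuming in-batch negatives and an illustrative temperature; VideoCLIP's actual training additionally uses temporally overlapped positive pairs and retrieval-augmented negatives.

```python
# Hedged sketch of symmetric InfoNCE-style video-text contrastive loss.
# Encoders, embedding width, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric video->text and text->video InfoNCE loss.

    video_emb, text_emb: (batch, dim) outputs of separate encoders.
    Matched pairs share a batch index; all other pairs act as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)       # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> video direction
    return (loss_v2t + loss_t2v) / 2

video = torch.randn(8, 256)  # stand-in video encoder outputs
text = torch.randn(8, 256)   # stand-in text encoder outputs
loss = contrastive_loss(video, text)
```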

Egocentric video-language pretraining

KQ Lin, J Wang, M Soldan, M Wray… - Advances in …, 2022 - proceedings.neurips.cc
Video-Language Pretraining (VLP), which aims to learn transferable representations
to advance a wide range of video-text downstream tasks, has recently received increasing …

SwinBERT: End-to-end transformers with sparse attention for video captioning

K Lin, L Li, CC Lin, F Ahmed, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
The canonical approach to video captioning requires a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …
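
To make the sparse-attention idea concrete, here is a hedged sketch of a learnable attention mask regularized toward sparsity. The mask parameterization, the additive-bias formulation, and the penalty form are illustrative assumptions rather than SwinBERT's exact design.

```python
# Hedged sketch: self-attention with a learnable soft mask whose mean is
# penalized to encourage sparse token-pair interactions.
import torch
import torch.nn as nn

class SparseMaskedAttention(nn.Module):
    def __init__(self, num_tokens, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One learnable logit per token pair; sigmoid keeps values in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))

    def forward(self, x):                            # x: (B, num_tokens, dim)
        soft_mask = torch.sigmoid(self.mask_logits)
        # Additive bias: log of the soft mask down-weights near-pruned pairs.
        bias = torch.log(soft_mask + 1e-6)
        out, _ = self.attn(x, x, x, attn_mask=bias)
        # Add this penalty to the training loss to push the mask toward sparsity.
        sparsity_penalty = soft_mask.mean()
        return out, sparsity_penalty

layer = SparseMaskedAttention(num_tokens=32, dim=64)  # 32 video tokens, width 64
out, penalty = layer(torch.randn(2, 32, 64))
```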

CvT: Introducing convolutions to vision transformers

H Wu, B Xiao, N Codella, M Liu, X Dai… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …
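
The convolutional ingredient this entry refers to can be illustrated with a minimal convolutional token embedding: an overlapping strided convolution turns an image into a token sequence before the transformer blocks. The kernel size, stride, and dimensions below are illustrative assumptions, not CvT's exact stage configuration.

```python
# Hedged sketch of a convolutional token embedding for a vision transformer.
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping conv patchification: image batch -> token sequence."""
    def __init__(self, in_channels=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H', W')
        x = x.flatten(2).transpose(1, 2)     # (B, H'*W', D) token sequence
        return self.norm(x)

tokens = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))
assert tokens.shape == (2, 56 * 56, 64)
```

The overlap between neighboring patches (stride smaller than kernel size) is the main behavioral difference from ViT's non-overlapping linear patch projection.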

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval: in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
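
Given a joint embedding space like the one this entry targets, text-to-video retrieval reduces to ranking gallery videos by cosine similarity to the query embedding. The encoders are omitted below; only the ranking step is sketched, with stand-in shapes as assumptions.

```python
# Hedged sketch of text-to-video retrieval in a shared embedding space.
import torch
import torch.nn.functional as F

def retrieve(text_emb, video_embs, top_k=5):
    """Rank a gallery of video embeddings against one text query.

    text_emb: (dim,) query embedding; video_embs: (num_videos, dim).
    Returns indices of the top_k most similar videos.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    sims = video_embs @ text_emb            # cosine similarities, (num_videos,)
    return sims.topk(top_k).indices

query = torch.randn(256)                    # stand-in text encoder output
gallery = torch.randn(1000, 256)            # stand-in video embeddings
top = retrieve(query, gallery)              # indices of the 5 best matches
```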