MViTv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
Multiscale vision transformers
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
Video Swin Transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …
RegionViT: Regional-to-local attention for vision transformers
Vision transformer (ViT) has recently shown its strong capability in achieving comparable
results to convolutional neural networks (CNNs) on image classification. However, vanilla …
Multiview transformers for video recognition
Video understanding requires reasoning at multiple spatiotemporal resolutions — from short,
fine-grained motions to events taking place over longer durations. Although transformer …
MPViT: Multi-path vision transformer for dense prediction
Dense computer vision tasks such as object detection and segmentation require effective
multi-scale feature representation for detecting or classifying objects or regions with varying …
Mobile-Former: Bridging MobileNet and transformer
Abstract We present Mobile-Former, a parallel design of MobileNet and transformer with a
two-way bridge in between. This structure leverages the advantages of MobileNet at local …
VSA: Learning varied-size window attention in vision transformers
Attention within windows has been widely explored in vision transformers to balance
performance, computation complexity, and memory footprint. However, current models adopt …
MaxViT: Multi-axis vision transformer
Transformers have recently gained significant attention in the computer vision community.
However, the lack of scalability of self-attention mechanisms with respect to image size has …
TransMix: Attend to mix for vision transformers
Mixup-based augmentation has been found to be effective for generalizing models during
training, especially for Vision Transformers (ViTs), since they can easily overfit. However …