Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs
We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by
recent advances in vision transformers (ViTs), we demonstrate in this paper that using a few …
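The core idea is easy to prototype: a very large depthwise convolution serves as the spatial mixer, optionally with a small parallel kernel to ease optimization. A minimal PyTorch sketch follows; the block structure and sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise large-kernel spatial mixer, sketched after the large-kernel idea.

    A 31x31 depthwise conv mixes spatial context cheaply (parameters scale
    with channels, not channels^2); a parallel small kernel aids optimization.
    All sizes here are illustrative assumptions.
    """
    def __init__(self, dim: int, kernel_size: int = 31, small_kernel: int = 5):
        super().__init__()
        self.large = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.small = nn.Conv2d(dim, dim, small_kernel,
                               padding=small_kernel // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the block easy to optimize at depth.
        return x + self.norm(self.large(x) + self.small(x))

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```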
More ConvNets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
Transformers have quickly risen to prominence in computer vision since the emergence of
Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) …
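One way to keep kernels of this size tractable is to decompose a dense KxK depthwise kernel into two rectangular depthwise kernels. The sketch below illustrates that decomposition under assumed sizes; the dynamic-sparsity training schedule the title refers to is not reproduced here.

```python
import torch
import torch.nn as nn

class RectDecomposedKernel(nn.Module):
    """Sketch: approximate a huge KxK depthwise kernel with two rectangular
    depthwise kernels (Kxs and sxK). Exact sizes are assumptions."""
    def __init__(self, dim: int, k: int = 51, s: int = 5):
        super().__init__()
        self.kxs = nn.Conv2d(dim, dim, (k, s), padding=(k // 2, s // 2), groups=dim)
        self.sxk = nn.Conv2d(dim, dim, (s, k), padding=(s // 2, k // 2), groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the two rectangular responses covers both orientations
        # at far lower cost than a dense 51x51 kernel.
        return self.kxs(x) + self.sxk(x)
```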
CvT: Introducing convolutions to vision transformers
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …
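A recurring ingredient in such hybrids is a convolutional projection: tokens are reshaped into a 2D map so a depthwise convolution can inject local spatial context before attention. A hedged sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Sketch of a convolutional projection for attention inputs: tokens are
    reshaped to a 2D map, locally mixed by a depthwise conv, then linearly
    projected per token. Dimensions are illustrative assumptions."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape            # (batch, h*w, dim)
        assert n == h * w
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.dw(x)                    # local spatial mixing
        x = x.reshape(b, c, h * w).transpose(1, 2)
        return self.proj(x)               # per-token linear projection
```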
CMT: Convolutional neural networks meet vision transformers
Vision transformers have been successfully applied to image recognition tasks due to their
ability to capture long-range dependencies within an image. However, there are still gaps in …
FastViT: A fast hybrid vision transformer using structural reparameterization
The recent amalgamation of transformer and convolutional designs has led to steady
improvements in the accuracy and efficiency of these models. In this work, we introduce FastViT, a …
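Structural reparameterization rests on a simple identity: a convolution followed by BatchNorm can be folded into a single convolution at inference time, so multi-branch training blocks collapse into one operator. A minimal sketch of that folding (the standard algebra, not FastViT's exact deployment code):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm into the preceding conv: the algebra behind
    structural reparameterization (train multi-branch, deploy one conv)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                        # per-channel gamma / std
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + (bias - bn.running_mean) * scale)
    return fused
```

A quick sanity check: in eval mode, `fuse_conv_bn(conv, bn)(x)` should match `bn(conv(x))` up to floating-point error.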
Patch slimming for efficient vision transformers
This paper studies the efficiency of vision transformers by excavating redundant
computation in given networks. The recent transformer architecture has demonstrated its …
DynamicViT: Efficient vision transformers with dynamic token sparsification
Attention is sparse in vision transformers. We observe that the final prediction in vision
transformers is based on only a subset of the most informative tokens, which is sufficient for …
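The mechanism can be sketched as scoring tokens and keeping the top-k; the actual method learns the selection end-to-end with a differentiable relaxation, so the hard top-k below is a simplification:

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Sketch of token sparsification: score each token with a light head
    and keep only the top-k most informative ones for the remaining layers.
    Hard top-k here is a simplification of the learned, differentiable
    selection used in practice."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, c = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)        # (b, n)
        idx = scores.topk(k, dim=1).indices
        idx = idx.sort(dim=1).values                   # keep original token order
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))
```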
How to train your ViT? Data, augmentation, and regularization in vision transformers
Vision Transformers (ViT) have been shown to attain highly competitive performance for a
wide range of vision applications, such as image classification, object detection and …
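The subject here is the training recipe rather than the architecture. Below is a sketch of the kind of augmentation/regularization pipeline involved, with illustrative settings that are assumptions rather than the paper's tuned values:

```python
from torchvision import transforms

# Illustrative ViT training augmentations; ops and magnitudes are assumptions.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # policy-style augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    transforms.RandomErasing(p=0.25),                 # occlusion regularizer
])
```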
LightViT: Towards light-weight convolution-free vision transformers
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional
neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to …
Vision transformer with progressive sampling
Transformers, with their powerful global relation modeling abilities, have recently been
introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer …
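The idea is to replace the fixed patch grid with sampling points that are iteratively refined toward informative image regions. A hedged sketch using `grid_sample`, where the offset head and iteration count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSampler(nn.Module):
    """Sketch of progressive token sampling: repeatedly (1) sample features
    at the current points and (2) predict offsets that move the points
    toward informative regions. Offset head and iteration count are
    illustrative assumptions."""
    def __init__(self, dim: int, iters: int = 4):
        super().__init__()
        self.offset = nn.Linear(dim, 2)   # per-point (dx, dy) in [-1, 1] coords
        self.iters = iters

    def forward(self, feat: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
        # feat: (b, c, h, w) feature map; pts: (b, n, 2) coords in [-1, 1],
        # e.g. initialized as a regular grid via torch.meshgrid.
        for _ in range(self.iters):
            sampled = F.grid_sample(feat, pts.unsqueeze(2),
                                    align_corners=False)    # (b, c, n, 1)
            tokens = sampled.squeeze(-1).transpose(1, 2)    # (b, n, c)
            pts = (pts + self.offset(tokens)).clamp(-1, 1)  # refine locations
        return tokens
```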