Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs

X Ding, X Zhang, J Han, G Ding - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by
recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few …
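
A minimal PyTorch sketch of the idea named in the title, assuming a residual block built around a 31x31 depthwise convolution; the class name LargeKernelDWBlock and the exact layer ordering are illustrative, not the paper's implementation:

import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        # Depthwise convolution: groups == channels keeps the cost manageable
        # even at 31x31, since each filter only sees one channel.
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise channel mixing
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.norm(self.dw(x))))

x = torch.randn(1, 64, 56, 56)
print(LargeKernelDWBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])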

More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

S Liu, T Chen, X Chen, X Chen, Q Xiao, B Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformers have quickly shone in the computer vision world since the emergence of
Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) …
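
A hedged sketch of how a kernel beyond 51x51 might be kept affordable, assuming a decomposition into two rectangular depthwise convolutions plus a small local branch; the module SparseLargeKernel and the specific sizes are illustrative guesses from the title, not the authors' released code:

import torch
import torch.nn as nn

class SparseLargeKernel(nn.Module):
    def __init__(self, channels: int, big: int = 51, small: int = 5):
        super().__init__()
        # Two rectangular depthwise branches (small x big and big x small)
        # plus a small square branch, instead of one dense big x big kernel.
        self.horiz = nn.Conv2d(channels, channels, (small, big),
                               padding=(small // 2, big // 2), groups=channels)
        self.vert = nn.Conv2d(channels, channels, (big, small),
                              padding=(big // 2, small // 2), groups=channels)
        self.local = nn.Conv2d(channels, channels, small,
                               padding=small // 2, groups=channels)

    def forward(self, x):
        # The summed branches cover a 51x51 receptive field with far fewer
        # parameters than a dense 51x51 depthwise kernel.
        return self.horiz(x) + self.vert(x) + self.local(x)

x = torch.randn(1, 32, 56, 56)
print(SparseLargeKernel(32)(x).shape)  # torch.Size([1, 32, 56, 56])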

CvT: Introducing convolutions to vision transformers

H Wu, B Xiao, N Codella, M Liu, X Dai… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …
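
A small sketch of what "introducing convolutions" to a ViT can look like, assuming a convolutional projection that reshapes tokens back to a 2D grid and applies a depthwise convolution before a linear projection; ConvProjection is an illustrative module, not CvT's actual code:

import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens, h: int, w: int):
        # tokens: (batch, h*w, dim) -> (batch, dim, h, w) for the local conv
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.dw(x).flatten(2).transpose(1, 2)  # back to (batch, n, dim)
        return self.proj(x)

tokens = torch.randn(2, 14 * 14, 192)
print(ConvProjection(192)(tokens, 14, 14).shape)  # torch.Size([2, 196, 192])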

CMT: Convolutional neural networks meet vision transformers

J Guo, K Han, H Wu, Y Tang, X Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision transformers have been successfully applied to image recognition tasks due to their
ability to capture long-range dependencies within an image. However, there are still gaps in …

FastViT: A fast hybrid vision transformer using structural reparameterization

PKA Vasu, J Gabriel, J Zhu, O Tuzel… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent amalgamation of transformer and convolutional designs has led to steady
improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a …
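
A minimal sketch of the structural reparameterization trick the title refers to, shown here as the standard folding of a BatchNorm into the preceding convolution so a train-time branch collapses into a single inference-time conv; fuse_conv_bn is an illustrative helper, not FastViT's implementation:

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Build a single conv whose weights absorb the BatchNorm statistics.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    with torch.no_grad():
        std = (bn.running_var + bn.eps).sqrt()
        fused.weight.copy_(conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(bn.weight * (bias - bn.running_mean) / std + bn.bias)
    return fused

conv, bn = nn.Conv2d(8, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()  # use running statistics, as at inference time
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True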

Patch slimming for efficient vision transformers

Y Tang, K Han, Y Wang, C Xu, J Guo… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper studies the efficiency of vision transformers by identifying redundant
computation in given networks. The recent transformer architecture has demonstrated its …

DynamicViT: Efficient vision transformers with dynamic token sparsification

Y Rao, W Zhao, B Liu, J Lu, J Zhou… - Advances in neural …, 2021 - proceedings.neurips.cc
Attention is sparse in vision transformers. We observe that the final prediction in vision
transformers is based on only a subset of the most informative tokens, which is sufficient for …
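
A toy sketch of token sparsification, assuming tokens are ranked by some per-token importance score and only the top-k are kept; keep_topk_tokens and the keep ratio are illustrative, not DynamicViT's learned prediction module:

import torch

def keep_topk_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    # tokens: (batch, n, dim); scores: (batch, n) importance per token
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                       # (batch, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (batch, k, dim)
    return tokens.gather(1, idx)

tokens = torch.randn(2, 196, 192)
scores = torch.rand(2, 196)
print(keep_topk_tokens(tokens, scores).shape)  # torch.Size([2, 137, 192])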

How to train your ViT? Data, augmentation, and regularization in vision transformers

A Steiner, A Kolesnikov, X Zhai, R Wightman… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision Transformers (ViT) have been shown to attain highly competitive performance for a
wide range of vision applications, such as image classification, object detection and …
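
A hedged sketch of the kind of augmentation and regularization knobs such a study varies, using standard torchvision/PyTorch components; the particular values are illustrative defaults, not the paper's recipe:

import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),  # stronger augmentation policy
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                # operates on tensors, so after ToTensor
])

# Label smoothing as a simple regularizer on the loss side.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)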

LightViT: Towards light-weight convolution-free vision transformers

T Huang, L Huang, S You, F Wang, C Qian… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional
neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to …

Vision transformer with progressive sampling

X Yue, S Sun, Z Kuang, M Wei… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformers, with their powerful global relation modeling abilities, have recently been
introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer …