DeiT III: Revenge of the ViT

H Touvron, M Cord, H Jégou - European Conference on Computer Vision, 2022 - Springer
Abstract A Vision Transformer (ViT) is a simple neural architecture amenable to serving
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …

Masked autoencoders are scalable vision learners

K He, X Chen, S Xie, Y Li, P Dollár… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners
for computer vision. Our MAE approach is simple: we mask random patches of the input …
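The random patch masking the MAE snippet describes can be illustrated with a minimal sketch; the function name, shapes, and 75% mask ratio here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking sketch (hypothetical helper):
    drop a fraction of the patches and keep only the visible subset."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=bool)       # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# 16 toy patches of dimension 4; with ratio 0.75, 4 remain visible
patches = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
visible, keep_idx, mask = random_mask_patches(patches)
```

The encoder then sees only `visible`, while the decoder is asked to reconstruct the masked patches indicated by `mask`.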

ResNet strikes back: An improved training procedure in timm

R Wightman, H Touvron, H Jégou - arXiv preprint arXiv:2110.00476, 2021 - arxiv.org
The influential Residual Networks designed by He et al. remain the gold-standard
architecture in numerous scientific publications. They typically serve as the default …

ViTAEv2: Vision transformer advanced by exploring inductive bias for image recognition and beyond

Q Zhang, Y Xu, J Zhang, D Tao - International Journal of Computer Vision, 2023 - Springer
Vision transformers have shown great potential in various computer vision tasks owing to
their strong capability to model long-range dependency using the self-attention mechanism …
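The "long-range dependency via self-attention" that this and several other entries refer to reduces to every token attending to every other token. A minimal single-head scaled dot-product sketch (identity projections for brevity; real ViTs use learned Q/K/V projections and multiple heads):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention sketch.
    x: (n_tokens, dim). Every output token is a weighted sum
    over ALL input tokens, hence the long-range interaction."""
    n, d = x.shape
    q, k, v = x, x, x                        # identity projections (sketch)
    scores = q @ k.T / np.sqrt(d)            # (n, n) pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over tokens
    return w @ v

x = np.random.default_rng(0).normal(size=(5, 8)).astype(np.float32)
out = self_attention(x)
```

Unlike a convolution, the receptive field here is the whole token set from the first layer on, which is the capability these papers exploit.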

ResMLP: Feedforward networks for image classification with data-efficient training

H Touvron, P Bojanowski, M Caron… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image
classification. It is a simple residual network that alternates (i) a linear layer in which image …

Transformer in transformer

K Han, A Xiao, E Wu, J Guo, C Xu… - Advances in neural …, 2021 - proceedings.neurips.cc
Transformer is a new kind of neural architecture which encodes the input data as powerful
features via the attention mechanism. Basically, the visual transformers first divide the input …
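The patch division this snippet mentions ("visual transformers first divide the input …") is a simple reshape. A sketch of non-overlapping patchification, with illustrative shapes (a 32×32×3 image and 16×16 patches are assumptions, not the paper's settings):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C (illustrative sketch)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by p"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)      # (num_patches, patch_dim)

img = np.zeros((32, 32, 3), dtype=np.float32)
patches = patchify(img, 16)              # 4 patches of dimension 768
```

TNT's contribution is then to run a second, inner transformer *within* each patch, on top of this outer patch sequence.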

Incorporating convolution designs into visual transformers

K Yuan, S Guo, Z Liu, A Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Motivated by the success of Transformers in natural language processing (NLP) tasks, there
exist some attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However …

Training data-efficient image transformers & distillation through attention

H Touvron, M Cord, M Douze, F Massa… - International …, 2021 - proceedings.mlr.press
Recently, neural networks purely based on attention were shown to address image
understanding tasks such as image classification. These high-performing vision …

ViTAE: Vision transformer advanced by exploring intrinsic inductive bias

Y Xu, Q Zhang, J Zhang, D Tao - Advances in neural …, 2021 - proceedings.neurips.cc
Transformers have shown great potential in various computer vision tasks owing to their
strong capability in modeling long-range dependency using the self-attention mechanism …

AutoFormer: Searching transformers for visual recognition

M Chen, H Peng, J Fu, H Ling - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Recently, pure transformer-based models have shown great potential for vision tasks such
as image classification and detection. However, the design of transformer networks is …