DeiT III: Revenge of the ViT

H Touvron, M Cord, H Jégou - European Conference on Computer Vision, 2022 - Springer
Abstract A Vision Transformer (ViT) is a simple neural architecture amenable to serving
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …

Masked autoencoders are scalable vision learners

K He, X Chen, S Xie, Y Li, P Dollár… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners
for computer vision. Our MAE approach is simple: we mask random patches of the input …
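The random patch masking the MAE snippet describes can be illustrated with a minimal sketch; the function name, shapes, and 75% mask ratio here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking sketch (hypothetical helper):
    drop a fraction of the patches and keep only the visible subset."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=bool)       # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# 16 toy patches of dimension 4; with ratio 0.75, 4 remain visible
patches = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
visible, keep_idx, mask = random_mask_patches(patches)
```

The encoder then sees only `visible`, while the decoder is asked to reconstruct the masked patches indicated by `mask`.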

ResNet strikes back: An improved training procedure in timm

R Wightman, H Touvron, H Jégou - arXiv preprint arXiv:2110.00476, 2021 - arxiv.org
The influential Residual Networks designed by He et al. remain the gold-standard
architecture in numerous scientific publications. They typically serve as the default …

ViTAEv2: Vision transformer advanced by exploring inductive bias for image recognition and beyond

Q Zhang, Y Xu, J Zhang, D Tao - International Journal of Computer Vision, 2023 - Springer
Vision transformers have shown great potential in various computer vision tasks owing to
their strong capability to model long-range dependency using the self-attention mechanism …
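The "long-range dependency via self-attention" that this and several other entries refer to reduces to every token attending to every other token. A minimal single-head scaled dot-product sketch (identity projections for brevity; real ViTs use learned Q/K/V projections and multiple heads):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention sketch.
    x: (n_tokens, dim). Every output token is a weighted sum
    over ALL input tokens, hence the long-range interaction."""
    n, d = x.shape
    q, k, v = x, x, x                        # identity projections (sketch)
    scores = q @ k.T / np.sqrt(d)            # (n, n) pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over tokens
    return w @ v

x = np.random.default_rng(0).normal(size=(5, 8)).astype(np.float32)
out = self_attention(x)
```

Unlike a convolution, the receptive field here is the whole token set from the first layer on, which is the capability these papers exploit.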

ResMLP: Feedforward networks for image classification with data-efficient training

H Touvron, P Bojanowski, M Caron… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image
classification. It is a simple residual network that alternates (i) a linear layer in which image …

Transformer in transformer

K Han, A Xiao, E Wu, J Guo, C Xu… - Advances in neural …, 2021 - proceedings.neurips.cc
Transformer is a new kind of neural architecture which encodes the input data as powerful
features via the attention mechanism. Basically, the visual transformers first divide the input …
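The patch division this snippet mentions ("visual transformers first divide the input …") is a simple reshape. A sketch of non-overlapping patchification, with illustrative shapes (a 32×32×3 image and 16×16 patches are assumptions, not the paper's settings):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C (illustrative sketch)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by p"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)      # (num_patches, patch_dim)

img = np.zeros((32, 32, 3), dtype=np.float32)
patches = patchify(img, 16)              # 4 patches of dimension 768
```

TNT's contribution is then to run a second, inner transformer *within* each patch, on top of this outer patch sequence.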

Incorporating convolution designs into visual transformers

K Yuan, S Guo, Z Liu, A Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Motivated by the success of Transformers in natural language processing (NLP) tasks, there
exist some attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However …

Training data-efficient image transformers & distillation through attention

H Touvron, M Cord, M Douze, F Massa… - International …, 2021 - proceedings.mlr.press
Recently, neural networks purely based on attention were shown to address image
understanding tasks such as image classification. These high-performing vision …

ViTAE: Vision transformer advanced by exploring intrinsic inductive bias

Y Xu, Q Zhang, J Zhang, D Tao - Advances in neural …, 2021 - proceedings.neurips.cc
Transformers have shown great potential in various computer vision tasks owing to their
strong capability in modeling long-range dependency using the self-attention mechanism …

AutoFormer: Searching transformers for visual recognition

M Chen, H Peng, J Fu, H Ling - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Recently, pure transformer-based models have shown great potential for vision tasks such
as image classification and detection. However, the design of transformer networks is …