A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is …

Transformers in medical imaging: A survey

F Shamshad, S Khan, SW Zamir, MH Khan… - Medical Image …, 2023 - Elsevier
Following unprecedented success on natural language tasks, Transformers have been
successfully applied to several computer vision problems, achieving state-of-the-art results …

Run, don't walk: chasing higher FLOPS for faster neural networks

J Chen, S Kao, H He, W Zhuo, S Wen… - Proceedings of the …, 2023 - openaccess.thecvf.com
To design fast neural networks, many works have focused on reducing the number of
floating-point operations (FLOPs). We observe that such a reduction in FLOPs, however, does …
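
FasterNet's answer to this mismatch is a partial convolution (PConv) that convolves only a fraction of the input channels and passes the rest through untouched, cutting FLOPs and memory access while keeping throughput (FLOPS) high. A minimal sketch, assuming a 1/4 channel split and a split-concat forward pass; shapes and defaults are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: convolve only the first 1/n_div of the channels,
    leave the remaining channels untouched (identity)."""
    def __init__(self, channels: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.conv_channels = channels // n_div        # channels that get convolved
        self.pass_channels = channels - self.conv_channels
        self.conv = nn.Conv2d(
            self.conv_channels, self.conv_channels,
            kernel_size, padding=kernel_size // 2, bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_channels, self.pass_channels], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)  # concat back in channel order

x = torch.randn(1, 64, 56, 56)
print(PConv(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```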

EfficientViT: Memory efficient vision transformer with cascaded group attention

X Liu, H Peng, N Zheng, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers have shown great success due to their high model capacity.
However, their remarkable performance is accompanied by heavy computation costs, which …
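
The cascaded group attention named in the title gives each head its own channel slice of the input and adds each head's output to the next head's input, so later heads see progressively refined features. A simplified sketch of that cascade, assuming equal channel splits and plain per-head QKV projections; the real module's reduced QK dimensions and depthwise components are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedGroupAttention(nn.Module):
    """Each head attends over its own channel slice; the output of head i
    is added to the input slice of head i+1 (the 'cascade')."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # one QKV projection per head, each acting on its own channel slice
        self.qkvs = nn.ModuleList(
            nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        chunks = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for chunk, qkv in zip(chunks, self.qkvs):
            h = chunk + carry                      # cascade previous head's output
            q, k, v = qkv(h).chunk(3, dim=-1)
            out = F.scaled_dot_product_attention(q, k, v)
            outs.append(out)
            carry = out
        return self.proj(torch.cat(outs, dim=-1))

x = torch.randn(2, 196, 256)
print(CascadedGroupAttention(256)(x).shape)  # torch.Size([2, 196, 256])
```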

DeiT III: Revenge of the ViT

H Touvron, M Cord, H Jégou - European conference on computer vision, 2022 - Springer
A Vision Transformer (ViT) is a simple neural architecture amenable to serving
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …

DaViT: Dual attention vision transformers

M Ding, B Xiao, N Codella, P Luo, J Wang… - European conference on …, 2022 - Springer
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective
vision transformer architecture that is able to capture global context while maintaining …
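
DaViT's two attentions are the usual spatial self-attention (tokens attend to tokens) and a channel self-attention in which transposing tokens and channels lets channel groups attend to each other with global context. A rough sketch of the channel half, with the head count and scaling factor as illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel group attention: transpose tokens and channels so each
    'channel token' aggregates information across all spatial positions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def heads(t):  # (B, N, C) -> (B, heads, head_dim, N)
            return t.view(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 3, 1)

        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)  # (B, heads, d, d): channel-to-channel
        out = attn.softmax(dim=-1) @ v                  # (B, heads, d, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 196, 256)
print(ChannelAttention(256)(x).shape)  # torch.Size([2, 196, 256])
```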

Rethinking vision transformers for MobileNet size and speed

Y Li, J Hu, Y Wen, G Evangelidis… - Proceedings of the …, 2023 - openaccess.thecvf.com
With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to
optimize the performance and complexity of ViTs to enable efficient deployment on mobile …

TopFormer: Token pyramid transformer for mobile semantic segmentation

W Zhang, Z Huang, G Luo, T Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Although vision transformers (ViTs) have achieved great success in computer vision, their
heavy computational cost hampers their application to dense prediction tasks such as …

TinyViT: Fast pretraining distillation for small vision transformers

K Wu, J Zhang, H Peng, M Liu, B Xiao, J Fu… - European conference on …, 2022 - Springer
The Vision Transformer (ViT) has recently drawn great attention in computer vision due to its
remarkable model capability. However, most prevailing ViT models suffer from a huge number …
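
The pretraining distillation in the title means the small student is supervised by a large teacher's soft predictions, which TinyViT precomputes and stores so the teacher need not run during student training. A generic soft-label distillation loss along those lines; the temperature and shapes are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions; teacher_logits can be precomputed and loaded from disk."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student = torch.randn(8, 1000, requires_grad=True)
teacher = torch.randn(8, 1000)  # in practice: stored offline, not recomputed
loss = distillation_loss(student, teacher)
loss.backward()
```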

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representations from images and videos, due to
the large local redundancy and complex global dependency in these visual data. Convolution …
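
The unification in the title treats convolution and self-attention as two instances of relation aggregation: local aggregation handles redundancy in shallow stages, global attention handles long-range dependency in deep ones. A toy block showing the two interchangeable mixers; kernel size, normalization, and stage placement are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LocalMixer(nn.Module):
    """Shallow-stage aggregator: depthwise conv over a small neighborhood."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dw(x)

class GlobalMixer(nn.Module):
    """Deep-stage aggregator: full self-attention over all positions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        t, _ = self.attn(t, t, t)
        return t.transpose(1, 2).view(B, C, H, W)

class UnifiedBlock(nn.Module):
    """Same residual block shape; only the relation aggregator differs."""
    def __init__(self, dim: int, use_attention: bool):
        super().__init__()
        self.mixer = GlobalMixer(dim) if use_attention else LocalMixer(dim)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x):
        return x + self.mixer(self.norm(x))    # residual token mixing

x = torch.randn(1, 64, 14, 14)
print(UnifiedBlock(64, use_attention=True)(x).shape)  # torch.Size([1, 64, 14, 14])
```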