Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …

Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet

L Yuan, Y Chen, T Wang, W Yu, Y Shi… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformers, which are popular for language modeling, have been explored for solving
vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model …

Transformers in computational visual media: A survey

Y Xu, H Wei, M Lin, Y Deng, K Sheng, M Zhang… - Computational Visual …, 2022 - Springer
Transformers, the dominant architecture for natural language processing, have also recently
attracted much attention from computational visual media researchers due to their capacity …

Long-short transformer: Efficient transformers for language and vision

C Zhu, W Ping, C Xiao, M Shoeybi… - Advances in neural …, 2021 - proceedings.neurips.cc
Transformers have achieved success in both language and vision domains. However, it is
prohibitively expensive to scale them to long sequences such as long documents or high …

Incorporating convolution designs into visual transformers

K Yuan, S Guo, Z Liu, A Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Motivated by the success of Transformers in natural language processing (NLP) tasks, there
exist some attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However …

Learning to merge tokens in vision transformers

C Renggli, AS Pinto, N Houlsby, B Mustafa… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformers are widely applied to solve natural language understanding and computer
vision tasks. While scaling up these architectures leads to improved performance, it often …

Pay attention to MLPs

H Liu, Z Dai, D So, QV Le - Advances in neural information …, 2021 - proceedings.neurips.cc
Transformers have become one of the most important architectural innovations in deep
learning and have enabled many breakthroughs over the past few years. Here we propose a …

LocalViT: Bringing locality to vision transformers

Y Li, K Zhang, J Cao, R Timofte, L Van Gool - arXiv preprint arXiv …, 2021 - arxiv.org
We study how to introduce locality mechanisms into vision transformers. The transformer
network originates from machine translation and is particularly good at modelling long-range …