DaViT: Dual attention vision transformers
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective
vision transformer architecture that is able to capture global context while maintaining …
Rethinking network design and local geometry in point cloud: A simple residual MLP framework
Point cloud analysis is challenging due to irregularity and unordered data structure. To
capture the 3D geometries, prior works mainly rely on exploring sophisticated local …
Patches are all you need?
A Trockman, JZ Kolter - arXiv preprint arXiv:2201.09792, 2022 - arxiv.org
Although convolutional networks have been the dominant architecture for vision tasks for
many years, recent experiments have shown that Transformer-based models, most notably …
Focal modulation networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is
completely replaced by a focal modulation module for modeling token interactions in vision …
A survey on vision transformer
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …
More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
Transformers have quickly shined in the computer vision world since the emergence of
Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) …
S2-MLP: Spatial-shift MLP architecture for vision
Abstract Recently, visual Transformer (ViT) and its following works abandon the convolution
and exploit the self-attention operation, attaining a comparable or even higher accuracy than …
Key issues in vision Transformer research: current status and prospects
田永林, 王雨桐, 王建功, 王晓, 王飞跃 - Acta Automatica Sinica (自动化学报), 2022 - aas.net.cn
The Transformer's long-range modeling capability and parallel computing capability have brought it great success in natural language processing, and it has gradually been extended to computer vision and other fields. Taking classification tasks as an entry point, this paper introduces typical vision Transformer …
An image patch is a wave: Phase-aware vision MLP
In the field of computer vision, recent works show that a pure MLP architecture mainly
stacked by fully-connected layers can achieve competing performance with CNN and …
Spatio-temporal relation modeling for few-shot action recognition
We propose a novel few-shot action recognition framework, STRM, which enhances class-
specific feature discriminability while simultaneously learning higher-order temporal …