DaViT: Dual attention vision transformers
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective
vision transformer architecture that is able to capture global context while maintaining …
Rethinking network design and local geometry in point cloud: A simple residual MLP framework
Point cloud analysis is challenging due to irregularity and unordered data structure. To
capture the 3D geometries, prior works mainly rely on exploring sophisticated local …
Patches are all you need?
A Trockman, JZ Kolter - arXiv preprint arXiv:2201.09792, 2022 - arxiv.org
Although convolutional networks have been the dominant architecture for vision tasks for
many years, recent experiments have shown that Transformer-based models, most notably …
Focal modulation networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is
completely replaced by a focal modulation module for modeling token interactions in vision …
A survey on vision transformer
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …
More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
Transformers have quickly shined in the computer vision world since the emergence of
Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) …
S2-MLP: Spatial-shift MLP architecture for vision
Abstract Recently, visual Transformer (ViT) and its following works abandon the convolution
and exploit the self-attention operation, attaining a comparable or even higher accuracy than …
Key issues in vision Transformer research: current status and prospects
田永林, 王雨桐, 王建功, 王晓, 王飞跃 - Acta Automatica Sinica (自动化学报), 2022 - aas.net.cn
The Transformer's long-range modeling capability and parallel computing capability have brought it great success in natural language processing, and it has gradually been extended to computer vision and other fields. Taking classification tasks as an entry point, this paper introduces typical vision Transformer …
An image patch is a wave: Phase-aware vision MLP
In the field of computer vision, recent works show that a pure MLP architecture mainly
stacked by fully-connected layers can achieve competing performance with CNN and …
Spatio-temporal relation modeling for few-shot action recognition
We propose a novel few-shot action recognition framework, STRM, which enhances class-
specific feature discriminability while simultaneously learning higher-order temporal …