Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks with different data modalities. A PFM (eg, BERT, ChatGPT, and GPT-4) is …

被引用次数：473 相关文章所有 2 个版本

[PDF] springer.com

Attention mechanisms in computer vision: A survey

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational visual …, 2022 - Springer

Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …

被引用次数：1553 相关文章所有 8 个版本

[PDF] thecvf.com

Run, don't walk: chasing higher FLOPS for faster neural networks

J Chen, S Kao, H He, W Zhuo, S Wen… - Proceedings of the …, 2023 - openaccess.thecvf.com

To design fast neural networks, many works have been focusing on reducing the number of
floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does …

被引用次数：774 相关文章所有 10 个版本

[PDF] thecvf.com

Internimage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …

被引用次数：587 相关文章所有 8 个版本

[PDF] thecvf.com

Biformer: Vision transformer with bi-level routing attention

L Zhu, X Wang, Z Ke, W Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

As the core building block of vision transformers, attention is a powerful tool to capture long-
range dependency. However, such power comes at a cost: it incurs a huge computation …

被引用次数：468 相关文章所有 10 个版本

[PDF] neurips.cc

Segnext: Rethinking convolutional attention design for semantic segmentation

MH Guo, CZ Lu, Q Hou, Z Liu… - Advances in Neural …, 2022 - proceedings.neurips.cc

We present SegNeXt, a simple convolutional network architecture for semantic
segmentation. Recent transformer-based models have dominated the field of se-mantic …

被引用次数：529 相关文章所有 6 个版本

[PDF] thecvf.com

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

被引用次数：261 相关文章所有 7 个版本

[PDF] arxiv.org

Vision mamba: Efficient visual representation learning with bidirectional state space model

L Zhu, B Liao, Q Zhang, X Wang, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently the state space models (SSMs) with efficient hardware-aware designs, ie, the
Mamba deep learning model, have shown great potential for long sequence modeling …

被引用次数：460 相关文章所有 5 个版本

[PDF] thecvf.com

Efficientvit: Memory efficient vision transformer with cascaded group attention

X Liu, H Peng, N Zheng, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers have shown great success due to their high model capabilities.
However, their remarkable performance is accompanied by heavy computation costs, which …

被引用次数：193 相关文章所有 8 个版本

[PDF] thecvf.com

Dual aggregation transformer for image super-resolution

Z Chen, Y Zhang, J Gu, L Kong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Transformer has recently gained considerable popularity in low-level vision tasks, including
image super-resolution (SR). These networks utilize self-attention along different …

被引用次数：132 相关文章所有 9 个版本