Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2023 - dl.acm.org
Fine-tuning visual models has been widely shown to yield promising performance on many
downstream visual tasks. With the rapid development of pre-trained visual foundation …
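
As a rough illustration of the fine-tuning setup this survey covers, the sketch below loads a pre-trained backbone, swaps in a new task head, and updates all parameters. It assumes PyTorch and torchvision are available; the backbone, head size, and optimizer settings are illustrative choices, not taken from the survey.

```python
# Minimal full fine-tuning sketch (assumes torch + torchvision are installed).
# The backbone, head size, and optimizer settings are illustrative only.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # hypothetical downstream task
model = models.resnet50(weights="IMAGENET1K_V2")   # pre-trained visual backbone
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new task head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # all params updated
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One full fine-tuning step: every backbone weight receives gradients."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```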

A comprehensive survey of transformers for computer vision

S Jamil, M Jalil Piran, OJ Kwon - Drones, 2023 - mdpi.com
As a special type of transformer, the vision transformer (ViT) can be used for various computer
vision (CV) applications. Convolutional neural networks (CNNs) have several potential …

Adding conditional control to text-to-image diffusion models

L Zhang, A Rao, M Agrawala - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
We present ControlNet, a neural network architecture to add spatial conditioning controls to
large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large …
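
A minimal sketch of the conditioning idea in the abstract: the pre-trained block is frozen ("locked"), a trainable copy processes the spatial condition, and its output is added back through a zero-initialized projection so training starts as a no-op. This is a simplified PyTorch illustration, not the paper's full architecture; all names and shapes are assumptions.

```python
# Simplified ControlNet-style block (PyTorch sketch). The frozen block stays
# fixed, a trainable copy consumes the spatial condition, and a zero-initialized
# 1x1 conv makes the control branch contribute nothing at initialization.
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        for p in self.locked.parameters():          # "lock" the pre-trained weights
            p.requires_grad = False
        self.control_copy = copy.deepcopy(pretrained_block)  # trainable copy
        self.zero_proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)       # zero-initialized projection
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # Frozen path plus a residual control signal that starts at zero.
        # `condition` is assumed to be pre-encoded to the same shape as x.
        return self.locked(x) + self.zero_proj(self.control_copy(x + condition))
```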

Vision transformer adapter for dense predictions

Z Chen, Y Duan, W Wang, J He, T Lu, J Dai… - arXiv preprint arXiv …, 2022 - arxiv.org
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …
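
For orientation, the sketch below shows a generic bottleneck adapter in the spirit of adapter-based tuning: a small down-project/up-project MLP inserted with a residual connection and trained while the ViT stays frozen. It is not the paper's specific dense-prediction adapter design; it assumes PyTorch, and the module names are illustrative.

```python
# Generic bottleneck adapter (PyTorch sketch): trained while the ViT is frozen.
# This illustrates adapter-style tuning in general, not this paper's design.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); the residual keeps frozen features.
        return tokens + self.up(self.act(self.down(tokens)))
```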

Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for
object detection. This design enables the original ViT architecture to be fine-tuned for object …
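
The abstract's point is that a plain, single-scale ViT can serve as a detection backbone without a hierarchical redesign; one common way to realize this is to build a simple feature pyramid from the last ViT feature map using deconvolutions and pooling. The sketch below assumes PyTorch; the layer choices and channel handling are simplified and illustrative, not the paper's exact configuration.

```python
# Building multi-scale features from a single-scale ViT feature map (sketch).
# Input is assumed to be the last ViT feature map reshaped to
# (batch, dim, H/16, W/16); channel widths here are simplified.
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.up4x = nn.Sequential(                      # stride 1/16 -> 1/4
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2),
        )
        self.up2x = nn.ConvTranspose2d(dim, dim // 2, 2, stride=2)  # 1/16 -> 1/8
        self.down2x = nn.MaxPool2d(kernel_size=2, stride=2)         # 1/16 -> 1/32

    def forward(self, feat: torch.Tensor):
        # Returns features at strides {4, 8, 16, 32} for a detection head.
        return [self.up4x(feat), self.up2x(feat), feat, self.down2x(feat)]
```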

Visual prompt tuning

M Jia, L Tang, BC Chen, C Cardie, S Belongie… - … on Computer Vision, 2022 - Springer
The current modus operandi in adapting pre-trained models involves updating all the
backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) …
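
A minimal sketch of the prompt-tuning alternative to full fine-tuning: learnable prompt tokens are prepended to the patch embeddings and only the prompts (plus any task head) receive gradients, while the transformer backbone stays frozen. This assumes PyTorch and a backbone that accepts a token sequence; names and the prompt count are illustrative.

```python
# Visual-prompt-style tuning (sketch): only the prepended prompt tokens are
# trained; the transformer backbone is frozen. Names are illustrative.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, vit_backbone: nn.Module, embed_dim: int, num_prompts: int = 10):
        super().__init__()
        self.backbone = vit_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                  # frozen backbone
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.normal_(self.prompts, std=0.02)      # learnable prompt tokens

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the patch embedder;
        # the backbone is assumed to consume the concatenated token sequence.
        prompts = self.prompts.expand(patch_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, patch_tokens], dim=1))
```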

LayoutLMv3: Pre-training for Document AI with unified text and image masking

Y Huang, T Lv, L Cui, Y Lu, F Wei - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Self-supervised pre-training techniques have achieved remarkable progress in Document
AI. Most multimodal pre-trained models use a masked language modeling objective to learn …
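
To make the masked language modeling objective mentioned in the abstract concrete, the sketch below randomly replaces a fraction of input tokens with a mask id and scores predictions only at the masked positions. It is an illustration of the general objective under assumed shapes, not LayoutLMv3's actual training code.

```python
# Masked-token objective (sketch): mask a random fraction of tokens and compute
# cross-entropy only at the masked positions. Illustrative of masked language
# modeling in general, not LayoutLMv3's training pipeline.
import torch
import torch.nn.functional as F

def masked_lm_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer ids.
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    logits = model(corrupted)                       # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits[mask],                               # predictions at masked slots
        token_ids[mask],                            # original ids as targets
    )
```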

Real-world robot learning with masked visual pre-training

I Radosavovic, T Xiao, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-
wild videos for real-world robotic tasks. Like prior work, our visual representations are pre …

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Masked autoencoders are scalable vision learners

K He, X Chen, S Xie, Y Li, P Dollár… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners
for computer vision. Our MAE approach is simple: we mask random patches of the input …
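
A minimal sketch of the random-masking step the abstract describes: the image is patchified, a large random subset of patch tokens is dropped, and only the visible subset is passed to the encoder (the masked patches are reconstructed later by a decoder). The masking ratio and shapes below are illustrative; PyTorch is assumed.

```python
# Random patch masking in the MAE style (sketch): keep only a small random
# subset of patch tokens for the encoder; the rest are reconstructed later.
import torch

def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    # patch_tokens: (batch, num_patches, dim)
    batch, num_patches, _ = patch_tokens.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))

    noise = torch.rand(batch, num_patches)              # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]       # lowest-noise patches kept
    visible = torch.gather(
        patch_tokens, dim=1,
        index=keep_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)),
    )
    return visible, keep_idx                            # encoder sees only `visible`
```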