Delving deep into the generalization of vision transformers under distribution shifts

C Zhang, M Zhang, S Zhang, D Jin… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022 - openaccess.thecvf.com
Abstract
Recently, Vision Transformers have achieved impressive results on various vision tasks, yet their generalization ability under different distribution shifts is poorly understood. In this work, we provide a comprehensive study of the out-of-distribution generalization of Vision Transformers. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual levels: corruption shift, background shift, texture shift, destruction shift, and style shift. We then perform extensive evaluations of Vision Transformer variants under different levels of distribution shift and compare their generalization ability with that of Convolutional Neural Network (CNN) models. Several important observations are obtained: 1) Vision Transformers generalize better than CNNs under multiple distribution shifts. With the same or a smaller number of parameters, Vision Transformers lead their corresponding CNNs by more than 5% in top-1 accuracy under most types of distribution shift, and by more than 10% under corruption shifts. 2) Larger Vision Transformers gradually narrow the gap between in-distribution (ID) and out-of-distribution (OOD) performance. To further improve the generalization of Vision Transformers, we design generalization-enhanced Vision Transformers based on self-supervised learning, information theory, and adversarial learning. By investigating these three types of generalization-enhanced Transformers, we observe the gradient sensitivity of Vision Transformers and design a smoother learning strategy to achieve a stable training process. With the modified training schemes, we improve performance on out-of-distribution data by 4% over vanilla Vision Transformers. We comprehensively compare these three types of generalization-enhanced Vision Transformers with their corresponding CNN models and observe that: 1) for the enhanced models, larger Vision Transformers still benefit more in out-of-distribution generalization; 2) generalization-enhanced Vision Transformers are more sensitive to hyper-parameters than their corresponding CNN models. We hope our comprehensive study can shed light on the design of more generalizable learning systems.
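The corruption-shift comparison summarized above can be illustrated with a minimal sketch in Python using PyTorch and timm (this is not the authors' released code). It measures top-1 accuracy for a ViT and a roughly parameter-matched CNN on clean and noise-corrupted inputs. The model pairing (DeiT-S vs. ResNet-50), the severity-to-noise mapping in corrupt(), and the val_loader argument are illustrative assumptions, not taken from the paper.

import torch
import timm

def corrupt(images, severity=3):
    # Additive Gaussian noise as a stand-in for one corruption type.
    # Assumed severity-to-noise mapping; real benchmarks (e.g. ImageNet-C) apply
    # corruptions to raw pixels before normalization, which is skipped here for brevity.
    sigma = 0.04 * severity
    return images + sigma * torch.randn_like(images)

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda", apply_shift=False):
    # Top-1 accuracy of `model` over `loader`, optionally under the corruption shift.
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        if apply_shift:
            images = corrupt(images)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def compare_under_corruption(val_loader):
    # `val_loader` is an assumed ImageNet validation DataLoader (construction not shown).
    # Roughly parameter-matched pair: DeiT-S (~22M params) vs. ResNet-50 (~25M params).
    vit = timm.create_model("deit_small_patch16_224", pretrained=True)
    cnn = timm.create_model("resnet50", pretrained=True)
    for name, model in [("DeiT-S", vit), ("ResNet-50", cnn)]:
        clean = top1_accuracy(model, val_loader)
        shifted = top1_accuracy(model, val_loader, apply_shift=True)
        print(f"{name}: clean {clean:.3f}, corrupted {shifted:.3f}, drop {clean - shifted:.3f}")

The clean-versus-corrupted accuracy drop, aggregated over corruption types and severities, is the kind of gap the abstract refers to when it reports Vision Transformers leading CNNs by more than 10% under corruption shifts.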