Refiner: Refining self-attention for vision transformers
Vision Transformers (ViTs) have shown competitive accuracy on image classification tasks
compared with CNNs. Yet, they generally require much more data for model pre-training.
Most recent works are therefore dedicated to designing more complex architectures or training
methods to address the data-efficiency issue of ViTs. However, few of them explore
improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs.
In contrast to existing works, we introduce a conceptually simple scheme, called refiner, to …
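The abstract centers on the self-attention mechanism as the key factor distinguishing ViTs from CNNs (the refiner's own details are truncated above). For context, here is a minimal NumPy sketch of the standard multi-head self-attention used in ViTs; all function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """Standard multi-head self-attention over a sequence of patch tokens.

    x:      (seq_len, dim) patch embeddings
    w_qkv:  (dim, 3*dim) joint query/key/value projection
    w_out:  (dim, dim) output projection
    Returns the attended tokens and the per-head attention maps.
    """
    seq_len, dim = x.shape
    head_dim = dim // num_heads

    qkv = x @ w_qkv                          # (seq_len, 3*dim)
    q, k, v = np.split(qkv, 3, axis=-1)      # each (seq_len, dim)

    # Reshape each to (num_heads, seq_len, head_dim).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = map(split_heads, (q, k, v))

    # Scaled dot-product attention, computed per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)               # softmax rows sum to 1

    out = attn @ v                                         # (heads, seq, head_dim)
    out = out.transpose(1, 0, 2).reshape(seq_len, dim)     # merge heads
    return out @ w_out, attn
```

The per-head attention maps returned here are the objects a refinement scheme would operate on before they are applied to the values.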