Mimic before reconstruct: Enhancing masked autoencoders with feature mimicking
International Journal of Computer Vision, 2024•Springer
Abstract
Masked Autoencoders (MAE) are a popular paradigm for large-scale vision representation pre-training. However, MAE solely reconstructs the low-level RGB signals after the decoder and lacks supervision on high-level semantics for the encoder, and thus suffers from sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Unlike those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. Because MR-MAE applies the high-level and low-level targets to different partitions of the tokens, the learning conflicts between them are naturally avoided, which yields superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by % and the previous state-of-the-art BEiT V2 base by %. Pretrained checkpoints are released at https://github.com/Alpha-VL/ConvMAE.
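The split of supervision described in the abstract (a mimic loss on the visible tokens at the encoder output, a pixel reconstruction loss on the masked tokens at the decoder output) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `random_mask` and `mr_mae_loss`, the use of mean squared error for both terms, and the `mimic_weight` parameter are assumptions made for illustration, since the abstract does not specify the loss form or its weighting.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Randomly split token indices into (visible_idx, masked_idx)."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx


def mr_mae_loss(encoder_feats, teacher_feats, decoder_pixels, target_pixels,
                visible_idx, masked_idx, mimic_weight=1.0):
    """Hypothetical two-part loss in the spirit of the abstract.

    encoder_feats : (V, D) student encoder outputs for the visible tokens
    teacher_feats : (N, D) frozen teacher features (e.g. CLIP or DINO) for all tokens
    decoder_pixels: (N, P) decoder pixel predictions for all tokens
    target_pixels : (N, P) ground-truth pixel values per token
    """
    # High-level target: applied only to visible tokens, at the encoder output.
    # MSE is an assumption; the abstract only says "mimic loss".
    mimic = np.mean((encoder_feats - teacher_feats[visible_idx]) ** 2)
    # Low-level target: applied only to masked tokens, at the decoder output.
    recon = np.mean((decoder_pixels[masked_idx] - target_pixels[masked_idx]) ** 2)
    return mimic_weight * mimic + recon, mimic, recon


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D, P = 196, 8, 12  # tokens, feature dim, pixels per token (toy sizes)
    visible_idx, masked_idx = random_mask(N, 0.75, rng)

    teacher = rng.normal(size=(N, D))
    pixels = rng.normal(size=(N, P))
    student = rng.normal(size=(len(visible_idx), D))
    decoded = rng.normal(size=(N, P))

    total, mimic, recon = mr_mae_loss(student, teacher, decoded, pixels,
                                      visible_idx, masked_idx)
    print(len(visible_idx), len(masked_idx), total > 0)
```

Because the two index sets are disjoint, each token receives exactly one kind of target in this sketch, which is the property the abstract credits with avoiding conflicts between high-level and low-level supervision.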