Vision transformers provably learn spatial structure
Vision Transformers (ViTs) have recently achieved comparable or superior
performance to convolutional neural networks (CNNs) in computer vision. This empirical …
Benign overfitting in two-layer convolutional neural networks
Modern neural networks often have great expressive power and can be trained to overfit the
training data, while still achieving a good test performance. This phenomenon is referred to …
Robustness to unbounded smoothness of generalized SignSGD
Traditional analyses in non-convex optimization typically rely on the smoothness
assumption, namely requiring the gradients to be Lipschitz. However, recent evidence …
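For reference, the vanilla SignSGD update that this family generalizes replaces each gradient coordinate with its sign, so the step size is decoupled from the gradient's magnitude. A minimal numpy sketch of the momentum variant follows; the names `signsgd_step`, `lr`, and `beta` are illustrative choices, not the paper's generalized algorithm or tuned values:

```python
import numpy as np

def signsgd_step(params, grads, momentum, lr=1e-3, beta=0.9):
    """One SignSGD step: move each parameter by lr times the sign of its
    momentum-averaged gradient, discarding the gradient's magnitude."""
    new_momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    new_params = [p - lr * np.sign(m) for p, m in zip(params, new_momentum)]
    return new_params, new_momentum
```

In practice the momentum buffers would be initialized to zeros with the same shapes as the parameters and threaded through successive calls.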
Towards understanding the mixture-of-experts layer in deep learning
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a
router, has achieved great success in deep learning. However, the understanding of such …
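A minimal numpy sketch of the sparsely-activated layer described here, assuming a linear softmax router, linear experts, and per-example top-k dispatch; real implementations batch the dispatch rather than looping:

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=1):
    """Sparse MoE: the router scores every expert per input, and only the
    top-k experts are evaluated and gate-weighted for each example."""
    # x: (batch, d_in); router_weights: (d_in, n_experts)
    # expert_weights: list of (d_in, d_out) matrices, one per expert
    logits = x @ router_weights                        # (batch, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax gate
    chosen = np.argsort(-probs, axis=1)[:, :top_k]     # selected experts
    out = np.zeros((x.shape[0], expert_weights[0].shape[1]))
    for i in range(x.shape[0]):
        for e in chosen[i]:
            out[i] += probs[i, e] * (x[i] @ expert_weights[e])
    return out
```

With top_k=1, each input activates a single expert, which is what makes the layer's compute cost nearly independent of the number of experts.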
Benign overfitting in two-layer ReLU convolutional neural networks
Modern deep learning models with great expressive power can be trained to overfit the
training data but still generalize well. This phenomenon is referred to as benign overfitting …
Understanding and improving feature learning for out-of-distribution generalization
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …
Robust learning with progressive data expansion against spurious correlation
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable spurious features rather than the core features …
Towards understanding mixture of experts in deep learning
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has
achieved great success in deep learning. However, the understanding of such architecture …
The benefits of mixup for feature learning
Mixup, a simple data augmentation method that randomly mixes two data points via linear
interpolation, has been extensively applied in various deep learning applications to gain …
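The linear interpolation described here is simple enough to state in a few lines. A minimal numpy sketch, assuming numeric (e.g. one-hot) labels and the standard Beta(alpha, alpha) mixing coefficient:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Mixup: form a convex combination of two examples and their labels,
    with the mixing weight drawn from a Beta(alpha, alpha) distribution."""
    lam = np.random.beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix
```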
The mechanism of prediction head in non-contrastive self-supervised learning
The surprising discovery of the BYOL method shows that negative samples can be replaced
by adding a prediction head to the network. It is mysterious why, even when there exist …
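To make the role of the prediction head concrete, here is a minimal numpy sketch of a BYOL-style non-contrastive loss: the online projection passes through a prediction head and is regressed onto an L2-normalized target projection, with no negative samples. The function and argument names are illustrative, and the stop-gradient on the target branch is implicit here because the target is treated as a constant array:

```python
import numpy as np

def byol_style_loss(online_z, target_z, predictor):
    """Regress the predicted, normalized online projection onto the
    normalized target projection (stop-gradient); no negatives are used."""
    p = predictor(online_z)                               # prediction head
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = target_z / np.linalg.norm(target_z, axis=1, keepdims=True)
    return np.mean(np.sum((p - t) ** 2, axis=1))
```

For instance, with a hypothetical linear predictor `lambda z: z @ W` the loss rewards alignment between the two augmented views' representations; the asymmetry introduced by the head is what the paper investigates as the mechanism preventing collapse.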