Vision transformers provably learn spatial structure

S Jelassi, M Sander, Y Li - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Vision Transformers (ViTs) have recently achieved comparable or superior
performance to convolutional neural networks (CNNs) in computer vision. This empirical …

Benign overfitting in two-layer convolutional neural networks

Y Cao, Z Chen, M Belkin, Q Gu - Advances in neural …, 2022 - proceedings.neurips.cc
Modern neural networks often have great expressive power and can be trained to overfit the
training data, while still achieving good test performance. This phenomenon is referred to …

Robustness to unbounded smoothness of generalized SignSGD

M Crawshaw, M Liu, F Orabona… - Advances in neural …, 2022 - proceedings.neurips.cc
Traditional analyses in non-convex optimization typically rely on the smoothness
assumption, namely requiring the gradients to be Lipschitz. However, recent evidence …
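
For context, a minimal sketch of the plain signSGD update that sign-based methods build on; the function name, learning rate, and PyTorch framing are illustrative assumptions, not the generalized, Adam-like variant analyzed in the paper:

```python
import torch

def signsgd_step(params, lr=1e-3):
    """Illustrative signSGD step: move each parameter by the sign of its gradient."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                # Only the sign of the gradient is used, not its magnitude,
                # which is why Lipschitz-smoothness plays a different role here.
                p -= lr * torch.sign(p.grad)
```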

Towards understanding the mixture-of-experts layer in deep learning

Z Chen, Y Deng, Y Wu, Q Gu… - Advances in neural …, 2022 - proceedings.neurips.cc
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a
router, has achieved great success in deep learning. However, the understanding of such …
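
For readers unfamiliar with the architecture, a minimal sketch of a sparsely-activated MoE layer with a softmax router and top-1 expert selection; the expert count, layer sizes, and top-1 routing rule are illustrative assumptions, not the exact setting studied in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneMoE(nn.Module):
    """Toy MoE layer: a linear router picks one small MLP expert per input."""

    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, dim)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        idx = gate.argmax(dim=-1)                  # top-1 expert per input
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                         # run only the selected expert
                out[mask] = gate[mask, e:e + 1] * expert(x[mask])
        return out
```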

Benign overfitting in two-layer ReLU convolutional neural networks

Y Kou, Z Chen, Y Chen, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
Modern deep learning models with great expressive power can be trained to overfit the
training data but still generalize well. This phenomenon is referred to as benign overfitting …

Understanding and improving feature learning for out-of-distribution generalization

Y Chen, W Huang, K Zhou, Y Bian… - Advances in Neural …, 2024 - proceedings.neurips.cc
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …

Robust learning with progressive data expansion against spurious correlation

Y Deng, Y Yang, B Mirzasoleiman… - Advances in neural …, 2024 - proceedings.neurips.cc
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable spurious features rather than the core features …

Towards understanding mixture of experts in deep learning

Z Chen, Y Deng, Y Wu, Q Gu, Y Li - arXiv preprint arXiv:2208.02813, 2022 - arxiv.org
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has
achieved great success in deep learning. However, the understanding of such architecture …

The benefits of mixup for feature learning

D Zou, Y Cao, Y Li, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
Mixup, a simple data augmentation method that randomly mixes two data points via linear
interpolation, has been extensively applied in various deep learning applications to gain …
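
A minimal sketch of the mixup operation described here, assuming one-hot labels and a Beta-distributed mixing weight; the alpha value and function signature are illustrative choices, not the paper's setting:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two (input, one-hot label) pairs."""
    lam = np.random.beta(alpha, alpha)      # mixing weight in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2         # linearly interpolate inputs
    y = lam * y1 + (1.0 - lam) * y2         # and their labels
    return x, y
```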

The mechanism of prediction head in non-contrastive self-supervised learning

Z Wen, Y Li - Advances in Neural Information Processing …, 2022 - proceedings.neurips.cc
The surprising discovery of the BYOL method shows that negative samples can be replaced
by adding a prediction head to the network. It is mysterious why, even when there exist …
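
A minimal sketch of a BYOL-style non-contrastive objective, in which a prediction head on the online encoder is trained to match a stop-gradient target encoder on two augmented views, with no negative samples; the module interfaces and the cosine loss form are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def byol_loss(online_encoder, predictor, target_encoder, view1, view2):
    """Negative cosine similarity between the prediction-head output and the
    stop-gradient target features, symmetrized over the two views."""
    def one_side(v_online, v_target):
        p = predictor(online_encoder(v_online))   # prediction head on online branch
        with torch.no_grad():
            z = target_encoder(v_target)           # target branch, no gradient
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    return one_side(view1, view2) + one_side(view2, view1)
```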