Vision transformers provably learn spatial structure
Vision Transformers (ViTs) have recently achieved comparable or superior
performance to convolutional neural networks (CNNs) in computer vision. This empirical …
Benign overfitting in two-layer convolutional neural networks
Modern neural networks often have great expressive power and can be trained to overfit the
training data, while still achieving a good test performance. This phenomenon is referred to …
Robustness to unbounded smoothness of generalized SignSGD
Traditional analyses in non-convex optimization typically rely on the smoothness
assumption, namely requiring the gradients to be Lipschitz. However, recent evidence …
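For reference, the vanilla SignSGD update that this family generalizes replaces each gradient coordinate with its sign, so the step size is decoupled from the gradient's magnitude. A minimal numpy sketch of the momentum variant follows; the names `signsgd_step`, `lr`, and `beta` are illustrative choices, not the paper's generalized algorithm or tuned values:

```python
import numpy as np

def signsgd_step(params, grads, momentum, lr=1e-3, beta=0.9):
    """One SignSGD step: move each parameter by lr times the sign of its
    momentum-averaged gradient, discarding the gradient's magnitude."""
    new_momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    new_params = [p - lr * np.sign(m) for p, m in zip(params, new_momentum)]
    return new_params, new_momentum
```

In practice the momentum buffers would be initialized to zeros with the same shapes as the parameters and threaded through successive calls.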
Towards understanding the mixture-of-experts layer in deep learning
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a
router, has achieved great success in deep learning. However, the understanding of such …
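A minimal numpy sketch of the sparsely-activated layer described here, assuming a linear softmax router, linear experts, and per-example top-k dispatch; real implementations batch the dispatch rather than looping:

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=1):
    """Sparse MoE: the router scores every expert per input, and only the
    top-k experts are evaluated and gate-weighted for each example."""
    # x: (batch, d_in); router_weights: (d_in, n_experts)
    # expert_weights: list of (d_in, d_out) matrices, one per expert
    logits = x @ router_weights                        # (batch, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax gate
    chosen = np.argsort(-probs, axis=1)[:, :top_k]     # selected experts
    out = np.zeros((x.shape[0], expert_weights[0].shape[1]))
    for i in range(x.shape[0]):
        for e in chosen[i]:
            out[i] += probs[i, e] * (x[i] @ expert_weights[e])
    return out
```

With top_k=1, each input activates a single expert, which is what makes the layer's compute cost nearly independent of the number of experts.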
Benign overfitting in two-layer ReLU convolutional neural networks
Modern deep learning models with great expressive power can be trained to overfit the
training data but still generalize well. This phenomenon is referred to as benign overfitting …
Understanding and improving feature learning for out-of-distribution generalization
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …
Robust learning with progressive data expansion against spurious correlation
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable spurious features rather than the core features …
Towards understanding mixture of experts in deep learning
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has
achieved great success in deep learning. However, the understanding of such architecture …
The benefits of mixup for feature learning
Mixup, a simple data augmentation method that randomly mixes two data points via linear
interpolation, has been extensively applied in various deep learning applications to gain …
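The linear interpolation described here is simple enough to state in a few lines. A minimal numpy sketch, assuming numeric (e.g. one-hot) labels and the standard Beta(alpha, alpha) mixing coefficient:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Mixup: form a convex combination of two examples and their labels,
    with the mixing weight drawn from a Beta(alpha, alpha) distribution."""
    lam = np.random.beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix
```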
The mechanism of prediction head in non-contrastive self-supervised learning
The surprising discovery of the BYOL method shows that negative samples can be replaced
by adding a prediction head to the network. It is mysterious why, even when there exist …
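To make the role of the prediction head concrete, here is a minimal numpy sketch of a BYOL-style non-contrastive loss: the online projection passes through a prediction head and is regressed onto an L2-normalized target projection, with no negative samples. The function and argument names are illustrative, and the stop-gradient on the target branch is implicit here because the target is treated as a constant array:

```python
import numpy as np

def byol_style_loss(online_z, target_z, predictor):
    """Regress the predicted, normalized online projection onto the
    normalized target projection (stop-gradient); no negatives are used."""
    p = predictor(online_z)                               # prediction head
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = target_z / np.linalg.norm(target_z, axis=1, keepdims=True)
    return np.mean(np.sum((p - t) ** 2, axis=1))
```

For instance, with a hypothetical linear predictor `lambda z: z @ W` the loss rewards alignment between the two augmented views' representations; the asymmetry introduced by the head is what the paper investigates as the mechanism preventing collapse.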