Max-margin token selection in attention mechanism

DA Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …

Robust training under label noise by over-parameterization

S Liu, Z Zhu, Q Qu, C You - International Conference on …, 2022 - proceedings.mlr.press
Recently, over-parameterized deep networks, with increasingly more parameters
than training samples, have dominated the performance of modern machine learning …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advances in NLP. The attention layer within the transformer admits a …
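For orientation, the classical hard-margin SVM program that this line of work relates attention to is the standard formulation, added here for reference rather than taken from the truncated snippet:

    \min_{w} \; \|w\|_2^2 \quad \text{subject to} \quad y_i \, w^\top x_i \ge 1 \;\; \text{for all } i

The solution separates the two classes with the largest possible ℓ2 margin, and the title suggests an analogous separating program arising inside the attention layer, at the level of tokens.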

Don't blame dataset shift! Shortcut learning due to gradients and cross entropy

AM Puli, L Zhang, Y Wald… - Advances in Neural …, 2023 - proceedings.neurips.cc
Common explanations for shortcut learning assume that the shortcut improves prediction
only under the training distribution. Thus, models trained in the typical way by minimizing log …

Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent

Z Li, T Wang, JD Lee, S Arora - Advances in Neural …, 2022 - proceedings.neurips.cc
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …
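A textbook instance of this equivalence, stated here as an illustration under the assumptions of gradient flow and positive weights (it is not quoted from the paper), is the elementwise quadratic reparametrization w = u ⊙ u:

    \dot u = -\nabla_u L(u \odot u) \;\Longleftrightarrow\; \frac{d}{dt}\,\nabla\phi(w) = -\nabla_w L(w), \qquad \phi(w) = \tfrac{1}{4}\sum_i \big(w_i \log w_i - w_i\big)

so gradient descent on the reparametrized model inherits the entropic geometry of the mirror map φ rather than the Euclidean geometry of the original parameters.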

Faster margin maximization rates for generic optimization methods

G Wang, Z Hu, V Muthukumar… - Advances in Neural …, 2023 - proceedings.neurips.cc
First-order optimization methods tend to inherently favor certain solutions over others when
minimizing a given training objective with multiple local optima. This phenomenon, known …

The Implicit Bias of Adam on Separable Data

C Zhang, D Zou, Y Cao - arXiv preprint arXiv:2406.10650, 2024 - arxiv.org
Adam has become one of the most favored optimizers in deep learning. Despite its
success in practice, numerous mysteries persist regarding its theoretical understanding. In …

Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks

N Tsilivis, G Vardi, J Kempe - arXiv preprint arXiv:2410.22069, 2024 - arxiv.org
We study the implicit bias of the general family of steepest descent algorithms, which
includes gradient descent, sign descent and coordinate descent, in deep homogeneous …
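The linear-model baseline behind these "flavors" is a known implicit-bias result, restated here for orientation rather than taken from the snippet: on linearly separable data with an exponentially-tailed loss, steepest descent with respect to a norm ‖·‖ converges in direction to the max-margin solution in that same norm,

    \frac{w_t}{\|w_t\|} \;\to\; \arg\max_{\|w\| \le 1} \; \min_i \; y_i \, w^\top x_i

so gradient descent (ℓ2), sign descent (ℓ∞), and coordinate descent (ℓ1) each maximize a different norm's margin.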

Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

M Wang, Z Min, L Wu - arXiv preprint arXiv:2311.14387, 2023 - arxiv.org
In this work, we investigate the margin-maximization bias exhibited by gradient-based
algorithms in classifying linearly separable data. We present an in-depth analysis of the …
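The quantity at stake here, and in "Faster margin maximization rates" above, is the normalized margin of a linear classifier, written out for reference (a standard definition, not part of the snippet):

    \gamma(w) \;=\; \min_i \; \frac{y_i \, w^\top x_i}{\|w\|_2}

For gradient descent on logistic loss over separable data, γ(w_t) is known to approach the maximum margin only at a slow O(1/log t) rate, which is what motivates methods that provably accelerate this convergence.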