Max-margin token selection in attention mechanism

DA Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …

Robust training under label noise by over-parameterization

S Liu, Z Zhu, Q Qu, C You - International Conference on …, 2022 - proceedings.mlr.press
Recently, over-parameterized deep networks, with increasingly more parameters
than training samples, have dominated the performance of modern machine learning …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advances in NLP. The attention layer within the transformer admits a …
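For orientation, the classical hard-margin SVM program that this line of work relates attention to is the standard formulation, added here for reference rather than taken from the truncated snippet:

    \min_{w} \; \|w\|_2^2 \quad \text{subject to} \quad y_i \, w^\top x_i \ge 1 \;\; \text{for all } i

The solution separates the two classes with the largest possible ℓ2 margin, and the title suggests an analogous separating program arising inside the attention layer, at the level of tokens.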

Don't blame dataset shift! Shortcut learning due to gradients and cross entropy

AM Puli, L Zhang, Y Wald… - Advances in Neural …, 2023 - proceedings.neurips.cc
Common explanations for shortcut learning assume that the shortcut improves prediction
only under the training distribution. Thus, models trained in the typical way by minimizing log …

Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent

Z Li, T Wang, JD Lee, S Arora - Advances in Neural …, 2022 - proceedings.neurips.cc
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …
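A textbook instance of this equivalence, stated here as an illustration under the assumptions of gradient flow and positive weights (it is not quoted from the paper), is the elementwise quadratic reparametrization w = u ⊙ u:

    \dot u = -\nabla_u L(u \odot u) \;\Longleftrightarrow\; \frac{d}{dt}\,\nabla\phi(w) = -\nabla_w L(w), \qquad \phi(w) = \tfrac{1}{4}\sum_i \big(w_i \log w_i - w_i\big)

so gradient descent on the reparametrized model inherits the entropic geometry of the mirror map φ rather than the Euclidean geometry of the original parameters.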

Faster margin maximization rates for generic optimization methods

G Wang, Z Hu, V Muthukumar… - Advances in Neural …, 2023 - proceedings.neurips.cc
First-order optimization methods tend to inherently favor certain solutions over others when
minimizing a given training objective with multiple local optima. This phenomenon, known …

The Implicit Bias of Adam on Separable Data

C Zhang, D Zou, Y Cao - arXiv preprint arXiv:2406.10650, 2024 - arxiv.org
Adam has become one of the most favored optimizers in deep learning. Despite its
success in practice, numerous mysteries persist regarding its theoretical understanding. In …

Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks

N Tsilivis, G Vardi, J Kempe - arXiv preprint arXiv:2410.22069, 2024 - arxiv.org
We study the implicit bias of the general family of steepest descent algorithms, which
includes gradient descent, sign descent and coordinate descent, in deep homogeneous …
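The linear-model baseline behind these "flavors" is a known implicit-bias result, restated here for orientation rather than taken from the snippet: on linearly separable data with an exponentially-tailed loss, steepest descent with respect to a norm ‖·‖ converges in direction to the max-margin solution in that same norm,

    \frac{w_t}{\|w_t\|} \;\to\; \arg\max_{\|w\| \le 1} \; \min_i \; y_i \, w^\top x_i

so gradient descent (ℓ2), sign descent (ℓ∞), and coordinate descent (ℓ1) each maximize a different norm's margin.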

Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

M Wang, Z Min, L Wu - arXiv preprint arXiv:2311.14387, 2023 - arxiv.org
In this work, we investigate the margin-maximization bias exhibited by gradient-based
algorithms in classifying linearly separable data. We present an in-depth analysis of the …
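The quantity at stake here, and in "Faster margin maximization rates" above, is the normalized margin of a linear classifier, written out for reference (a standard definition, not part of the snippet):

    \gamma(w) \;=\; \min_i \; \frac{y_i \, w^\top x_i}{\|w\|_2}

For gradient descent on logistic loss over separable data, γ(w_t) is known to approach the maximum margin only at a slow O(1/log t) rate, which is what motivates methods that provably accelerate this convergence.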