Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …
Robust training under label noise by over-parameterization
Recently, over-parameterized deep networks, with increasingly more network parameters
than training samples, have dominated the performance of modern machine learning …
Transformers as support vector machines
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
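For context on the attention layer referred to in the two abstracts above, the sketch below computes a single softmax attention head in plain numpy. It is illustrative only; the helper name, weight initialization, and shapes are chosen for this sketch and it is not the SVM reformulation the paper derives.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def single_head_attention(X, W_q, W_k, W_v):
        """One softmax attention head over a token matrix X of shape (n_tokens, d)."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[1])   # scaled pairwise token similarities
        A = softmax(scores, axis=-1)             # each row: attention weights for one query token
        return A @ V                             # convex combination of value vectors

    rng = np.random.default_rng(0)
    n, d = 4, 8
    X = rng.normal(size=(n, d))
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    out = single_head_attention(X, W_q, W_k, W_v)
    print(out.shape)  # (4, 8)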
Don't blame dataset shift! Shortcut learning due to gradients and cross entropy
Common explanations for shortcut learning assume that the shortcut improves prediction
only under the training distribution. Thus, models trained in the typical way by minimizing log …
Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent
As part of the effort to understand implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …
Margin maximization in attention mechanism
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …
Faster margin maximization rates for generic optimization methods
First-order optimization methods tend to inherently favor certain solutions over others when
minimizing a given training objective with multiple local optima. This phenomenon, known …
The Implicit Bias of Adam on Separable Data
Adam has become one of the most favored optimizers in deep learning problems. Despite its
success in practice, numerous mysteries persist regarding its theoretical understanding. In …
Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks
We study the implicit bias of the general family of steepest descent algorithms, which
includes gradient descent, sign descent and coordinate descent, in deep homogeneous …
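As a side note on the family named above: steepest descent with respect to the l2, l-infinity, and l1 norms recovers gradient descent, sign descent, and coordinate descent respectively. The sketch below shows the three update rules; it is illustrative only and omits the dual-norm step-size scaling used in the formal definition of steepest descent.

    import numpy as np

    def steepest_descent_step(w, grad, lr, norm="l2"):
        """Illustrative steepest-descent update under different norms.

        'l2'   -> gradient descent (move along the negative gradient)
        'linf' -> sign descent (move every coordinate by its gradient sign)
        'l1'   -> coordinate descent (move only the coordinate with largest |grad|)
        """
        if norm == "l2":
            return w - lr * grad
        if norm == "linf":
            return w - lr * np.sign(grad)
        if norm == "l1":
            step = np.zeros_like(w)
            i = np.argmax(np.abs(grad))
            step[i] = np.sign(grad[i])
            return w - lr * step
        raise ValueError(f"unknown norm: {norm}")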
Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling
In this work, we investigate the margin-maximization bias exhibited by gradient-based
algorithms in classifying linearly separable data. We present an in-depth analysis of the …
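As background for the margin-maximization bias mentioned above, the toy sketch below tracks the normalized l2-margin of a linear classifier trained by plain gradient descent on the logistic loss over synthetic separable data. Everything here (data, learning rate, iteration counts) is invented for illustration; it is not the progressive norm rescaling scheme from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linearly separable data: labels come from a random ground-truth direction.
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d))

    def logistic_loss_grad(w):
        m = y * (X @ w)                                  # per-example margins
        coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * m))      # -y * sigmoid(-m), overflow-safe
        return (coeff[:, None] * X).mean(axis=0)

    w = np.zeros(d)
    lr = 1.0
    for t in range(1, 20001):
        w -= lr * logistic_loss_grad(w)
        if t % 5000 == 0:
            # Normalized margin grows slowly toward the max l2-margin of the data.
            print(t, np.min(y * (X @ w)) / np.linalg.norm(w))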