Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …
In-context convergence of transformers
Transformers have recently revolutionized many domains in modern machine learning, and
one salient discovery is their remarkable in-context learning capability, where models can …
JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to
understand the training procedure of multilayer Transformer architectures. This is achieved …
On the optimization and generalization of multi-head attention
The training and generalization dynamics of the Transformer's core mechanism, namely the
attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on …
Transformers learn higher-order optimization methods for in-context learning: A study with linear models
Transformers are remarkably good at in-context learning (ICL), learning from
demonstrations without parameter updates, but how they perform ICL remains a mystery …
How transformers learn causal structure with gradient descent
The incredible success of transformers on sequence modeling tasks can be largely
attributed to the self-attention mechanism, which allows information to be transferred …
Mechanics of next token prediction with self-attention
Transformer-based language models are trained on large datasets to predict the next token
given an input sequence. Despite this simple training objective, they have led to …
How do nonlinear transformers learn and generalize in in-context learning?
Transformer-based large language models have displayed impressive in-context learning
capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply …
Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
An information-theoretic analysis of in-context learning
Previous theoretical results pertaining to meta-learning on sequences build on contrived
assumptions and are somewhat convoluted. We introduce new information-theoretic tools …