Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …
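
The snippet breaks off before the paper's contribution, so as a rough illustration of the general idea behind "token selection" in softmax attention (not the paper's max-margin analysis), the NumPy sketch below shows how scaling up the query-key parameters drives the attention weights toward a single token. All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 8, 5                      # embedding dim, number of tokens
X = rng.normal(size=(T, d))      # token embeddings
q = rng.normal(size=d)           # query vector
W = rng.normal(size=(d, d))      # combined query-key weight matrix

def attention_weights(scale):
    """Softmax attention weights of the query over the T tokens."""
    scores = X @ (scale * W) @ q          # attention logits, scaled up or down
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

for scale in [0.1, 1.0, 10.0]:
    w = attention_weights(scale)
    print(f"scale={scale:5.1f}  weights={np.round(w, 3)}  argmax={w.argmax()}")
# As the scale grows, the softmax sharpens and the attention mass
# concentrates on a single "selected" token.
```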

In-context convergence of transformers

Y Huang, Y Cheng, Y Liang - arXiv preprint arXiv:2310.05249, 2023 - arxiv.org
Transformers have recently revolutionized many domains in modern machine learning, and
one salient discovery is their remarkable in-context learning capability, where models can …

JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention

Y Tian, Y Wang, Z Zhang, B Chen, S Du - arXiv preprint arXiv:2310.00535, 2023 - arxiv.org
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to
understand the training procedure of multilayer Transformer architectures. This is achieved …
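
For readers unfamiliar with the two components whose joint dynamics JoMA tracks, here is a generic single transformer block in NumPy: a self-attention sub-layer followed by an MLP sub-layer, each with a residual connection. This is only a standard textbook block to fix terminology, not the JoMA framework or the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(6)

T, d = 6, 16
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(X):
    # Self-attention sub-layer: tokens exchange information.
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    X = X + A @ (X @ Wv)
    # MLP sub-layer: per-token nonlinear transformation.
    return X + np.maximum(X @ W1, 0.0) @ W2

print(block(X).shape)   # (6, 16)
```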

On the optimization and generalization of multi-head attention

P Deora, R Ghaderi, H Taheri… - arXiv preprint arXiv …, 2023 - arxiv.org
The training and generalization dynamics of the Transformer's core mechanism, namely the
attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on …
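
As background for the object being analyzed, the following is a standard multi-head softmax attention forward pass in NumPy. It reflects the common definition (scaled dot-product scores, per-head softmax, concatenation, output projection) rather than the specific model or data setting studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

T, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(T, d_model))                 # input token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(d_model, d_model)) * 0.1    # output projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)   # (T, T) logits
        heads.append(softmax(scores) @ V[:, sl])           # per-head output
    return np.concatenate(heads, axis=-1) @ Wo             # (T, d_model)

print(multi_head_attention(X).shape)   # (6, 16)
```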

Transformers learn higher-order optimization methods for in-context learning: A study with linear models

D Fu, TQ Chen, R Jia, V Sharan - arXiv preprint arXiv:2310.17086, 2023 - arxiv.org
Transformers are remarkably good at in-context learning (ICL), that is, learning from
demonstrations without parameter updates, but how they perform ICL remains a mystery …
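
To make "higher-order optimization methods" concrete in the linear-model setting the title refers to, the sketch below contrasts one first-order (gradient descent) step with one second-order (Newton) step on in-context linear regression data; Newton solves the quadratic problem exactly in a single step. This is a generic illustration, not the paper's probing methodology.

```python
import numpy as np

rng = np.random.default_rng(2)

# In-context demonstrations from a noiseless linear model y = <w_true, x>.
n, d = 32, 4
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true

def gd_step(w, lr=0.05):
    """One gradient-descent step on the mean squared loss."""
    grad = X.T @ (X @ w - y) / n
    return w - lr * grad

def newton_step(w):
    """One Newton step; exact on a quadratic objective."""
    grad = X.T @ (X @ w - y) / n
    hess = X.T @ X / n
    return w - np.linalg.solve(hess, grad)

w0 = np.zeros(d)
print("GD error after 1 step:    ", np.linalg.norm(gd_step(w0) - w_true))
print("Newton error after 1 step:", np.linalg.norm(newton_step(w0) - w_true))
# Newton (a higher-order method) recovers w_true in one step on this
# quadratic problem, while plain gradient descent needs many iterations.
```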

How transformers learn causal structure with gradient descent

E Nichani, A Damian, JD Lee - arXiv preprint arXiv:2402.14735, 2024 - arxiv.org
The incredible success of transformers on sequence modeling tasks can be largely
attributed to the self-attention mechanism, which allows information to be transferred …
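
As a minimal picture of how self-attention "allows information to be transferred" between positions, here is a causally masked single-head attention matrix in NumPy; which entries of this matrix become large during training is, loosely, the structure the paper studies. The construction is generic, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

T, d = 5, 8
X = rng.normal(size=(T, d))                 # token embeddings
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)         # (T, T) attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # disallow attending ahead
scores = np.where(mask, -np.inf, scores)

scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)

print(np.round(A, 2))
# Row t of A says how much information position t pulls from positions <= t.
```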

Mechanics of next token prediction with self-attention

Y Li, Y Huang, ME Ildiz, AS Rawat… - International …, 2024 - proceedings.mlr.press
Transformer-based language models are trained on large datasets to predict the next token
given an input sequence. Despite this simple training objective, they have led to …
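
The training objective described here, cross-entropy next-token prediction, can be written out directly. The toy model below (one causally masked attention layer feeding a linear readout) is only a stand-in to make the shifted-target loss explicit, not the architecture analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

V, d, T = 10, 8, 6                      # vocab size, embed dim, sequence length
tokens = rng.integers(0, V, size=T)     # a toy input sequence

E = rng.normal(size=(V, d)) * 0.1       # token embedding table
Wq, Wk = (rng.normal(size=(d, d)) * 0.1 for _ in range(2))
Wout = rng.normal(size=(d, V)) * 0.1    # readout to vocabulary logits

X = E[tokens]                                        # (T, d)
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
scores = np.where(np.triu(np.ones((T, T), bool), 1), -np.inf, scores)
A = np.exp(scores - scores.max(-1, keepdims=True))
A /= A.sum(-1, keepdims=True)
logits = (A @ X) @ Wout                              # (T, V) next-token logits

# Next-token prediction: position t must predict token t+1.
targets = tokens[1:]
z = logits[:-1] - logits[:-1].max(-1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
loss = -log_probs[np.arange(T - 1), targets].mean()
print("next-token cross-entropy:", loss)
```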

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

H Li, M Wang, S Lu, X Cui, PY Chen - arXiv preprint arXiv …, 2024 - researchgate.net
Transformer-based large language models have displayed impressive in-context learning
capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply …

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2402.19442, 2024 - arxiv.org
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
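
The in-context multi-task linear regression data described in the snippet can be sketched as follows: each prompt stacks (x_i, y_i) demonstration pairs as tokens and appends a query input whose label is withheld. The tokenization below (concatenating x and y into one vector per token) is a common convention and an assumption here; the paper's exact embedding may differ.

```python
import numpy as np

rng = np.random.default_rng(5)

def icl_linear_regression_prompts(d=4, n_demos=8, n_tasks=3):
    """Sample one in-context prompt per task: (x_i, y_i) pairs plus a query x.

    Each 'token' stacks [x, y]; the query token carries y = 0 as a placeholder,
    and the label to predict is returned separately.
    """
    prompts, labels = [], []
    for _ in range(n_tasks):
        w = rng.normal(size=d)                    # task-specific regressor
        X = rng.normal(size=(n_demos + 1, d))     # demos + one query input
        y = X @ w
        tokens = np.concatenate([X, y[:, None]], axis=1)
        tokens[-1, -1] = 0.0                      # hide the query's label
        prompts.append(tokens)                    # (n_demos + 1, d + 1)
        labels.append(y[-1])
    return np.stack(prompts), np.array(labels)

prompts, labels = icl_linear_regression_prompts()
print(prompts.shape, labels.shape)    # (3, 9, 5) (3,)
```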

An information-theoretic analysis of in-context learning

HJ Jeon, JD Lee, Q Lei, B Van Roy - arXiv preprint arXiv:2401.15530, 2024 - arxiv.org
Previous theoretical results pertaining to meta-learning on sequences build on contrived
assumptions and are somewhat convoluted. We introduce new information-theoretic tools …