Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …
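
The snippet breaks off before the paper's contribution, so as a rough illustration of the general idea behind "token selection" in softmax attention (not the paper's max-margin analysis), the NumPy sketch below shows how scaling up the query-key parameters drives the attention weights toward a single token. All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 8, 5                      # embedding dim, number of tokens
X = rng.normal(size=(T, d))      # token embeddings
q = rng.normal(size=d)           # query vector
W = rng.normal(size=(d, d))      # combined query-key weight matrix

def attention_weights(scale):
    """Softmax attention weights of the query over the T tokens."""
    scores = X @ (scale * W) @ q          # attention logits, scaled up or down
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

for scale in [0.1, 1.0, 10.0]:
    w = attention_weights(scale)
    print(f"scale={scale:5.1f}  weights={np.round(w, 3)}  argmax={w.argmax()}")
# As the scale grows, the softmax sharpens and the attention mass
# concentrates on a single "selected" token.
```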

In-context convergence of transformers

Y Huang, Y Cheng, Y Liang - arXiv preprint arXiv:2310.05249, 2023 - arxiv.org
Transformers have recently revolutionized many domains in modern machine learning, and
one salient discovery is their remarkable in-context learning capability, where models can …

JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention

Y Tian, Y Wang, Z Zhang, B Chen, S Du - arXiv preprint arXiv:2310.00535, 2023 - arxiv.org
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to
understand the training procedure of multilayer Transformer architectures. This is achieved …
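
For readers unfamiliar with the two components whose joint dynamics JoMA tracks, here is a generic single transformer block in NumPy: a self-attention sub-layer followed by an MLP sub-layer, each with a residual connection. This is only a standard textbook block to fix terminology, not the JoMA framework or the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(6)

T, d = 6, 16
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(X):
    # Self-attention sub-layer: tokens exchange information.
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    X = X + A @ (X @ Wv)
    # MLP sub-layer: per-token nonlinear transformation.
    return X + np.maximum(X @ W1, 0.0) @ W2

print(block(X).shape)   # (6, 16)
```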

On the optimization and generalization of multi-head attention

P Deora, R Ghaderi, H Taheri… - arXiv preprint arXiv …, 2023 - arxiv.org
The training and generalization dynamics of the Transformer's core mechanism, namely the
attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on …
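
As background for the object being analyzed, the following is a standard multi-head softmax attention forward pass in NumPy. It reflects the common definition (scaled dot-product scores, per-head softmax, concatenation, output projection) rather than the specific model or data setting studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

T, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(T, d_model))                 # input token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(d_model, d_model)) * 0.1    # output projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)   # (T, T) logits
        heads.append(softmax(scores) @ V[:, sl])           # per-head output
    return np.concatenate(heads, axis=-1) @ Wo             # (T, d_model)

print(multi_head_attention(X).shape)   # (6, 16)
```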

Transformers learn higher-order optimization methods for in-context learning: A study with linear models

D Fu, TQ Chen, R Jia, V Sharan - arXiv preprint arXiv:2310.17086, 2023 - arxiv.org
Transformers are remarkably good at in-context learning (ICL), that is, learning from
demonstrations without parameter updates, but how they perform ICL remains a mystery …
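
To make "higher-order optimization methods" concrete in the linear-model setting the title refers to, the sketch below contrasts one first-order (gradient descent) step with one second-order (Newton) step on in-context linear regression data; Newton solves the quadratic problem exactly in a single step. This is a generic illustration, not the paper's probing methodology.

```python
import numpy as np

rng = np.random.default_rng(2)

# In-context demonstrations from a noiseless linear model y = <w_true, x>.
n, d = 32, 4
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true

def gd_step(w, lr=0.05):
    """One gradient-descent step on the mean squared loss."""
    grad = X.T @ (X @ w - y) / n
    return w - lr * grad

def newton_step(w):
    """One Newton step; exact on a quadratic objective."""
    grad = X.T @ (X @ w - y) / n
    hess = X.T @ X / n
    return w - np.linalg.solve(hess, grad)

w0 = np.zeros(d)
print("GD error after 1 step:    ", np.linalg.norm(gd_step(w0) - w_true))
print("Newton error after 1 step:", np.linalg.norm(newton_step(w0) - w_true))
# Newton (a higher-order method) recovers w_true in one step on this
# quadratic problem, while plain gradient descent needs many iterations.
```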

How transformers learn causal structure with gradient descent

E Nichani, A Damian, JD Lee - arXiv preprint arXiv:2402.14735, 2024 - arxiv.org
The incredible success of transformers on sequence modeling tasks can be largely
attributed to the self-attention mechanism, which allows information to be transferred …
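
As a minimal picture of how self-attention "allows information to be transferred" between positions, here is a causally masked single-head attention matrix in NumPy; which entries of this matrix become large during training is, loosely, the structure the paper studies. The construction is generic, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

T, d = 5, 8
X = rng.normal(size=(T, d))                 # token embeddings
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)         # (T, T) attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # disallow attending ahead
scores = np.where(mask, -np.inf, scores)

scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)

print(np.round(A, 2))
# Row t of A says how much information position t pulls from positions <= t.
```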

Mechanics of next token prediction with self-attention

Y Li, Y Huang, ME Ildiz, AS Rawat… - International …, 2024 - proceedings.mlr.press
Transformer-based language models are trained on large datasets to predict the next token
given an input sequence. Despite this simple training objective, they have led to …
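
The training objective described here, cross-entropy next-token prediction, can be written out directly. The toy model below (one causally masked attention layer feeding a linear readout) is only a stand-in to make the shifted-target loss explicit, not the architecture analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

V, d, T = 10, 8, 6                      # vocab size, embed dim, sequence length
tokens = rng.integers(0, V, size=T)     # a toy input sequence

E = rng.normal(size=(V, d)) * 0.1       # token embedding table
Wq, Wk = (rng.normal(size=(d, d)) * 0.1 for _ in range(2))
Wout = rng.normal(size=(d, V)) * 0.1    # readout to vocabulary logits

X = E[tokens]                                        # (T, d)
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
scores = np.where(np.triu(np.ones((T, T), bool), 1), -np.inf, scores)
A = np.exp(scores - scores.max(-1, keepdims=True))
A /= A.sum(-1, keepdims=True)
logits = (A @ X) @ Wout                              # (T, V) next-token logits

# Next-token prediction: position t must predict token t+1.
targets = tokens[1:]
z = logits[:-1] - logits[:-1].max(-1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
loss = -log_probs[np.arange(T - 1), targets].mean()
print("next-token cross-entropy:", loss)
```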

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

H Li, M Wang, S Lu, X Cui, PY Chen - arXiv preprint arXiv …, 2024 - researchgate.net
Transformer-based large language models have displayed impressive in-context learning
capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply …

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2402.19442, 2024 - arxiv.org
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
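
The in-context multi-task linear regression data described in the snippet can be sketched as follows: each prompt stacks (x_i, y_i) demonstration pairs as tokens and appends a query input whose label is withheld. The tokenization below (concatenating x and y into one vector per token) is a common convention and an assumption here; the paper's exact embedding may differ.

```python
import numpy as np

rng = np.random.default_rng(5)

def icl_linear_regression_prompts(d=4, n_demos=8, n_tasks=3):
    """Sample one in-context prompt per task: (x_i, y_i) pairs plus a query x.

    Each 'token' stacks [x, y]; the query token carries y = 0 as a placeholder,
    and the label to predict is returned separately.
    """
    prompts, labels = [], []
    for _ in range(n_tasks):
        w = rng.normal(size=d)                    # task-specific regressor
        X = rng.normal(size=(n_demos + 1, d))     # demos + one query input
        y = X @ w
        tokens = np.concatenate([X, y[:, None]], axis=1)
        tokens[-1, -1] = 0.0                      # hide the query's label
        prompts.append(tokens)                    # (n_demos + 1, d + 1)
        labels.append(y[-1])
    return np.stack(prompts), np.array(labels)

prompts, labels = icl_linear_regression_prompts()
print(prompts.shape, labels.shape)    # (3, 9, 5) (3,)
```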

An information-theoretic analysis of in-context learning

HJ Jeon, JD Lee, Q Lei, B Van Roy - arXiv preprint arXiv:2401.15530, 2024 - arxiv.org
Previous theoretical results pertaining to meta-learning on sequences build on contrived
assumptions and are somewhat convoluted. We introduce new information-theoretic tools …