Mechanics of next token prediction with self-attention

Y Li, Y Huang, ME Ildiz, AS Rawat… - International …, 2024 - proceedings.mlr.press
Transformer-based language models are trained on large datasets to predict the next token
given an input sequence. Despite this simple training objective, they have led to …
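
The next-token objective this snippet refers to can be illustrated with a minimal sketch: predict token t+1 from the prefix ending at t and score it with cross-entropy. The toy embedding table, dimensions, and sequence below are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch of the next-token prediction objective (toy model, no attention):
# assign high probability to tokens[t+1] given the prefix ending at position t.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 50, 16, 8

tokens = rng.integers(0, vocab_size, size=seq_len)       # toy input sequence
E = rng.normal(size=(vocab_size, d_model)) * 0.02         # embedding table
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02     # output projection

def next_token_loss(tokens):
    """Average cross-entropy of predicting tokens[1:] from tokens[:-1]."""
    h = E[tokens[:-1]]                                     # (seq_len-1, d_model)
    logits = h @ W_out                                     # (seq_len-1, vocab_size)
    logits -= logits.max(axis=-1, keepdims=True)           # stabilize exp()
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]                                   # the "next token" labels
    return -log_probs[np.arange(len(targets)), targets].mean()

print(next_token_loss(tokens))
```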

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2402.19442, 2024 - arxiv.org
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
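
The in-context linear regression setting mentioned here is commonly formalized as a prompt of (x_i, y_i) pairs followed by a query token. Below is a minimal sketch of a single multi-head softmax attention readout over such a prompt; the dimensions, random (untrained) weights, and prompt packing are assumptions for illustration, not the paper's exact parameterization or training procedure.

```python
# Sketch: one multi-head softmax attention layer reading an in-context
# linear-regression prompt [(x_1, y_1), ..., (x_n, y_n), x_query].
import numpy as np

rng = np.random.default_rng(1)
n, d, n_heads = 10, 4, 2

w_star = rng.normal(size=d)                 # task vector for this prompt
X = rng.normal(size=(n, d))
y = X @ w_star                              # in-context examples
x_q = rng.normal(size=d)

# Each token stacks (x, y); the query token carries a 0 in the label slot.
Z = np.vstack([np.hstack([X, y[:, None]]), np.hstack([x_q, 0.0])])  # (n+1, d+1)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Z, n_heads):
    d_tok = Z.shape[1]
    out = np.zeros(d_tok)
    for _ in range(n_heads):
        Wq = rng.normal(size=(d_tok, d_tok)) * 0.1
        Wk = rng.normal(size=(d_tok, d_tok)) * 0.1
        Wv = rng.normal(size=(d_tok, d_tok)) * 0.1
        q = Z[-1] @ Wq                                   # query token attends to the prompt
        scores = softmax(Z @ Wk @ q / np.sqrt(d_tok))    # softmax over prompt tokens
        out += scores @ (Z @ Wv)                         # heads summed (concat+project in practice)
    return out

pred = multi_head_attention(Z, n_heads)[-1]  # last coordinate read off as the y-prediction
print(pred)
```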

Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond

J Gu, C Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2405.03251, 2024 - arxiv.org
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …
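
The role of softmax referenced here is to turn raw attention scores into a probability distribution over keys. A minimal, numerically stable version is shown below; this reflects standard practice, not anything specific to this paper's analysis.

```python
# Numerically stable softmax as applied to attention scores: subtracting the
# row maximum leaves the output unchanged but avoids overflow in exp().
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])   # large scores would overflow a naive exp()
print(softmax(scores))                          # each row sums to 1
```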

On the Power of Convolution Augmented Transformer

M Li, X Zhang, Y Huang, S Oymak - arXiv preprint arXiv:2407.05591, 2024 - arxiv.org
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …

Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

Y Li, AS Rawat, S Oymak - arXiv preprint arXiv:2407.10005, 2024 - arxiv.org
Recent research has shown that Transformers with linear attention are capable of in-context
learning (ICL) by implementing a linear estimator through gradient descent steps. However …
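
The claim that linear attention implements a linear estimator via gradient descent can be illustrated concretely: one gradient step from w = 0 on the in-context squared loss gives w = eta * sum_i y_i x_i, so the prediction eta * sum_i y_i <x_i, x_q> is exactly a linear-attention readout with keys x_i, values y_i, and query x_q. The toy check below uses an arbitrary step size and random data as illustrative assumptions.

```python
# One GD step on the in-context least-squares loss vs. a linear-attention readout.
# Both yield eta * sum_i y_i <x_i, x_q>, so the two predictions agree exactly.
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 20, 5, 0.1

w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star                                   # in-context examples (x_i, y_i)
x_q = rng.normal(size=d)                         # query

# (a) one gradient-descent step from w = 0 on L(w) = 1/2 * sum_i (y_i - <w, x_i>)^2
w_one_step = eta * (X.T @ y)
pred_gd = w_one_step @ x_q

# (b) linear attention: scores <x_i, x_q> weight the values y_i, no softmax
pred_linear_attn = eta * (y @ (X @ x_q))

print(np.isclose(pred_gd, pred_linear_attn))     # True: identical predictions
```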

How In-Context Learning Emerges from Training on Unstructured Data: On the Role of Co-Occurrence, Positional Information, and Noise Structures

KC Wibisono, Y Wang - arXiv preprint arXiv:2406.00131, 2024 - arxiv.org
Large language models (LLMs) like transformers have impressive in-context learning (ICL)
capabilities; they can generate predictions for new queries based on input-output …

In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure

KC Wibisono, Y Wang - ICML 2024 Workshop on Theoretical Foundations … - openreview.net
Large language models (LLMs) like transformers have impressive in-context learning (ICL)
capabilities; they can generate predictions for new queries based on input-output …