Mechanics of next token prediction with self-attention
Transformer-based language models are trained on large datasets to predict the next token
given an input sequence. Despite this simple training objective, they have led to …
Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …
On the Power of Convolution Augmented Transformer
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …
Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond
Recent research has shown that Transformers with linear attention are capable of in-context
learning (ICL) by implementing a linear estimator through gradient descent steps. However …
How In-Context Learning Emerges from Training on Unstructured Data: On the Role of Co-Occurrence, Positional Information, and Noise Structures
KC Wibisono, Y Wang - arXiv preprint arXiv:2406.00131, 2024 - arxiv.org
Large language models (LLMs) like transformers have impressive in-context learning (ICL)
capabilities; they can generate predictions for new queries based on input-output …
In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure
KC Wibisono, Y Wang - ICML 2024 Workshop on Theoretical Foundations … - openreview.net
Large language models (LLMs) like transformers have impressive in-context learning (ICL)
capabilities; they can generate predictions for new queries based on input-output …