Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
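As a minimal sketch of the flipped-label setup (the task, prompt format, and label names below are hypothetical illustrations, not the paper's exact protocol):

    # Build an in-context prompt whose demonstration labels are optionally flipped.
    demos = [
        ("The movie was fantastic.", "positive"),
        ("I wasted two hours of my life.", "negative"),
        ("A heartfelt, beautifully acted film.", "positive"),
    ]

    def build_prompt(demos, query, flip=False):
        """Concatenate demonstrations, flipping every label when flip=True."""
        flipped = {"positive": "negative", "negative": "positive"}
        lines = []
        for text, label in demos:
            shown = flipped[label] if flip else label
            lines.append(f"Review: {text}\nSentiment: {shown}")
        lines.append(f"Review: {query}\nSentiment:")
        return "\n\n".join(lines)

    print(build_prompt(demos, "An instant classic.", flip=True))

A model that leans on semantic priors would still answer "positive" for this query, while a model that follows the in-context input-label mapping would answer "negative".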
Scan and snap: Understanding training dynamics and token composition in 1-layer transformer
Transformer architecture has shown impressive performance in multiple research domains
and has become the backbone of many neural network models. However, there is limited …
Max-margin token selection in attention mechanism
D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Attention mechanism is a central component of the transformer architecture which led to the
phenomenal success of large language models. However, the theoretical principles …
What can a single attention layer learn? A study through the random features lens
Attention layers, which map a sequence of inputs to a sequence of outputs, are core
building blocks of the Transformer architecture which has achieved significant …
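For concreteness, a single softmax attention layer maps a length-n sequence of d-dimensional tokens to another length-n sequence; the NumPy sketch below uses one head with randomly drawn query/key/value weights, in the spirit of the random-features viewpoint (shapes and scaling chosen purely for illustration):

    import numpy as np

    # One single-head softmax attention layer applied to a random token sequence.
    rng = np.random.default_rng(0)
    n, d = 5, 8
    X = rng.standard_normal((n, d))
    W_q, W_k, W_v = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)          # n x n attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # row-wise softmax
    output = weights @ (X @ W_v)                           # n x d output sequence
    print(output.shape)                                    # (5, 8)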
White-box transformers via sparse rate reduction
In this paper, we contend that the objective of representation learning is to compress and
transform the distribution of the data, say sets of tokens, towards a mixture of low …
On the role of attention in prompt-tuning
Prompt-tuning is an emerging strategy to adapt large language models (LLM) to
downstream tasks by learning a (soft-) prompt parameter from data. Despite its success in …
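A minimal sketch of the soft-prompt idea (names and shapes are illustrative and not tied to any particular library): a small matrix of trainable prompt embeddings is prepended to the frozen token embeddings before they enter the model, and only that matrix is updated during tuning.

    import numpy as np

    # Prepend a trainable soft prompt to frozen token embeddings (illustration only).
    d_model, prompt_len, seq_len = 16, 4, 10

    token_embeddings = np.random.randn(seq_len, d_model)   # produced by the frozen model
    soft_prompt = np.zeros((prompt_len, d_model))          # the only trainable parameter

    model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
    print(model_input.shape)                               # (14, 16)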
Transformers as support vector machines
Since its inception in "Attention Is All You Need", transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
In-context convergence of transformers
Transformers have recently revolutionized many domains in modern machine learning and
one salient discovery is their remarkable in-context learning capability, where models can …
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\varepsilon$-Greedy Exploration
This paper provides a theoretical understanding of deep Q-Network (DQN) with the
$\varepsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous …
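For reference, $\varepsilon$-greedy exploration itself is simple to state: with probability $\varepsilon$ the agent takes a random action, otherwise it acts greedily with respect to its current Q-value estimates. In the sketch below, q_values stands in for a Q-network's output on one state (an illustration of the exploration rule, not of the paper's analysis):

    import numpy as np

    # Epsilon-greedy action selection over a vector of Q-value estimates.
    def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))   # explore: uniform random action
        return int(np.argmax(q_values))               # exploit: greedy action

    q_values = np.array([0.1, 0.7, 0.3, 0.2])
    actions = [epsilon_greedy(q_values, epsilon=0.1) for _ in range(5)]
    print(actions)   # mostly action 1, occasionally a random alternative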
JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to
understand the training procedure of multilayer Transformer architectures. This is achieved …