Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
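As a rough illustration of the flipped-label setup, the sketch below assembles a few-shot prompt in which each demonstration's label is inverted; the example texts and label names are placeholders, not the paper's evaluation data.

```python
# Minimal sketch: building a flipped-label ICL prompt for a binary
# sentiment task. The demo texts and label names are illustrative
# placeholders, not drawn from the paper.

def build_prompt(demos, query, flip_labels=False):
    """Format few-shot demonstrations; optionally flip the label on each one."""
    flip = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in demos:
        shown = flip[label] if flip_labels else label
        lines.append(f"Input: {text}\nLabel: {shown}")
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

demos = [
    ("A delightful, heartfelt film.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
]
print(build_prompt(demos, "An instant classic.", flip_labels=True))
```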
Tensor attention training: Provably efficient learning of higher-order transformers
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …
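A schematic NumPy sketch of what higher-order attention means here: each query scores pairs of positions drawn from two key/value streams, giving an n x n x n score tensor instead of the usual n x n matrix. This is a naive O(n^3) reference, not the paper's exact formulation or its efficient training algorithm.

```python
import numpy as np

# Naive third-order ("tensor") attention: each query attends jointly to
# *pairs* of positions from two key/value streams.

def tensor_attention(Q, K1, K2, V1, V2):
    n, d = Q.shape
    # Trilinear scores: s[i, j, k] = <Q[i], K1[j] * K2[k]> / sqrt(d)
    scores = np.einsum("id,jd,kd->ijk", Q, K1, K2) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=(1, 2), keepdims=True))
    weights /= weights.sum(axis=(1, 2), keepdims=True)   # softmax over pairs
    # Pair values as elementwise products of the two value streams.
    pair_vals = np.einsum("jd,kd->jkd", V1, V2)
    return np.einsum("ijk,jkd->id", weights, pair_vals)

n, d = 8, 4
out = tensor_attention(*(np.random.randn(n, d) for _ in range(5)))
print(out.shape)  # (8, 4)
```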
Conv-basis: A new paradigm for efficient attention inference and gradient computation in transformers
The self-attention mechanism is the key to the success of transformers in recent Large
Language Models (LLMs). However, the quadratic computational cost $O(n^2)$ in the …
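For reference, a minimal NumPy implementation of standard single-head attention, showing the n x n score matrix that is the source of the quadratic cost; the conv-basis decomposition proposed in the paper is not reproduced here.

```python
import numpy as np

# Reference single-head attention: the n x n score matrix S is what makes
# the cost O(n^2) in sequence length n.

def attention(Q, K, V):
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)                 # n x n scores: the quadratic bottleneck
    S = np.exp(S - S.max(axis=1, keepdims=True))
    A = S / S.sum(axis=1, keepdims=True)     # row-wise softmax
    return A @ V

n, d = 16, 8
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(attention(Q, K, V).shape)  # (16, 8)
```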
Multi-layer transformers gradient can be approximated in almost linear time
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
HSR-enhanced sparse attention acceleration
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
applications, but their performance on long-context tasks is often limited by the …
Is a picture worth a thousand words? Delving into spatial reasoning for vision language models
Large language models (LLMs) and vision-language models (VLMs) have demonstrated
remarkable performance across a wide range of tasks and domains. Despite this promise …
Looped ReLU MLPs may be all you need as practical programmable computers
Previous work has demonstrated that attention mechanisms are Turing complete. More
recently, it has been shown that a looped 13-layer Transformer can function as a universal …
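A minimal sketch of the looped idea: one fixed two-layer ReLU block applied repeatedly to a state vector, so computation depth comes from iteration rather than from stacking distinct layers. Weights and sizes below are random placeholders; the paper constructs specific weights so that each pass emulates a program step.

```python
import numpy as np

# Schematic "looped" MLP: a single two-layer ReLU block reused in a loop,
# like a clocked circuit. Random weights here are placeholders only.

rng = np.random.default_rng(0)
d, h = 16, 32
W1, b1 = rng.normal(size=(h, d)) / np.sqrt(d), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)) / np.sqrt(h), np.zeros(d)

def mlp_step(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2  # one ReLU block

x = rng.normal(size=d)
for _ in range(10):          # depth via iteration of the same block
    x = mlp_step(x)
print(x.shape)  # (16,)
```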
Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …
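To make the connection concrete, the sketch below runs explicit multi-step gradient descent on synthetic in-context linear-regression examples, the procedure that looped transformers are analyzed as emulating; it executes the steps directly rather than inside a transformer.

```python
import numpy as np

# Multi-step gradient descent on in-context linear-regression examples
# (x_i, y_i): one loop iteration corresponds to one GD step on the
# in-context least-squares objective. Data here is synthetic.

rng = np.random.default_rng(1)
d, n_ctx = 5, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n_ctx, d))
y = X @ w_true

w, lr = np.zeros(d), 0.1
for step in range(50):
    grad = X.T @ (X @ w - y) / n_ctx   # gradient of mean squared error
    w -= lr * grad

x_query = rng.normal(size=d)
print(float(x_query @ w), float(x_query @ w_true))  # prediction vs. target
```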
Differentially private attention computation
Large language models (LLMs) have had a profound impact on numerous aspects of daily
life including natural language processing, content generation, research methodologies and …
Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
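As a toy illustration of the Fourier view of modular arithmetic, the sketch below embeds residues as complex phases so that addition mod p becomes phase multiplication; it shows the kind of Fourier structure studied, not a trained network's actual circuit.

```python
import numpy as np

# Fourier-feature view of modular addition: embedding a and b as phases
# e^{2*pi*i*k*a/p} turns (a + b) mod p into a product of phases, and the
# result is read out by matching against candidate phases.

p = 23
freqs = np.arange(1, 6)                    # a few frequencies k

def embed(x):
    return np.exp(2j * np.pi * freqs * x / p)

def mod_add(a, b):
    combined = embed(a) * embed(b)         # phases add: k*(a+b)/p
    candidates = np.stack([embed(c) for c in range(p)])
    # Score each candidate c by how well its phases match the combined ones.
    scores = (candidates.conj() * combined).real.sum(axis=1)
    return int(np.argmax(scores))

print(mod_add(17, 19), (17 + 19) % p)      # both print 13
```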