Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
Hidden progress in deep learning: SGD learns parities near the computational limit
There is mounting evidence of emergent phenomena in the capabilities of deep learning
methods as we scale up datasets, model sizes, and training times. While there are some …
Provable guarantees for neural networks via gradient feature learning
Neural networks have achieved remarkable empirical performance, while the current
theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent …
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\varepsilon$-Greedy Exploration
This paper provides a theoretical understanding of deep Q-Network (DQN) with
$\varepsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous …
Patch-level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a
per-sample or per-token basis, resulting in a significant reduction in computation. The recently …
Looped ReLU MLPs may be all you need as practical programmable computers
Previous work has demonstrated that attention mechanisms are Turing complete. More
recently, it has been shown that a looped 13-layer Transformer can function as a universal …
A theory of non-linear feature learning with one gradient step in two-layer neural networks
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …
Unraveling the smoothness properties of diffusion models: A Gaussian mixture perspective
Diffusion models have made rapid progress in generating high-quality samples across
various domains. However, a theoretical understanding of the Lipschitz continuity and …
Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
Bilevel coreset selection in continual learning: A new formulation and algorithm
A coreset is a small set that provides a data summary for a large dataset, such that training
solely on the small set achieves performance competitive with training on the full dataset. In …