Larger language models do in-context learning differently

J Wei, J Wei, Y Tay, D Tran, A Webson, Y Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
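The flipped-label setup mentioned in the snippet is easy to illustrate. Below is a minimal sketch, assuming a sentiment-classification task with invented exemplar reviews and an assumed prompt format; it is not the paper's exact protocol.

```python
# Minimal sketch of a flipped-label in-context prompt (assumed format; the
# exemplar reviews are invented for illustration).
EXEMPLARS = [
    ("the movie was wonderful", "positive"),
    ("a dull, lifeless plot", "negative"),
    ("an absolute delight to watch", "positive"),
    ("i want those two hours back", "negative"),
]
FLIP = {"positive": "negative", "negative": "positive"}

def build_prompt(exemplars, query, flip_labels=False):
    """Concatenate (input, label) demonstrations; optionally flip every label."""
    lines = []
    for text, label in exemplars:
        shown = FLIP[label] if flip_labels else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# A model that follows the in-context input-label mapping should now answer
# "negative" for a positive query, overriding its semantic prior.
print(build_prompt(EXEMPLARS, "a charming little film", flip_labels=True))
```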

Hidden progress in deep learning: SGD learns parities near the computational limit

B Barak, B Edelman, S Goel… - Advances in …, 2022 - proceedings.neurips.cc
There is mounting evidence of emergent phenomena in the capabilities of deep learning
methods as we scale up datasets, model sizes, and training times. While there are some …
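For concreteness, here is a hedged sketch of the sparse-parity task referred to in the title: the label is the product of k hidden ±1 coordinates out of n, learned with SGD on a small MLP. The architecture, loss, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np
import torch
import torch.nn as nn

# (n, k)-sparse parity: label = product of k hidden +/-1 coordinates out of n.
n, k, batch = 40, 3, 256
support = np.random.choice(n, k, replace=False)   # hidden relevant coordinates

def sample(m):
    x = np.random.choice([-1.0, 1.0], size=(m, n))
    y = np.prod(x[:, support], axis=1)             # parity label in {-1, +1}
    return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(2001):
    x, y = sample(batch)
    loss = nn.functional.soft_margin_loss(model(x).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        acc = ((model(x).squeeze(-1) > 0).float() * 2 - 1).eq(y).float().mean()
        print(f"step {step}: loss {loss.item():.3f}, batch accuracy {acc.item():.2f}")
```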

Provable guarantees for neural networks via gradient feature learning

Z Shi, J Wei, Y Liang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Neural networks have achieved remarkable empirical performance, while the current
theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent …

On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\varepsilon$-Greedy Exploration

S Zhang, H Li, M Wang, M Liu… - Advances in …, 2024 - proceedings.neurips.cc
This paper provides a theoretical understanding of deep Q-Network (DQN) with the
$\varepsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous …
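The $\varepsilon$-greedy rule analyzed in the paper is standard; here is a minimal sketch of the action-selection step, with the Q-values and decay schedule as illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore uniformly at random, otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example with an assumed exponentially decaying exploration schedule.
for step, q in enumerate([[0.1, 0.4], [0.3, 0.2], [0.0, 1.2]]):
    eps = max(0.05, 0.99 ** step)
    print(step, epsilon_greedy(q, eps))
```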

Patch-level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks

MNR Chowdhury, S Zhang, M Wang… - International …, 2023 - proceedings.mlr.press
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a
per-sample or per-token basis, resulting in significant computation reduction. The recently …
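Patch-level routing can be sketched as follows: each patch embedding is sent to the single expert with the highest gating score. The dimensions, the top-1 gating rule, and the linear experts are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class PatchMoE(nn.Module):
    """Toy mixture-of-experts layer with top-1 routing per patch."""
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, patches):                  # patches: (batch, num_patches, dim)
        scores = self.gate(patches)              # (batch, num_patches, num_experts)
        top1 = scores.argmax(dim=-1)             # chosen expert index per patch
        out = torch.zeros_like(patches)
        for e, expert in enumerate(self.experts):
            mask = top1 == e                     # patches routed to expert e
            if mask.any():
                out[mask] = expert(patches[mask])
        return out

print(PatchMoE()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```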

Looped ReLU MLPs may be all you need as practical programmable computers

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2410.09375, 2024 - arxiv.org
Previous work has demonstrated that attention mechanisms are Turing complete. More
recently, it has been shown that a looped 13-layer Transformer can function as a universal …
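The "looped" idea amounts to applying one fixed block repeatedly, the way a program is executed step by step. A minimal weight-tied sketch is below; the dimensions and number of iterations are arbitrary, and this is not the paper's universal-computer construction.

```python
import torch
import torch.nn as nn

class LoopedMLP(nn.Module):
    """Apply the same two-layer ReLU block to the state for a fixed number of steps."""
    def __init__(self, dim=32, steps=10):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.steps = steps

    def forward(self, state):
        for _ in range(self.steps):              # weight-tied iterations
            state = self.block(state)
        return state

print(LoopedMLP()(torch.randn(4, 32)).shape)     # torch.Size([4, 32])
```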

A theory of non-linear feature learning with one gradient step in two-layer neural networks

B Moniri, D Lee, H Hassani, E Dobriban - arXiv preprint arXiv:2310.07891, 2023 - arxiv.org
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …
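The "one gradient step" setting can be made concrete: take a single full-batch gradient step on the first-layer weights of a two-layer ReLU network and examine the updated features. The target function, widths, and step size below are illustrative assumptions.

```python
import torch

d, m, n, eta = 50, 100, 500, 1.0
X = torch.randn(n, d)
y = torch.sin(X[:, 0])                               # stand-in single-index target

W = (torch.randn(m, d) / d ** 0.5).requires_grad_()  # first-layer weights
a = torch.randn(m) / m ** 0.5                        # second layer, held fixed

loss = ((torch.relu(X @ W.T) @ a - y) ** 2).mean()
loss.backward()

with torch.no_grad():
    W_one_step = W - eta * W.grad                    # features after one gradient step
    print("feature update norm:", (W_one_step - W).norm().item())
```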

Unraveling the smoothness properties of diffusion models: A Gaussian mixture perspective

Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16418, 2024 - arxiv.org
Diffusion models have made rapid progress in generating high-quality samples across
various domains. However, a theoretical understanding of the Lipschitz continuity and …
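A small sketch of the data model in the title: for a Gaussian mixture, the score (gradient of the log-density) that diffusion models estimate has a closed form as a responsibility-weighted average. The 1-D mixture parameters below are arbitrary illustrative choices.

```python
import numpy as np

means, weights, sigma = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), 1.0

def score(x):
    """Score of a 1-D Gaussian mixture: sum_k r_k(x) * (mu_k - x) / sigma^2."""
    logits = -0.5 * ((x[:, None] - means) / sigma) ** 2 + np.log(weights)
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)            # responsibilities r_k(x)
    return (r * (means - x[:, None])).sum(axis=1) / sigma ** 2

print(score(np.linspace(-4.0, 4.0, 5)))
```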

Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic

J Gu, C Li, Y Liang, Z Shi, Z Song… - arXiv preprint arXiv …, 2024 - openreview.net
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
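The modular-arithmetic task behind this line of work is easy to write down: inputs are pairs (a, b), the label is (a + b) mod p, and addition becomes linear in a cosine/sine (Fourier) basis over Z_p. The prime and the explicit feature map below are illustrative assumptions.

```python
import numpy as np

p = 7
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p          # modular addition targets

def fourier_features(x, p):
    """cos/sin features at every nonzero frequency on Z_p."""
    k = np.arange(1, p)
    ang = 2 * np.pi * np.outer(x, k) / p
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=1)

print(pairs.shape, labels[:10])
print(fourier_features(pairs[:, 0], p).shape)     # (p*p, 2*(p-1))
```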

Bilevel coreset selection in continual learning: A new formulation and algorithm

J Hao, K Ji, M Liu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
A coreset is a small set that provides a data summary for a large dataset, such that training
solely on the small set achieves competitive performance compared with training on the large dataset. In …
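A generic bilevel coreset objective can be written as follows; the notation (sample weights w, budget K, loss ℓ) is assumed for illustration and is not necessarily the paper's exact formulation: the outer problem picks at most K weighted samples so that the model trained on them in the inner problem performs well on the full dataset.

```latex
\begin{aligned}
\min_{w \ge 0,\ \|w\|_0 \le K} \quad & \sum_{i=1}^{n} \ell\big(f_{\theta^*(w)}(x_i),\, y_i\big) \\
\text{s.t.} \quad & \theta^*(w) \in \operatorname*{arg\,min}_{\theta} \sum_{i=1}^{n} w_i\, \ell\big(f_\theta(x_i),\, y_i\big)
\end{aligned}
```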