Larger language models do in-context learning differently
We study how in-context learning (ICL) in language models is affected by semantic priors
versus input-label mappings. We investigate two setups: ICL with flipped labels and ICL with …
Hidden progress in deep learning: SGD learns parities near the computational limit
There is mounting evidence of emergent phenomena in the capabilities of deep learning
methods as we scale up datasets, model sizes, and training times. While there are some …
Provable guarantees for neural networks via gradient feature learning
Neural networks have achieved remarkable empirical performance, while the current
theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent …
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\varepsilon$-Greedy Exploration
This paper provides a theoretical understanding of deep Q-Network (DQN) with
$\varepsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous …
Patch-level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a
per-sample or per-token basis, resulting in a significant reduction in computation. The recently …
Looped ReLU MLPs may be all you need as practical programmable computers
Previous work has demonstrated that attention mechanisms are Turing complete. More
recently, it has been shown that a looped 13-layer Transformer can function as a universal …
A theory of non-linear feature learning with one gradient step in two-layer neural networks
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …
Unraveling the smoothness properties of diffusion models: A Gaussian mixture perspective
Diffusion models have made rapid progress in generating high-quality samples across
various domains. However, a theoretical understanding of the Lipschitz continuity and …
Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the
internal representations harnessed by neural networks and Transformers. Building on recent …
Bilevel coreset selection in continual learning: A new formulation and algorithm
A coreset is a small set that provides a data summary for a large dataset, such that training
solely on the small set achieves performance competitive with training on the full dataset. In …