Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …

Lora+: Efficient low rank adaptation of large models

S Hayou, N Ghosh, B Yu - arXiv preprint arXiv:2402.12354, 2024 - arxiv.org
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et
al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension) …
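
A minimal sketch of the idea in this snippet, assuming a PyTorch setup (the variable names and the ratio of 16 are illustrative placeholders, not the paper's tuned values): LoRA adapts a frozen weight W0 as W0 + BA, and LoRA+ assigns the factor B a larger learning rate than A instead of the single shared rate of the original recipe.

import torch

d, r = 1024, 8                                       # embedding width, adapter rank
W0 = torch.randn(d, d)                               # frozen pretrained weight (no grad)
A = torch.nn.Parameter(torch.randn(r, d) / d**0.5)   # standard LoRA init for A
B = torch.nn.Parameter(torch.zeros(d, r))            # B starts at zero, so W0 + B @ A = W0

base_lr, ratio = 1e-4, 16                            # ratio is an illustrative value
opt = torch.optim.AdamW([
    {"params": [A], "lr": base_lr},
    {"params": [B], "lr": base_lr * ratio},          # LoRA+: train B faster than A
])

x = torch.randn(32, d)
loss = (x @ (W0 + B @ A).T).pow(2).mean()            # toy objective through the adapted weight
loss.backward()
opt.step()

The two-parameter-group optimizer is the entire intervention here; everything else matches standard LoRA fine-tuning.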

Steering Deep Feature Learning with Backward Aligned Feature Updates

L Chizat, P Netrapalli - arXiv preprint arXiv:2311.18718, 2023 - arxiv.org
Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters
(HPs) such as initialization scales and learning rates gives only indirect control …

Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning

L Noci, A Meterez, T Hofmann, A Orvieto - arXiv preprint arXiv:2402.17457, 2024 - arxiv.org
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …

Scaling Exponents Across Parameterizations and Optimizers

K Everett, L Xiao, M Wortsman, AA Alemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robust and effective scaling of models from small to large width typically requires the
precise adjustment of many algorithmic and architectural details, such as parameterization …

Infinite Limits of Multi-head Transformer Dynamics

B Bordelon, HT Chaudhry, C Pehlevan - arXiv preprint arXiv:2405.15712, 2024 - arxiv.org
In this work, we analyze various scaling limits of the training dynamics of transformer models
in the feature learning regime. We identify the set of parameterizations that admit well …

A gradient flow on control space with rough initial condition

P Gassiat, F Suciu - arXiv preprint arXiv:2407.11817, 2024 - arxiv.org
We consider the (sub-Riemannian type) control problem of finding a path going from an
initial point $x$ to a target point $y$, by only moving in certain admissible directions. We …

Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling

M Haas, J Xu, V Cevher, LC Vankadara - High-dimensional Learning … - openreview.net
Sharpness Aware Minimization (SAM) enhances performance across various neural
architectures and datasets. As models are continually scaled up to improve performance, a …
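
For background, a sketch of the standard SAM update that this snippet builds on (standard notation; per the paper's title, its contribution is replacing the single global radius $\rho$ with layerwise perturbation scales, whose exact form is given in the paper):

$$\epsilon^\star = \rho\,\frac{\nabla_w L(w)}{\lVert \nabla_w L(w) \rVert_2}, \qquad w \leftarrow w - \eta\,\nabla_w L\bigl(w + \epsilon^\star\bigr).$$

That is, the weights are first perturbed along the normalized gradient to a nearby higher-loss point and the descent step is taken from there; a layerwise variant would use a per-layer radius $\rho_l$ in place of the global $\rho$.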