Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …

Lora+: Efficient low rank adaptation of large models

S Hayou, N Ghosh, B Yu - arXiv preprint arXiv:2402.12354, 2024 - arxiv.org
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et
al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension) …
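
A minimal sketch of the idea in this snippet, assuming a PyTorch setup (the variable names and the ratio of 16 are illustrative placeholders, not the paper's tuned values): LoRA adapts a frozen weight W0 as W0 + BA, and LoRA+ assigns the factor B a larger learning rate than A instead of the single shared rate of the original recipe.

import torch

d, r = 1024, 8                                       # embedding width, adapter rank
W0 = torch.randn(d, d)                               # frozen pretrained weight (no grad)
A = torch.nn.Parameter(torch.randn(r, d) / d**0.5)   # standard LoRA init for A
B = torch.nn.Parameter(torch.zeros(d, r))            # B starts at zero, so W0 + B @ A = W0

base_lr, ratio = 1e-4, 16                            # ratio is an illustrative value
opt = torch.optim.AdamW([
    {"params": [A], "lr": base_lr},
    {"params": [B], "lr": base_lr * ratio},          # LoRA+: train B faster than A
])

x = torch.randn(32, d)
loss = (x @ (W0 + B @ A).T).pow(2).mean()            # toy objective through the adapted weight
loss.backward()
opt.step()

The two-parameter-group optimizer is the entire intervention here; everything else matches standard LoRA fine-tuning.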

Steering Deep Feature Learning with Backward Aligned Feature Updates

L Chizat, P Netrapalli - arXiv preprint arXiv:2311.18718, 2023 - arxiv.org
Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters
(HPs) such as initialization scales and learning rates gives only indirect control …

Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning

L Noci, A Meterez, T Hofmann, A Orvieto - arXiv preprint arXiv:2402.17457, 2024 - arxiv.org
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …

Scaling Exponents Across Parameterizations and Optimizers

K Everett, L Xiao, M Wortsman, AA Alemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robust and effective scaling of models from small to large width typically requires the
precise adjustment of many algorithmic and architectural details, such as parameterization …

Infinite Limits of Multi-head Transformer Dynamics

B Bordelon, HT Chaudhry, C Pehlevan - arXiv preprint arXiv:2405.15712, 2024 - arxiv.org
In this work, we analyze various scaling limits of the training dynamics of transformer models
in the feature learning regime. We identify the set of parameterizations that admit well …

A gradient flow on control space with rough initial condition

P Gassiat, F Suciu - arXiv preprint arXiv:2407.11817, 2024 - arxiv.org
We consider the (sub-Riemannian type) control problem of finding a path going from an
initial point $x$ to a target point $y$, by only moving in certain admissible directions. We …

Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling

M Haas, J Xu, V Cevher, LC Vankadara - High-dimensional Learning … - openreview.net
Sharpness Aware Minimization (SAM) enhances performance across various neural
architectures and datasets. As models are continually scaled up to improve performance, a …
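
For background, a sketch of the standard SAM update that this snippet builds on (standard notation; per the paper's title, its contribution is replacing the single global radius $\rho$ with layerwise perturbation scales, whose exact form is given in the paper):

$$\epsilon^\star = \rho\,\frac{\nabla_w L(w)}{\lVert \nabla_w L(w) \rVert_2}, \qquad w \leftarrow w - \eta\,\nabla_w L\bigl(w + \epsilon^\star\bigr).$$

That is, the weights are first perturbed along the normalized gradient to a nearby higher-loss point and the descent step is taken from there; a layerwise variant would use a per-layer radius $\rho_l$ in place of the global $\rho$.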