Dynamics of finite width kernel and prediction fluctuations in mean field neural networks
B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …
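As a rough, hypothetical illustration of the object under study, the sketch below trains an ensemble of independently seeded two-layer networks in the mean-field parameterization (1/width readout, learning rate scaled up by width) and measures how the variance of a test prediction across seeds shrinks as width grows; all names and settings are illustrative assumptions, not the authors' code.

```python
# Illustrative probe of finite-width prediction fluctuations in a
# mean-field-parameterized two-layer network (not the paper's code).
import torch

def train_once(width, seed, steps=300, lr=0.1):
    torch.manual_seed(seed)
    x = torch.linspace(-1, 1, 32).unsqueeze(1)
    y = torch.sin(3 * x)
    W = torch.randn(width, 1, requires_grad=True)
    a = torch.randn(1, width, requires_grad=True)
    for _ in range(steps):
        f = (a @ torch.tanh(W @ x.T)) / width        # 1/width readout: mean-field scaling
        loss = ((f.T - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for p in (W, a):
                p -= lr * width * p.grad             # lr * width keeps feature learning O(1)
                p.grad = None
    x_test = torch.tensor([[0.5]])
    return ((a @ torch.tanh(W @ x_test.T)) / width).item()

for width in (128, 512, 2048):
    preds = torch.tensor([train_once(width, s) for s in range(10)])
    print(width, preds.var().item())                 # fluctuations should shrink with width
```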
LoRA+: Efficient low rank adaptation of large models
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et
al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension) …
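The fix proposed by LoRA+, as the abstract describes it, is to train the two adapter factors with different learning rates. A minimal sketch, assuming a standard LoRA layer; the class name and the ratio of 16 are illustrative, not taken from the paper:

```python
# LoRA layer plus a LoRA+-style optimizer setup: the zero-initialized B factor
# gets a larger learning rate than A (ratio 16 is an illustrative choice).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024))
lr = 1e-4
opt = torch.optim.AdamW([
    {"params": [layer.A], "lr": lr},
    {"params": [layer.B], "lr": 16 * lr},                      # LoRA+: larger step on B
])
```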
Steering Deep Feature Learning with Backward Aligned Feature Updates
L Chizat, P Netrapalli - arXiv preprint arXiv:2311.18718, 2023 - arxiv.org
Deep learning succeeds by doing hierarchical feature learning, yet tuning Hyper-
Parameters (HPs) such as initialization scales, learning rates, etc., gives only indirect control …
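The quantity being steered here is, roughly, how far each layer's features move per optimizer step. A hypothetical diagnostic in that spirit (my assumption of a reasonable probe, not the paper's method) compares activations before and after one SGD step:

```python
# Probe: relative change of each hidden layer's activations after one SGD
# step, a proxy for how much feature learning the chosen HPs induce.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 256), nn.Tanh(),
                      nn.Linear(256, 256), nn.Tanh(),
                      nn.Linear(256, 1))
x, y = torch.randn(128, 64), torch.randn(128, 1)

def hidden_features(m, x):
    hs, h = [], x
    for layer in m:
        h = layer(h)
        if isinstance(layer, nn.Tanh):
            hs.append(h.detach())
    return hs

before = hidden_features(model, x)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
((model(x) - y) ** 2).mean().backward()
opt.step()
after = hidden_features(model, x)
for i, (h0, h1) in enumerate(zip(before, after)):
    print(f"layer {i}: ||dh|| / ||h|| = {(h1 - h0).norm() / h0.norm():.3e}")
```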
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …
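The transfer phenomenon is easy to probe empirically. A sketch under the assumption of a mean-field/μP-style setup (1/width readout multiplier, O(1) readout init, learning rate scaled by width); the task and sweep values are illustrative:

```python
# Sweep base learning rates at several widths; under the muP-style scaling
# below, the best base lr should be roughly width-independent.
import torch
import torch.nn as nn

def final_loss(width, base_lr, steps=200, seed=0):
    torch.manual_seed(seed)
    x, y = torch.randn(256, 32), torch.randn(256, 1)
    hidden = nn.Linear(32, width)
    readout = nn.Linear(width, 1, bias=False)
    nn.init.normal_(readout.weight, std=1.0)            # O(1) entries; 1/width applied below
    params = list(hidden.parameters()) + list(readout.parameters())
    opt = torch.optim.SGD(params, lr=base_lr * width)   # lr scaled up with width
    for _ in range(steps):
        pred = readout(torch.relu(hidden(x))) / width   # 1/width readout multiplier
        loss = ((pred - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item() if torch.isfinite(loss) else float("inf")

for width in (256, 1024, 4096):
    best = min((1e-3, 1e-2, 1e-1, 1.0), key=lambda lr: final_loss(width, lr))
    print(width, "best base lr:", best)
```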
Scaling Exponents Across Parameterizations and Optimizers
Robust and effective scaling of models from small to large width typically requires the
precise adjustment of many algorithmic and architectural details, such as parameterization …
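Those details are often summarized as per-layer powers of width governing the initialization scale, a forward multiplier, and the learning rate. A schematic encoding of that bookkeeping; the exponent values shown sketch a μP-like scheme under Adam and are illustrative, not the paper's table:

```python
# Schematic: a parameterization as per-layer width exponents; each quantity
# is a base value times width ** (-exponent).
from dataclasses import dataclass

@dataclass
class LayerScaling:
    init_exp: float   # init std      ~ width ** -init_exp
    mult_exp: float   # multiplier    ~ width ** -mult_exp
    lr_exp: float     # learning rate ~ width ** -lr_exp

# Illustrative muP-like entries (Adam); not taken from the paper.
scheme = {
    "embedding": LayerScaling(init_exp=0.0, mult_exp=0.0, lr_exp=0.0),
    "hidden":    LayerScaling(init_exp=0.5, mult_exp=0.0, lr_exp=1.0),
    "readout":   LayerScaling(init_exp=1.0, mult_exp=0.0, lr_exp=1.0),
}

def lr_for(name, base_lr, width):
    return base_lr * width ** -scheme[name].lr_exp
```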
Infinite Limits of Multi-head Transformer Dynamics
In this work, we analyze various scaling limits of the training dynamics of transformer models
in the feature learning regime. We identify the set of parameterizations that admit well …
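One of the knobs such limits pin down is the attention logit scale: 1/sqrt(d_head) is the standard choice, while 1/d_head is the one typically required for a well-behaved feature-learning limit as head dimension grows. A sketch with the scale as an explicit exponent (illustrative, not the paper's code):

```python
# Attention with a configurable logit scale: exponent 0.5 gives the standard
# 1/sqrt(d_head); exponent 1.0 gives the 1/d_head scaling used in
# width-scaled feature-learning limits.
import torch

def attention(q, k, v, exponent=1.0):
    d_head = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d_head ** exponent
    return torch.softmax(logits, dim=-1) @ v

B, H, T, d_head = 2, 8, 16, 64
q, k, v = (torch.randn(B, H, T, d_head) for _ in range(3))
out = attention(q, k, v)   # shape (B, H, T, d_head)
```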
A gradient flow on control space with rough initial condition
We consider the (sub-Riemannian type) control problem of finding a path going from an
initial point $x$ to a target point $y$ by only moving in certain admissible directions. We …
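Schematically (my notation, a hedged reconstruction from the abstract alone), the setup is a driftless control system with admissible directions $X_1,\dots,X_m$ and a gradient flow on the space of controls $u$:

```latex
% control system steered by u, started at x:
\dot{x}_u(t) = \sum_{i=1}^{m} u_i(t)\, X_i\big(x_u(t)\big), \qquad x_u(0) = x,
% with u flowed toward a path reaching the target y, for instance
\partial_s u_s = -\nabla J(u_s), \qquad
J(u) = \tfrac{1}{2}\,\bigl| x_u(1) - y \bigr|^2 + \tfrac{\lambda}{2}\,\| u \|_{L^2}^2 .
```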
Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
Sharpness Aware Minimization (SAM) enhances performance across various neural
architectures and datasets. As models are continually scaled up to improve performance, a …
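For reference, one SAM step with a per-parameter-tensor (layerwise) perturbation normalization rather than the single global norm of vanilla SAM; the exact layerwise scaling the paper argues for may differ, so treat this as an illustrative baseline:

```python
# One SAM step where each parameter tensor is perturbed by rho * g / ||g||
# using its own gradient norm (layerwise), not the global gradient norm.
import torch

def sam_step(model, loss_fn, opt, rho=0.05, eps=1e-12):
    opt.zero_grad()
    loss_fn().backward()                               # gradients at w
    perturbed = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (p.grad.norm() + eps)   # layerwise normalization
            p.add_(e)
            perturbed.append((p, e))
    opt.zero_grad()
    loss_fn().backward()                               # gradients at w + e
    with torch.no_grad():
        for p, e in perturbed:
            p.sub_(e)                                  # restore original weights
    opt.step()                                         # descend with perturbed-point grads

# usage: sam_step(model, lambda: ((model(x) - y) ** 2).mean(), opt)
```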