Signal propagation in transformers: Theoretical perspectives and the role of rank collapse
Transformers have achieved remarkable success in several domains, ranging from natural
language processing to computer vision. Nevertheless, it has been recently shown that …
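A quick numerical probe of the rank-collapse effect this abstract refers to: stack softmax self-attention layers with random weights, without MLP blocks or skip connections, and track how far the token representations are from a rank-one matrix. The construction below (layer widths, scalings, and helper names are our own) is an illustrative sketch, not the paper's setup.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, rng, d_k=32):
    # Single-head self-attention with random (untrained) projections.
    d = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_k))          # row-stochastic attention weights
    Wo = rng.standard_normal((d_k, d)) / np.sqrt(d_k)
    return A @ V @ Wo

def rank1_residual(X):
    # Relative distance of X from the nearest rank-one matrix.
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum()) / np.sqrt((s ** 2).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64))                # 16 tokens, width 64
for layer in range(1, 11):
    X = attention_layer(X, rng)
    print(f"layer {layer:2d}  rank-1 residual = {rank1_residual(X):.3e}")
```

A shrinking residual indicates the token representations drifting toward a common direction, which is the qualitative behavior the abstract calls rank collapse.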
Evaluation of classification models in limited data scenarios with application to additive manufacturing
F Pourkamali-Anaraki, T Nasrin, RE Jensen… - … Applications of Artificial …, 2023 - Elsevier
This paper presents a novel framework that enables the generation of unbiased estimates
for test loss using fewer labeled samples, effectively evaluating the predictive performance …
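The paper's framework is not reproduced here; for orientation, the sketch below shows a plain baseline for the same problem: estimating mean test loss from a small random labeled subsample, with a standard error to convey how unstable the estimate is at small label budgets. Function names and the toy loss distribution are ours.

```python
import numpy as np

def estimate_test_loss(per_example_losses, n_labeled, seed=0):
    """Estimate mean test loss from a small random labeled subsample.

    Uniform sampling without replacement gives an unbiased estimate of the
    full test loss; the standard error shows how much it can move."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(per_example_losses), size=n_labeled, replace=False)
    sample = per_example_losses[idx]
    return sample.mean(), sample.std(ddof=1) / np.sqrt(n_labeled)

# Toy example: pretend these are per-example losses on a 10,000-example test set.
rng = np.random.default_rng(1)
losses = rng.gamma(shape=2.0, scale=0.3, size=10_000)
for n in (50, 200, 1000):
    m, se = estimate_test_loss(losses, n)
    print(f"n={n:5d}  estimate={m:.3f} +/- {1.96 * se:.3f}   (full mean={losses.mean():.3f})")
```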
Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models
Adam has been shown to outperform gradient descent in optimizing large language
transformers empirically, and by a larger margin than on other tasks, but it is unclear why this …
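What "heavy-tailed class imbalance" means for language models can be made concrete: next-token prediction treats every vocabulary item as a class, and token frequencies in natural text are roughly Zipfian. The sketch below uses a synthetic Zipf distribution with exponent 1 (an assumption for illustration, not data from the paper) to show how skewed the resulting class distribution is.

```python
import numpy as np

# A few classes dominate and most classes are rare under a Zipfian frequency law.
vocab_size = 50_000
ranks = np.arange(1, vocab_size + 1)
freqs = 1.0 / ranks                       # Zipf's law with exponent 1 (illustrative)
probs = freqs / freqs.sum()

print(f"mass on the 100 most frequent classes: {probs[:100].sum():.2%}")
print(f"mass on the rarest 50% of classes:     {probs[vocab_size // 2:].sum():.2%}")
```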
MetaFL: Privacy-preserving User Authentication in Virtual Reality with Federated Learning
The increasing popularity of virtual reality (VR) has stressed the importance of authenticating
VR users while preserving their privacy. Behavioral biometrics, owing to their robustness …
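The abstract mentions federated learning; for orientation, the sketch below implements one round of federated averaging (FedAvg), the canonical aggregation scheme in this area, on a toy logistic-regression task. It is a generic illustration, not MetaFL's protocol, and all function names are ours.

```python
import numpy as np

def local_update(weights, client_data, lr=0.1, epochs=1):
    """One client's local training: logistic-regression SGD on its own data only."""
    X, y = client_data
    w = weights.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-xi @ w))
            w -= lr * (p - yi) * xi          # gradient of the logistic loss
    return w

def fedavg_round(global_w, clients):
    """Server sends the global model out, clients train locally, server averages
    the returned models weighted by each client's dataset size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local = np.stack([local_update(global_w, c) for c in clients])
    return (local * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Toy run: 3 clients, 5-dimensional features, binary labels.
rng = np.random.default_rng(0)
clients = [(rng.standard_normal((20, 5)), rng.integers(0, 2, 20)) for _ in range(3)]
w = np.zeros(5)
for _ in range(5):
    w = fedavg_round(w, clients)
print("global weights after 5 rounds:", np.round(w, 3))
```

Raw data never leaves a client; only model weights are exchanged, which is the privacy-motivated design choice federated approaches rely on.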
An adaptive stochastic gradient method with non-negative Gauss-Newton stepsizes
We consider the problem of minimizing the average of a large number of smooth but
possibly non-convex functions. In the context of most machine learning applications, each …
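One way to read "non-negative Gauss-Newton stepsizes": for a non-negative per-example loss f_i, write f_i = r_i^2 / 2 and apply a damped Gauss-Newton step to r_i, which gives the stepsize gamma = c / (1 + c * ||grad f_i||^2 / (2 f_i)). The sketch below runs stochastic gradient descent with that stepsize on a toy least-squares problem; the exact form is our reconstruction of this stepsize family and may differ from the paper's rule.

```python
import numpy as np

def sgd_ngn(grad_fn, loss_fn, x0, data, c=0.5, epochs=10, seed=0):
    """SGD with a Gauss-Newton-type stepsize for non-negative per-example losses.

    gamma = c / (1 + c * ||g||^2 / (2 * f_i)) shrinks automatically when the
    gradient is large relative to the loss (our reconstruction, see lead-in)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            g = grad_fn(x, data[i])
            f = max(loss_fn(x, data[i]), 1e-12)   # guard against division by zero
            gamma = c / (1.0 + c * (g @ g) / (2.0 * f))
            x -= gamma * g
    return x

# Toy least-squares problem: loss_i = 0.5 * (a_i @ x - b_i)^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
data = list(zip(A, b))
loss_fn = lambda x, d: 0.5 * (d[0] @ x - d[1]) ** 2
grad_fn = lambda x, d: (d[0] @ x - d[1]) * d[0]
x_hat = sgd_ngn(grad_fn, loss_fn, np.zeros(5), data)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```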
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …
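A direct way to probe the learning-rate-transfer question empirically is to sweep learning rates at several widths and check whether the best one moves. The harness below (a two-layer ReLU network in standard parameterization trained with plain full-batch gradient descent, our own toy construction rather than the paper's setup) runs exactly that sweep.

```python
import numpy as np

def train_mlp(width, lr, steps=300, seed=0):
    """Train a 2-layer ReLU net on a fixed toy regression task; return the final
    training loss, or np.inf if training diverged."""
    rng = np.random.default_rng(seed)
    d, n = 20, 256
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
    W1 = rng.standard_normal((d, width)) / np.sqrt(d)       # fan-in init
    w2 = rng.standard_normal(width) / np.sqrt(width)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            h = np.maximum(X @ W1, 0.0)
            err = h @ w2 - y
            loss = 0.5 * np.mean(err ** 2)
            if not np.isfinite(loss):
                return np.inf
            e = err / n
            w2_old = w2.copy()
            w2 -= lr * (h.T @ e)
            W1 -= lr * (X.T @ (np.outer(e, w2_old) * (h > 0)))
    return loss

# Does the best learning rate shift as width grows?
lrs = np.logspace(-3, 1, 9)
for width in (64, 256, 1024):
    losses = [train_mlp(width, lr) for lr in lrs]
    best = int(np.argmin(losses))
    print(f"width={width:5d}  best lr={lrs[best]:.3g}  loss={losses[best]:.4f}")
```

The papers listed here study when and why the optimal learning rate stays put as width and depth grow; this harness only illustrates how such a check is run.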
Initial guessing bias: How untrained networks favor some classes
Understanding and controlling biasing effects in neural networks is crucial for ensuring
accurate and fair model performance. In the context of classification problems, we provide a …
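The bias in question can be observed directly: before any training, the class distribution predicted by a randomly initialized classifier over random inputs is usually far from uniform. A minimal sketch (architecture, initialization, and input distribution are our choices, not the paper's analysis):

```python
import numpy as np

def init_mlp(sizes, rng):
    # Gaussian fan-in initialization, biases at zero.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    H = X
    for i, (W, b) in enumerate(params):
        H = H @ W + b
        if i < len(params) - 1:
            H = np.maximum(H, 0.0)            # ReLU on hidden layers
    return H                                  # logits

rng = np.random.default_rng(0)
n_classes = 10
params = init_mlp([100, 256, 256, n_classes], rng)
X = rng.standard_normal((50_000, 100))        # random inputs, untrained network
preds = forward(params, X).argmax(axis=1)
counts = np.bincount(preds, minlength=n_classes)
print("fraction of inputs assigned to each class at initialization:")
print(np.round(counts / counts.sum(), 3))
```

For a typical weight draw the fractions are visibly unequal, which is the "initial guessing bias" named in the title.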
Deconstructing the Goldilocks Zone of Neural Network Initialization
The second-order properties of the training loss have a massive impact on the optimization
dynamics of deep learning models. Fort & Scherlis (2019) discovered that a high positive …
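The Fort & Scherlis observation referenced here concerns the curvature of the loss at initialization. One simple curvature statistic is Tr(H)/||H||_F, and the trace can be estimated with Hutchinson probes when the Hessian is only available through Hessian-vector products. The choice of statistic and the explicit toy Hessian below are our assumptions, not the paper's setup.

```python
import numpy as np

def hutchinson_trace(hvp, dim, n_probes=200, seed=0):
    """Estimate tr(H) via Hutchinson's estimator: E[z^T H z] = tr(H) for
    Rademacher probes z, using only Hessian-vector products."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ hvp(z)
    return total / n_probes

# Toy "Hessian": symmetric matrix shifted to have a positive trace, standing in
# for the loss curvature at initialization.
rng = np.random.default_rng(1)
dim = 500
M = rng.standard_normal((dim, dim))
H = (M + M.T) / 2 + 0.5 * np.eye(dim)

trace_est = hutchinson_trace(lambda v: H @ v, dim)
frob = np.linalg.norm(H)                       # Frobenius norm
print(f"estimated tr(H) = {trace_est:.1f}   exact tr(H) = {np.trace(H):.1f}")
print(f"curvature statistic tr(H)/||H||_F = {trace_est / frob:.3f}")
```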
Super Consistency of Neural Network Landscapes and Learning Rate Transfer
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …
FOSI: Hybrid First and Second Order Optimization
H Sivan, M Gabel, A Schuster - arXiv preprint arXiv:2302.08484, 2023 - arxiv.org
Though second-order optimization methods are highly effective, popular approaches in
machine learning such as SGD and Adam use only first-order information due to the difficulty …
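A common way to hybridize first- and second-order information, in the spirit of what this abstract describes, is to take curvature-scaled (Newton-type) steps only in a low-dimensional subspace spanned by the Hessian's leading eigenvectors and a plain gradient step in the orthogonal complement. The sketch below does this for an explicit ill-conditioned quadratic; it is a generic illustration of that idea, not FOSI's actual algorithm, and all names and constants are ours.

```python
import numpy as np

def hybrid_step(x, grad, hess, k=5, lr=5e-3):
    """One hybrid step: Newton in the span of the top-k Hessian eigenvectors
    (where a fixed learning rate behaves worst), plain gradient descent in the
    orthogonal complement."""
    g = grad(x)
    eigvals, eigvecs = np.linalg.eigh(hess(x))
    V, lam = eigvecs[:, -k:], eigvals[-k:]     # top-k eigenpairs
    g_top = V @ (V.T @ g)                      # gradient component in that subspace
    newton = V @ ((V.T @ g) / lam)             # curvature-scaled step in that subspace
    return x - newton - lr * (g - g_top)

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x, eigenvalues from 1e-2 to 1e3.
rng = np.random.default_rng(0)
dim = 50
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
A = Q @ np.diag(np.logspace(-2, 3, dim)) @ Q.T
grad = lambda x: A @ x
hess = lambda x: A

x = rng.standard_normal(dim)
print("initial f(x) =", round(0.5 * x @ A @ x, 3))
for _ in range(200):
    x = hybrid_step(x, grad, hess, k=5, lr=5e-3)
print("final   f(x) =", round(0.5 * x @ A @ x, 6))
```

The design point is that the few extreme-curvature directions get handled with second-order information, so the first-order learning rate only needs to suit the better-conditioned remainder.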