Signal propagation in transformers: Theoretical perspectives and the role of rank collapse

L Noci, S Anagnostidis, L Biggio… - Advances in …, 2022 - proceedings.neurips.cc
Transformers have achieved remarkable success in several domains, ranging from natural
language processing to computer vision. Nevertheless, it has been recently shown that …
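For reference, rank collapse refers to the token representations $X \in \mathbb{R}^{n \times d}$ converging toward a rank-one matrix as depth grows. One standard way to quantify it (an illustrative definition, not necessarily this paper's exact metric) is the residual to the best rank-one approximation, which by Eckart–Young equals

$$\mathrm{res}(X) \;=\; \min_{\mathrm{rank}(Y) \le 1} \|X - Y\|_F \;=\; \Big(\sum_{i \ge 2} \sigma_i(X)^2\Big)^{1/2},$$

where $\sigma_i(X)$ are the singular values of $X$; rank collapse corresponds to $\mathrm{res}(X) \to 0$.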

Evaluation of classification models in limited data scenarios with application to additive manufacturing

F Pourkamali-Anaraki, T Nasrin, RE Jensen… - … Applications of Artificial …, 2023 - Elsevier
This paper presents a novel framework that enables the generation of unbiased estimates
for test loss using fewer labeled samples, effectively evaluating the predictive performance …

Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

F Kunstner, R Yadav, A Milligan, M Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Adam has been shown to outperform gradient descent in optimizing large language
transformers empirically, and by a larger margin than on other tasks, but it is unclear why this …

MetaFL: Privacy-preserving User Authentication in Virtual Reality with Federated Learning

R Cheng, Y Wu, A Kundu, H Latapie, M Lee… - Proceedings of the …, 2024 - dl.acm.org
The increasing popularity of virtual reality (VR) has stressed the importance of authenticating
VR users while preserving their privacy. Behavioral biometrics, owing to their robustness …

An adaptive stochastic gradient method with non-negative Gauss-Newton stepsizes

A Orvieto, L Xiao - arXiv preprint arXiv:2407.04358, 2024 - arxiv.org
We consider the problem of minimizing the average of a large number of smooth but
possibly non-convex functions. In the context of most machine learning applications, each …
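For context, the setup described here is the standard finite-sum (empirical risk) objective, sketched below in generic notation rather than the paper's own:

$$\min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

where each component $f_i$ is smooth but possibly non-convex and $n$ is large, so methods typically access only stochastic gradients $\nabla f_i(x)$.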

Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning

L Noci, A Meterez, T Hofmann, A Orvieto - arXiv preprint arXiv:2402.17457, 2024 - arxiv.org
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …

Initial guessing bias: How untrained networks favor some classes

E Francazi, A Lucchi, M Baity-Jesi - arXiv preprint arXiv:2306.00809, 2023 - arxiv.org
Understanding and controlling biasing effects in neural networks is crucial for ensuring
accurate and fair model performance. In the context of classification problems, we provide a …

Deconstructing the Goldilocks Zone of Neural Network Initialization

A Vysogorets, A Dawid, J Kempe - arXiv preprint arXiv:2402.03579, 2024 - arxiv.org
The second-order properties of the training loss have a massive impact on the optimization
dynamics of deep learning models. Fort & Scherlis (2019) discovered that a high positive …

Super Consistency of Neural Network Landscapes and Learning Rate Transfer

L Noci, A Meterez, T Hofmann… - The Thirty-eighth Annual …, 2024 - openreview.net
Recently, there has been growing evidence that if the width and depth of a neural network
are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension) …

FOSI: Hybrid First and Second Order Optimization

H Sivan, M Gabel, A Schuster - arXiv preprint arXiv:2302.08484, 2023 - arxiv.org
Though second-order optimization methods are highly effective, popular approaches in
machine learning such as SGD and Adam use only first-order information due to the difficulty …
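To make the first/second-order split concrete, here is a minimal toy sketch of one such hybrid pattern on a quadratic: a Newton step restricted to a few extreme-curvature directions, plus a plain gradient step on the orthogonal complement. The problem instance and all names are illustrative assumptions; this is not claimed to be FOSI's exact algorithm.

```python
import numpy as np

# Toy hybrid first/second-order update on f(x) = 0.5 x^T A x - b^T x.
# Generic illustration of "Newton on a few extreme curvature directions,
# gradient descent on the rest"; NOT FOSI's exact scheme.

rng = np.random.default_rng(0)
d, k, lr, steps = 20, 3, 0.1, 200

# Random positive-definite quadratic (assumed problem, for illustration).
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
grad = lambda x: A @ x - b

# Top-k eigenpairs of the (constant) Hessian; a practical method would
# estimate these with Lanczos iterations on Hessian-vector products.
eigvals, eigvecs = np.linalg.eigh(A)
V = eigvecs[:, -k:]        # columns span the high-curvature subspace
Lam = eigvals[-k:]         # corresponding curvatures

x = np.zeros(d)
for _ in range(steps):
    g = grad(x)
    g_sub = V.T @ g                  # gradient component in the subspace
    newton = V @ (g_sub / Lam)       # Newton step within the subspace
    first_order = g - V @ g_sub      # component in the complement
    x -= newton + lr * first_order   # hybrid update

print("distance to optimum:", np.linalg.norm(x - np.linalg.solve(A, b)))
```

On this toy quadratic the subspace step converges in one iteration along the top-k directions, while the complement follows ordinary gradient descent; the appeal of such hybrids is getting curvature benefits while paying second-order cost only in a k-dimensional subspace.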