Nonconvex optimization meets low-rank matrix factorization: An overview
Substantial progress has been made recently on developing provably accurate and efficient
algorithms for low-rank matrix factorization via nonconvex optimization. While conventional …
On the implicit bias in deep-learning algorithms
G Vardi - Communications of the ACM, 2023 - dl.acm.org
Deep learning has been highly successful in recent years and has led to dramatic improvements in multiple domains …
Fine-tuning can distort pretrained features and underperform out-of-distribution
When transferring a pretrained model to a downstream task, two popular methods are full
fine-tuning (updating all the model parameters) and linear probing (updating only the last …
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
Deep learning: a statistical viewpoint
The remarkable practical success of deep learning has revealed some major surprises from
a theoretical perspective. In particular, simple gradient methods easily find near-optimal …
Trained transformers learn linear models in-context
Attention-based neural networks such as transformers have demonstrated a remarkable
ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an …
Understanding gradient descent on the edge of stability in deep learning
Deep learning experiments by Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
Birth of a transformer: A memory viewpoint
A Bietti, V Cabannes, D Bouchacourt… - Advances in …, 2024 - proceedings.neurips.cc
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …
Robust training under label noise by over-parameterization
Recently, over-parameterized deep networks, with increasingly more network parameters
than training samples, have dominated the performances of modern machine learning …
Gradient starvation: A learning proclivity in neural networks
We identify and formalize a fundamental gradient descent phenomenon resulting in a
learning proclivity in over-parameterized neural networks. Gradient Starvation arises when …