Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization
Despite extensive studies, the underlying reason as to why overparameterized neural
networks can generalize remains elusive. Existing theory shows that common stochastic …
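For context across the entries below: "sharpness" in this literature is usually formalized either as the largest eigenvalue (or trace) of the loss Hessian at a minimum, or as the worst-case loss increase inside a small perturbation ball. A standard formulation (not specific to this particular paper) is:

$$
S_\rho(w) \;=\; \max_{\|\epsilon\|_2 \le \rho} L(w+\epsilon) - L(w) \;\approx\; \frac{\rho^2}{2}\,\lambda_{\max}\!\left(\nabla^2 L(w)\right) \quad \text{near a minimum, where } \nabla L(w) \approx 0.
$$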
How Does Sharpness-Aware Minimization Minimize Sharpness?
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for
improving the generalization of deep neural networks for various settings. However, the …
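A minimal sketch of the first-order SAM update this abstract refers to (ascend within a $\rho$-ball, then descend using the gradient taken at the perturbed point); the toy quadratic objective and hyperparameters below are illustrative only:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One first-order SAM step: move to the (approximate) worst-case point in a
    rho-ball around w, then apply that point's gradient to the original w."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # first-order ascent direction
    return w - lr * grad_fn(w + eps)             # descend with the perturbed gradient

# illustrative quadratic L(w) = 0.5 * w^T A w with mismatched curvatures
A = np.diag([5.0, 1.0])
grad_fn = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad_fn)
print(w)  # settles in a small neighborhood of the minimum
```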
Gradient descent with linearly correlated noise: Theory and applications to differential privacy
We study gradient descent under linearly correlated noise. Our work is motivated by recent
practical methods for optimization with differential privacy (DP), such as DP-FTRL, which …
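As a rough illustration of the setting (not DP-FTRL's own noise mechanism), "linearly correlated noise" can be pictured as perturbations formed from linear combinations of past i.i.d. draws, here a simple AR(1) recursion:

```python
import numpy as np

rng = np.random.default_rng(0)

def gd_correlated_noise(grad_fn, w0, lr=0.05, sigma=0.1, beta=0.9, steps=200):
    """Gradient descent with injected noise xi_t = beta * xi_{t-1} + z_t,
    a simple linearly correlated (AR(1)) perturbation sequence."""
    w, xi = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        z = sigma * rng.standard_normal(w.shape)
        xi = beta * xi + z               # correlation across iterations
        w = w - lr * (grad_fn(w) + xi)
    return w

# toy quadratic objective, purely illustrative
A = np.diag([2.0, 0.5])
print(gd_correlated_noise(lambda w: A @ w, np.array([3.0, -2.0])))
```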
Optimized injection of noise in activation functions to improve generalization of neural networks
This paper proposes a flexible probabilistic activation function that enhances the training
and operation of artificial neural networks by intentionally injecting noise to gain additional …
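A generic way to picture activation-level noise injection (the paper's probabilistic activation is more elaborate than this sketch): perturb the pre-activation with Gaussian noise during training and use the deterministic activation at inference:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_relu(x, sigma=0.1, training=True):
    """ReLU whose pre-activation is perturbed by Gaussian noise at training time;
    inference uses the deterministic activation."""
    if training:
        x = x + sigma * rng.standard_normal(x.shape)
    return np.maximum(x, 0.0)

x = np.array([-0.5, 0.2, 1.5])
print(noisy_relu(x))                  # stochastic during training
print(noisy_relu(x, training=False))  # deterministic at test time
```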
Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent
It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes
solutions to flatter minima. A natural question to ask is: "Does the momentum parameter …
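For reference, the heavy-ball (Polyak momentum) iteration with step size $h$ and momentum $\beta$ that the abstract's question concerns; the quadratic objective below is illustrative:

```python
import numpy as np

def heavy_ball(grad_fn, w0, h=0.05, beta=0.9, steps=500):
    """Polyak heavy-ball: w_{t+1} = w_t - h * grad(w_t) + beta * (w_t - w_{t-1})."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(steps):
        w_next = w - h * grad_fn(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

A = np.diag([4.0, 1.0])
print(heavy_ball(lambda w: A @ w, np.array([2.0, -1.0])))
```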
On the theoretical properties of noise correlation in stochastic optimization
Studying the properties of stochastic noise to optimize complex non-convex functions has
been an active area of research in the field of machine learning. Prior work~\citep …
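One concrete correlation structure studied in this line of work is anticorrelated injection, where consecutive perturbations partially cancel; the sketch below contrasts it with i.i.d. injection and is an illustration rather than the paper's exact setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_gd(grad_fn, w0, lr=0.05, sigma=0.1, steps=300, anticorrelated=False):
    """Perturbed GD: w <- w - lr * grad(w) + xi_t, where xi_t is either i.i.d.
    Gaussian or anticorrelated (xi_t = z_t - z_{t-1})."""
    w, z_prev = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        z = sigma * rng.standard_normal(w.shape)
        xi = (z - z_prev) if anticorrelated else z
        z_prev = z
        w = w - lr * grad_fn(w) + xi
    return w

A = np.diag([5.0, 0.2])
print(perturbed_gd(lambda w: A @ w, np.array([1.0, 1.0]), anticorrelated=True))
```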
PAC-tuning: Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent
Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale
optimization problem, in which the choice of the training algorithm critically determines how …
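"Perturbed gradient descent" generically means injecting noise into the gradient update; one common form, sketched below, evaluates the gradient at noise-perturbed parameters. The noise scale `sigma` here is a hand-picked placeholder, whereas a PAC-driven scheme would set the noise level from a generalization bound rather than by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgd_step(w, grad_fn, lr=0.05, sigma=0.01):
    """Generic perturbed gradient descent: compute the gradient at a
    Gaussian-perturbed copy of the parameters, then update the originals.
    sigma is a placeholder, not a PAC-derived quantity."""
    w_noisy = w + sigma * rng.standard_normal(w.shape)
    return w - lr * grad_fn(w_noisy)

A = np.diag([3.0, 0.5])
w = np.array([1.0, -1.0])
for _ in range(200):
    w = pgd_step(w, lambda v: A @ v)
print(w)
```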
GIFT-SW: Gaussian noise injected fine-tuning of salient weights for LLMs
M Zhelnin, V Moskvoretskii, E Shvetsov… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized
the usage of Large Language Models (LLMs). Recent studies have shown that a small …
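As a purely hypothetical illustration (the actual salience criterion and noise placement in GIFT-SW may differ), fine-tuning with a salience mask and Gaussian noise injection could look like: update only the weights flagged as salient, and inject noise into the rest:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_noisy_update(W, grad_W, salient_mask, lr=1e-3, sigma=0.01):
    """Hypothetical sketch: gradient-update only the salient weights and
    inject Gaussian noise into the non-salient ones; GIFT-SW's real
    salience criterion and noise schedule may differ."""
    W = W - lr * grad_W * salient_mask
    W = W + sigma * rng.standard_normal(W.shape) * (1.0 - salient_mask)
    return W

W = rng.standard_normal((4, 4))
grad_W = rng.standard_normal((4, 4))
salient_mask = (np.abs(W) > 1.0).astype(float)  # placeholder salience: magnitude threshold
print(masked_noisy_update(W, grad_W, salient_mask))
```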
Why is parameter averaging beneficial in SGD? An objective smoothing perspective
A Nitanda, R Kikuchi, S Maeda… - … Conference on Artificial …, 2024 - proceedings.mlr.press
It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a
solution with good generalization performance; such implicit bias is often characterized in …
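The parameter averaging in question is a running average of SGD iterates (in the spirit of tail averaging); a minimal sketch with an illustrative noisy quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_with_averaging(grad_fn, w0, lr=0.05, steps=500, noise=0.3):
    """Noisy SGD that returns both the last iterate and the running average
    of all iterates."""
    w = w0.copy()
    w_avg = np.zeros_like(w0)
    for t in range(1, steps + 1):
        g = grad_fn(w) + noise * rng.standard_normal(w.shape)  # stochastic gradient
        w = w - lr * g
        w_avg += (w - w_avg) / t                               # incremental mean of iterates
    return w, w_avg

A = np.diag([2.0, 0.5])
last, avg = sgd_with_averaging(lambda w: A @ w, np.array([1.0, 1.0]))
print(last, avg)
```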