Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization

K Wen, Z Li, T Ma - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Despite extensive studies, the underlying reason as to why overparameterized neural
networks can generalize remains elusive. Existing theory shows that common stochastic …

How Sharpness-Aware Minimization Minimizes Sharpness?

K Wen, T Ma, Z Li - The Eleventh International Conference on …, 2023 - openreview.net
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for
improving the generalization of deep neural networks for various settings. However, the …
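
A rough, hedged illustration of the SAM update discussed in this entry: a minimal NumPy sketch of a single SAM step on a toy quadratic loss. The loss, the neighborhood radius rho, and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def loss(w):
    return 0.5 * np.sum(w ** 2)   # toy quadratic loss (assumption)

def grad(w):
    return w                      # its gradient

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    # Ascent step: move to the approximate worst-case point in a rho-ball around w.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: update w using the gradient evaluated at the perturbed point.
    return w - lr * grad(w + eps)

w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w)
print(loss(w))   # approaches zero, the minimum of the toy loss
```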

Gradient descent with linearly correlated noise: Theory and applications to differential privacy

A Koloskova, R McKenna, Z Charles… - Advances in …, 2023 - proceedings.neurips.cc
We study gradient descent under linearly correlated noise. Our work is motivated by recent
practical methods for optimization with differential privacy (DP), such as DP-FTRL, which …
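
A hedged sketch of the setting named here (not the DP-FTRL mechanism itself): gradient descent where the noise added at step t is a fixed linear combination of shared i.i.d. Gaussian variables, so the noise is linearly correlated across iterations. The toy loss, the correlating matrix B, and all constants are illustrative assumptions.

```python
import numpy as np

def grad(w):
    return w   # gradient of a toy quadratic loss (assumption)

T, d = 50, 2
Z = np.random.randn(T, d)                                     # shared i.i.d. Gaussian sources
B = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]   # rows are running averages
noise = B @ Z                                                 # linearly correlated noise sequence

w = np.array([1.0, -2.0])
for t in range(T):
    w = w - 0.1 * (grad(w) + 0.1 * noise[t])   # GD step with correlated injected noise
print(w)
```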

How does sharpness-aware minimization minimize sharpness?

K Wen, T Ma, Z Li - arXiv preprint arXiv:2211.05729, 2022 - arxiv.org
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for
improving the generalization of deep neural networks for various settings. However, the …

Optimized injection of noise in activation functions to improve generalization of neural networks

F Duan, F Chapeau-Blondeau, D Abbott - Chaos, Solitons & Fractals, 2024 - Elsevier
This paper proposes a flexible probabilistic activation function that enhances the training
and operation of artificial neural networks by intentionally injecting noise to gain additional …
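
A schematic, hedged sketch of noise injection in an activation function (not the specific flexible probabilistic activation proposed in this paper): Gaussian noise is added to the pre-activation during training only; the noise level sigma is an illustrative assumption rather than an optimized injection level.

```python
import numpy as np

def noisy_relu(z, sigma=0.1, training=True):
    if training:
        z = z + sigma * np.random.randn(*z.shape)   # inject noise before the nonlinearity
    return np.maximum(z, 0.0)

x = np.random.randn(4, 8)
print(noisy_relu(x, training=True).shape)    # noisy forward pass during training
print(noisy_relu(x, training=False).shape)   # deterministic forward pass at inference
```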

Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent

A Ghosh, H Lyu, X Zhang, R Wang - arXiv preprint arXiv:2302.00849, 2023 - arxiv.org
It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes
solutions to flatter minima. A natural question to ask is "Does the momentum parameter …
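
A minimal sketch of the heavy-ball (Polyak) momentum update referenced in this entry, on a toy quadratic with noisy gradients; the step size h, momentum beta, and noise scale are illustrative assumptions.

```python
import numpy as np

def noisy_grad(w, noise_scale=0.01):
    return w + noise_scale * np.random.randn(*w.shape)   # stochastic gradient of a toy quadratic

def heavy_ball_sgd(w0, h=0.1, beta=0.9, steps=200):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = beta * v - h * noisy_grad(w)   # momentum buffer accumulates past gradients
        w = w + v                          # heavy-ball update
    return w

print(heavy_ball_sgd(np.array([1.0, -2.0])))
```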

On the theoretical properties of noise correlation in stochastic optimization

A Lucchi, F Proske, A Orvieto… - Advances in Neural …, 2022 - proceedings.neurips.cc
Studying the properties of stochastic noise to optimize complex non-convex functions has
been an active area of research in the field of machine learning. Prior work …

PAC-tuning: Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent

G Liu, Z Xue, X Zhang, KM Johnson… - arXiv preprint arXiv …, 2023 - arxiv.org
Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale
optimization problem, in which the choice of the training algorithm critically determines how …
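
A hedged sketch of generic perturbed gradient descent (not the PAC-driven variant developed in the paper): parameters are perturbed with Gaussian noise before each gradient evaluation; the toy loss and the noise level sigma are illustrative assumptions.

```python
import numpy as np

def grad(w):
    return w   # gradient of a toy quadratic loss (assumption)

w, lr, sigma = np.array([1.0, -2.0]), 0.1, 0.05
for _ in range(200):
    w_noisy = w + sigma * np.random.randn(*w.shape)   # perturb the parameters
    w = w - lr * grad(w_noisy)                        # descend along the perturbed gradient
print(w)
```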

GIFT-SW: Gaussian noise injected fine-tuning of salient weights for LLMs

M Zhelnin, V Moskvoretskii, E Shvetsov… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized
the usage of Large Language Models (LLMs). Recent studies have shown that a small …

Why is parameter averaging beneficial in SGD? An objective smoothing perspective

A Nitanda, R Kikuchi, S Maeda… - … Conference on Artificial …, 2024 - proceedings.mlr.press
It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a
solution with good generalization performance; such implicit bias is often characterized in …
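
A minimal sketch of tail parameter averaging over SGD iterates (in the spirit of averaged SGD / stochastic weight averaging), on a toy quadratic with noisy gradients; the burn-in length and all constants are illustrative assumptions.

```python
import numpy as np

def noisy_grad(w):
    return w + 0.05 * np.random.randn(*w.shape)   # stochastic gradient of a toy quadratic

w = np.array([1.0, -2.0])
avg, n_avg = np.zeros_like(w), 0
for t in range(500):
    w = w - 0.1 * noisy_grad(w)     # plain SGD step
    if t >= 250:                    # start averaging after a burn-in phase
        n_avg += 1
        avg += (w - avg) / n_avg    # running mean of the tail iterates
print(avg)                          # averaged iterate, typically closer to the minimum
```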