Backtracking gradient descent method for general functions, with applications to Deep Learning

TT Truong, TH Nguyen - arXiv preprint arXiv:1808.05160, 2018 - arxiv.org
TT Truong, TH Nguyen
arXiv preprint arXiv:1808.05160, 2018arxiv.org
While Standard gradient descent is one very popular optimisation method, its convergence
cannot be proven beyond the class of functions whose gradient is globally Lipschitz
continuous. As such, it is not actually applicable to realistic applications such as Deep
Neural Networks. In this paper, we prove that its backtracking variant behaves very nicely, in
particular convergence can be shown for all Morse functions. The main theoretical result of
this paper is as follows. Theorem. Let $ f:\mathbb {R}^ k\rightarrow\mathbb {R} $ be a $ C …
While Standard gradient descent is one very popular optimisation method, its convergence cannot be proven beyond the class of functions whose gradient is globally Lipschitz continuous. As such, it is not actually applicable to realistic applications such as Deep Neural Networks. In this paper, we prove that its backtracking variant behaves very nicely, in particular convergence can be shown for all Morse functions. The main theoretical result of this paper is as follows. Theorem. Let be a function, and a sequence constructed from the Backtracking gradient descent algorithm. (1) Either or . (2) Assume that has at most countably many critical points. Then either or converges to a critical point of . (3) More generally, assume that all connected components of the set of critical points of are compact. Then either or is bounded. Moreover, in the latter case the set of cluster points of is connected. Some generalised versions of this result, including an inexact version, are included. Another result in this paper concerns the problem of saddle points. We then present a heuristic argument to explain why Standard gradient descent method works so well, and modifications of the backtracking versions of GD, MMT and NAG. Experiments with datasets CIFAR10 and CIFAR100 on various popular architectures verify the heuristic argument also for the mini-batch practice and show that our new algorithms, while automatically fine tuning learning rates, perform better than current state-of-the-art methods such as MMT, NAG, Adagrad, Adadelta, RMSProp, Adam and Adamax.
arxiv.org
以上显示的是最相近的搜索结果。 查看全部搜索结果