Newton-type methods for non-convex optimization under inexact Hessian information
We consider variants of trust-region and adaptive cubic regularization methods for non-
convex optimization, in which the Hessian matrix is approximated. Under certain condition …
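For reference (standard notation, not quoted from the abstract: g_k = \nabla f(x_k) and H_k \approx \nabla^2 f(x_k) is the inexact Hessian), the two method families solve the following subproblems at iterate x_k:

s_k \in \arg\min_{\|s\| \le \Delta_k} \; g_k^\top s + \tfrac{1}{2} s^\top H_k s \quad \text{(trust region)}

s_k \in \arg\min_{s} \; g_k^\top s + \tfrac{1}{2} s^\top H_k s + \tfrac{\sigma_k}{3} \|s\|^3 \quad \text{(adaptive cubic regularization)}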
[PDF] Distributed Second-Order Optimization using Kronecker-Factored Approximations.
As more computational resources become available, machine learning researchers train
ever larger neural networks on millions of data points using stochastic gradient descent …
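A minimal, single-layer sketch of the Kronecker-factored curvature idea behind this line of work (illustrative only; the function and variable names are my own, and this is not the paper's distributed implementation): for a fully connected layer, the curvature block is approximated by a Kronecker product of two small matrices built from the layer's input activations and pre-activation gradients, so the preconditioned update only needs two small inverses.

import numpy as np

def kfac_update(acts, grads_out, grad_W, damping=1e-3):
    """One K-FAC-style preconditioned step for a dense layer.

    acts:      (batch, d_in)  layer inputs a
    grads_out: (batch, d_out) gradients w.r.t. pre-activations s = W a
    grad_W:    (d_out, d_in)  gradient of the loss w.r.t. W
    The curvature block is approximated as A (x) G with A = E[a a^T],
    G = E[g g^T]; up to vec-ordering conventions, applying its inverse
    to grad_W reduces to G^{-1} grad_W A^{-1}.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n                      # (d_in, d_in) activation factor
    G = grads_out.T @ grads_out / n            # (d_out, d_out) gradient factor
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv              # preconditioned update direction

# Toy usage with random data for a 4 -> 3 layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 4))
grads_out = rng.normal(size=(32, 3))
grad_W = grads_out.T @ acts / 32
print(kfac_update(acts, grads_out, grad_W).shape)  # (3, 4)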
Inexact non-convex Newton-type methods
For solving large-scale non-convex problems, we propose inexact variants of trust region
and adaptive cubic regularization methods, which, to increase efficiency, incorporate various …
Distributed Newton methods for deep neural networks
Deep learning involves a difficult nonconvex optimization problem with a large number of
weights between any two adjacent layers of a deep structure. To handle large data sets or …
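A rough data-parallel sketch of the kind of computation such methods distribute (a toy linear least-squares example under my own naming, not the paper's algorithm): each worker forms Gauss-Newton matrix-vector products on its own data shard and the results are averaged, so the full curvature matrix is never materialized or communicated.

import numpy as np

def local_gnvp(X_shard, v, damping=1e-3):
    """Gauss-Newton product (X^T X / n + damping*I) v on one data shard.

    For a linear model f(w) = X w the Jacobian is X itself, so only
    matrix-vector products with the shard's data are needed.
    """
    n = X_shard.shape[0]
    return X_shard.T @ (X_shard @ v) / n + damping * v

def distributed_gnvp(shards, v):
    """Average the per-shard products, as a driver/parameter server would."""
    return sum(local_gnvp(X, v) for X in shards) / len(shards)

# Toy usage: 3 "workers", 5 parameters.
rng = np.random.default_rng(0)
shards = [rng.normal(size=(100, 5)) for _ in range(3)]
v = rng.normal(size=5)
print(distributed_gnvp(shards, v))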
Block-diagonal Hessian-free optimization for training neural networks
Second-order methods for neural network optimization have several advantages over
methods based on first-order gradient descent, including better scaling to large mini-batch …
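A compact illustration of the block-diagonal Hessian-free idea (a toy sketch under my own notation, not the paper's code): Hessian-vector products are obtained without ever forming the Hessian, here via a finite difference of gradients, and the Newton system is solved by conjugate gradients independently for each parameter block (e.g., per layer), which is what the block-diagonal approximation buys.

import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Hessian-vector product via a finite difference of gradients."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def cg(matvec, b, iters=50, tol=1e-8):
    """Plain conjugate gradients for matvec(x) = b."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy problem: a coupled quadratic loss split into two parameter blocks.
A = np.array([[3.0, 0.5, 0.1, 0.0],
              [0.5, 2.0, 0.0, 0.1],
              [0.1, 0.0, 4.0, 0.3],
              [0.0, 0.1, 0.3, 1.5]])
grad_fn = lambda w: A @ w - np.ones(4)
w = np.zeros(4)
blocks = [slice(0, 2), slice(2, 4)]             # e.g., one block per layer

step = np.zeros(4)
for blk in blocks:
    def block_matvec(v_blk, blk=blk):
        v = np.zeros(4)
        v[blk] = v_blk
        return hvp(grad_fn, w, v)[blk]          # keep only this block's rows
    step[blk] = cg(block_matvec, -grad_fn(w)[blk])
print(step)                                     # approximate block-Newton step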
Newton methods for convolutional neural networks
Deep learning involves a difficult non-convex optimization problem, which is often solved by
stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some …
High-order automatic differentiation of unmodified linear algebra routines via nilpotent matrices
BZ Dunham - 2017 - search.proquest.com
This work presents a new automatic differentiation method, Nilpotent Matrix Differentiation
(NMD), capable of propagating any order of mixed or univariate derivative through common …
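The first-order, univariate core of this idea fits in a few lines (a sketch of the general principle only, not of the dissertation's implementation, which also covers higher-order and mixed derivatives): replace the input x by the 2x2 matrix x*I + N with N nilpotent (N @ N = 0); pushing that matrix through ordinary polynomial and linear-algebra operations yields f(x) on the diagonal and f'(x) in the off-diagonal entry, since f(x*I + N) = f(x)*I + f'(x)*N.

import numpy as np

def nilpotent_input(x):
    """Represent x as x*I + N, where N = [[0, 1], [0, 0]] is nilpotent."""
    return np.array([[x, 1.0],
                     [0.0, x]])

def f(X):
    """f(x) = x**3 + 2*x, written only with matrix products and sums."""
    return X @ X @ X + 2.0 * X

X = nilpotent_input(1.5)
FX = f(X)
print(FX[0, 0])  # value:      1.5**3 + 2*1.5 = 6.375
print(FX[0, 1])  # derivative: 3*1.5**2 + 2   = 8.75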
[BOOK][B] Efficient Second-Order Methods for Non-Convex Optimization and Machine Learning
Z Yao - 2021 - search.proquest.com
Hessian-based analysis/computation is widely used in scientific computing. However, due to
the (incorrect, but in our experience widespread) belief that Hessian-based computations …