Loss Curve Approximations for Fast Neural Architecture Ranking & Training Elasticity Estimation

D Zhao, NC Frey, V Gadepally… - 2022 IEEE International Parallel and Distributed Processing …, 2022 - ieeexplore.ieee.org
Two key questions arise around the optimization of any deep learning task. First, when should we stop training, or, alternatively, how long should we train before the gains no longer justify the continued cost (i.e., early or optimal stopping)? Second, what is the "right" or best model: which training settings, hyperparameters, and model architecture maximize performance on the task at hand (i.e., architecture search)? Though essential, these questions are arguably also the most expensive and least clearly defined parts of deep learning experimentation. Moreover, the exhaustive searches they entail require large computational budgets that can carry significant energy expenditure and a large environmental footprint. In this paper, we introduce a new method, the Loss Curve Gradient Approximation (LCGA), that ranks model performance with minimal training. Using a wide variety of popular deep vision models, we test its predictive power and performance across different neural architectures and training settings. For a comparative analysis, we benchmark LCGA against an existing technique used in architecture search and performance ranking, Training Speed Estimation (TSE), and show that LCGA can significantly outperform TSE while retaining the same advantages in ease, speed, and efficiency. Lastly, we describe potential applications of LCGA beyond its primary use: (1) combining collected experimental data with LCGA to develop train-less NAS, and (2) a framework that more rigorously guides early stopping in training by borrowing the concept of demand elasticity from economics.
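The abstract only names the two ranking techniques, so the following is a minimal illustrative sketch rather than the paper's actual formulation: a TSE-style score sums training losses over recent epochs, while an LCGA-style score ranks architectures by the average slope (gradient) of their early loss curves. All function names, loss values, and architecture names here are hypothetical.

```python
# Illustrative sketch only: neither function is the paper's exact method.

def tse_score(losses, window=3):
    """TSE-style score: sum of training losses over the last `window`
    recorded epochs (lower suggests a faster learner)."""
    return sum(losses[-window:])

def lcga_score(losses):
    """LCGA-style score (assumption): average per-epoch change of the
    loss curve; a more negative slope suggests a better-ranked model."""
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical partial loss curves for three candidate architectures
curves = {
    "net_a": [2.3, 1.8, 1.5, 1.3],
    "net_b": [2.3, 2.1, 2.0, 1.9],
    "net_c": [2.3, 1.6, 1.2, 1.0],
}

# Rank candidates: steepest average descent first
ranking = sorted(curves, key=lambda k: lcga_score(curves[k]))
print(ranking)  # ['net_c', 'net_a', 'net_b']
```

Both scores need only a few epochs of training per candidate, which is what makes this family of estimators cheap compared with training every architecture to convergence.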