
Gradient-only surrogate to resolve learning rates for robust and consistent training of deep neural networks

Published in Applied Intelligence

Abstract

Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost, but it introduces an inherent sampling error that makes the selection of appropriate learning rates challenging. The sampling error can manifest as either bias or variance in a line search. Dynamic MBSS re-samples a mini-batch at every function evaluation, which results in point-wise discontinuous loss functions with smaller bias but larger variance than statically sampled loss functions. Dynamic MBSS offers larger data throughput during training, but the discontinuities it introduces must be resolved. This study extends the vanilla gradient-only surrogate line search (GOS-LS), a line search method for dynamic MBSS loss functions that builds quadratic approximation models from directional derivative information only. We also propose a conservative gradient-only surrogate line search (GOS-LSC) with strong convergence characteristics and a defined optimality criterion. For the first time, we investigate the performance of both GOS-LS and GOS-LSC with various optimizers, including SGD, RMSProp, and Adam, on ResNet-18 and EfficientNet-B0, and compare GOS-LS and GOS-LSC against other existing learning rate methods. We quantify both the best-performing and the most robust algorithms. For the latter, we introduce a relative robustness criterion that quantifies the difference between an algorithm and the best-performing algorithm on a given problem. The results show that training a model with the learning rate recommended for a class of search directions helps to reduce model errors in multimodal cases. The results also show that GOS-LS ranked first in both training and test results, while GOS-LSC ranked third in training and second in test results, among nine other learning rate strategies.
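To make the mechanism concrete, the sketch below illustrates the general idea of a gradient-only line search under dynamic MBSS: the loss along the search direction is modelled by a quadratic surrogate fitted from directional derivatives only (equivalently, a linear model of the directional derivative), and the learning rate is taken where the modelled directional derivative crosses zero. This is a minimal NumPy illustration and not the authors' GOS-LS or GOS-LSC implementation; the toy least-squares problem, the trial step alpha_trial, the step cap alpha_max, and the fallback behaviour are assumptions made for the example.

    # Minimal sketch of a gradient-only line search under dynamic MBSS
    # (illustrative only; not the authors' implementation).
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy least-squares problem: loss(w) = mean_i 0.5 * (x_i . w - y_i)^2
    n_samples, n_features, batch_size = 1024, 20, 64
    X = rng.normal(size=(n_samples, n_features))
    w_true = rng.normal(size=n_features)
    y = X @ w_true + 0.1 * rng.normal(size=n_samples)

    def minibatch_gradient(w):
        """Gradient on a freshly drawn mini-batch (dynamic MBSS: new batch per call)."""
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        return Xb.T @ (Xb @ w - yb) / batch_size

    def gradient_only_step(w, d, alpha_trial=1e-2, alpha_max=1.0):
        """Resolve a learning rate along direction d from two directional derivatives.

        Directional derivatives at alpha = 0 and alpha = alpha_trial are measured on
        independent mini-batches; the linear model of the directional derivative
        (i.e. a quadratic surrogate of the loss) is solved for its root.
        """
        dd0 = minibatch_gradient(w) @ d                    # f'(0) along d
        dd1 = minibatch_gradient(w + alpha_trial * d) @ d  # f'(alpha_trial) along d
        curvature = (dd1 - dd0) / alpha_trial              # slope of the linear model
        if curvature <= 0 or dd0 >= 0:
            # Non-convex or non-descent information: fall back to the trial step.
            return alpha_trial
        return float(np.clip(-dd0 / curvature, 0.0, alpha_max))

    # Plain SGD direction with the line-search-resolved learning rate.
    w = np.zeros(n_features)
    for step in range(200):
        d = -minibatch_gradient(w)          # steepest-descent search direction
        alpha = gradient_only_step(w, d)
        w = w + alpha * d

    print("final full-batch loss:", 0.5 * np.mean((X @ w - y) ** 2))

Because each directional derivative is evaluated on a fresh mini-batch, the surrogate is fitted to noisy, point-wise discontinuous information; resolving the step from the sign change of the directional derivative rather than from function values is what the abstract refers to as a gradient-only approach.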



Acknowledgements

This research was supported by the National Research Foundation (NRF), South Africa, and the Center for Asset Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa. We would like to express our special thanks to Nvidia Corporation for supplying the GPUs on which this research was conducted.

Author information

Corresponding author

Correspondence to Younghwan Chae.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chae, Y., Wilke, D.N. & Kafka, D. Gradient-only surrogate to resolve learning rates for robust and consistent training of deep neural networks. Appl Intell 53, 13741–13762 (2023). https://doi.org/10.1007/s10489-022-04206-8

