
Gradient-only surrogate to resolve learning rates for robust and consistent training of deep neural networks

Published in Applied Intelligence

Abstract

Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost, but it introduces an inherent sampling error that makes the selection of appropriate learning rates challenging. The sampling error can manifest as either bias or variance in a line search. Dynamic MBSS re-samples a mini-batch at every function evaluation, which results in point-wise discontinuous loss functions with smaller bias but larger variance than statically sampled loss functions. Dynamic MBSS offers larger data throughput during training, but the discontinuities it introduces must be resolved. This study extends the vanilla gradient-only surrogate line search (GOS-LS), a line search method for dynamic MBSS loss functions that builds quadratic approximation models from directional derivative information only. We also propose a conservative gradient-only surrogate line search (GOS-LSC) with strong convergence characteristics and a defined optimality criterion. For the first time, we investigate the performance of both GOS-LS and GOS-LSC with various optimizers, including SGD, RMSProp, and Adam, on ResNet-18 and EfficientNet-B0, and compare GOS-LS and GOS-LSC against other existing learning rate methods. We quantify both the best-performing and the most robust algorithms. For the latter, we introduce a relative robustness criterion that quantifies the difference between an algorithm and the best-performing algorithm on a given problem. The results show that training a model with the learning rate recommended for a class of search directions helps to reduce model errors in multimodal cases. The results also show that GOS-LS ranked first in both training and test results, while GOS-LSC ranked third in training and second in test results, among nine other learning rate strategies.
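To make the mechanism concrete, the sketch below illustrates the general idea of a gradient-only line search under dynamic MBSS: the loss along the search direction is modelled by a quadratic surrogate fitted from directional derivatives only (equivalently, a linear model of the directional derivative), and the learning rate is taken where the modelled directional derivative crosses zero. This is a minimal NumPy illustration and not the authors' GOS-LS or GOS-LSC implementation; the toy least-squares problem, the trial step alpha_trial, the step cap alpha_max, and the fallback behaviour are assumptions made for the example.

    # Minimal sketch of a gradient-only line search under dynamic MBSS
    # (illustrative only; not the authors' implementation).
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy least-squares problem: loss(w) = mean_i 0.5 * (x_i . w - y_i)^2
    n_samples, n_features, batch_size = 1024, 20, 64
    X = rng.normal(size=(n_samples, n_features))
    w_true = rng.normal(size=n_features)
    y = X @ w_true + 0.1 * rng.normal(size=n_samples)

    def minibatch_gradient(w):
        """Gradient on a freshly drawn mini-batch (dynamic MBSS: new batch per call)."""
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        return Xb.T @ (Xb @ w - yb) / batch_size

    def gradient_only_step(w, d, alpha_trial=1e-2, alpha_max=1.0):
        """Resolve a learning rate along direction d from two directional derivatives.

        Directional derivatives at alpha = 0 and alpha = alpha_trial are measured on
        independent mini-batches; the linear model of the directional derivative
        (i.e. a quadratic surrogate of the loss) is solved for its root.
        """
        dd0 = minibatch_gradient(w) @ d                    # f'(0) along d
        dd1 = minibatch_gradient(w + alpha_trial * d) @ d  # f'(alpha_trial) along d
        curvature = (dd1 - dd0) / alpha_trial              # slope of the linear model
        if curvature <= 0 or dd0 >= 0:
            # Non-convex or non-descent information: fall back to the trial step.
            return alpha_trial
        return float(np.clip(-dd0 / curvature, 0.0, alpha_max))

    # Plain SGD direction with the line-search-resolved learning rate.
    w = np.zeros(n_features)
    for step in range(200):
        d = -minibatch_gradient(w)          # steepest-descent search direction
        alpha = gradient_only_step(w, d)
        w = w + alpha * d

    print("final full-batch loss:", 0.5 * np.mean((X @ w - y) ** 2))

Because each directional derivative is evaluated on a fresh mini-batch, the surrogate is fitted to noisy, point-wise discontinuous information; resolving the step from the sign change of the directional derivative rather than from function values is what the abstract refers to as a gradient-only approach.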



Acknowledgements

This research was supported by the National Research Foundation (NRF), South Africa, and the Center for Asset Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa. We would like to express our special thanks to Nvidia Corporation for supplying the GPUs on which this research was conducted.

Author information

Corresponding author

Correspondence to Younghwan Chae.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chae, Y., Wilke, D.N. & Kafka, D. Gradient-only surrogate to resolve learning rates for robust and consistent training of deep neural networks. Appl Intell 53, 13741–13762 (2023). https://doi.org/10.1007/s10489-022-04206-8

