Abstract
We present a novel theoretical framework for computing large, adaptive learning rates. Our framework makes minimal assumptions about the activation functions used and exploits the functional properties of the loss function. Specifically, we show that the inverse of the Lipschitz constant of the loss function is an ideal learning rate. We analytically derive formulas for the Lipschitz constants of several loss functions and, through extensive experimentation, demonstrate the strength of our approach across several architectures and datasets. In addition, we detail the computation of learning rates when other optimizers, namely SGD with momentum, RMSprop, and Adam, are used. Compared to standard choices of learning rates, our approach converges faster and yields better results.
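As a minimal illustration of the core idea (not the paper's exact derivation), consider the least-squares loss, whose gradient is Lipschitz continuous with constant L equal to the largest eigenvalue of XᵀX/n; gradient descent with the Lipschitz-based step size 1/L then converges without any learning-rate tuning. The data and dimensions below are arbitrary choices for the sketch.

```python
import numpy as np

# Sketch: for the least-squares loss f(w) = ||Xw - y||^2 / (2n), the
# gradient is L-Lipschitz with L = lambda_max(X^T X / n), and the
# Lipschitz-based learning rate is simply 1/L.

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

# Lipschitz constant of the gradient, and the resulting learning rate
L = np.linalg.eigvalsh(X.T @ X / n).max()
lr = 1.0 / L

w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n  # gradient of the least-squares loss
    w -= lr * grad

print(np.allclose(w, w_true, atol=1e-3))
```

No learning-rate search is involved: the step size is computed directly from the data, which is the spirit of the framework.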
Notes
Note that this is a weaker condition than assuming that the gradient of the function is Lipschitz continuous; we exploit only the boundedness of the gradient.
Since f ∈ C² is required to obtain a lower bound on the number of iterations to convergence under the Lipschitz learning rate, the mean absolute error cannot be used as the loss function.
Funding
This work was supported by the Science and Engineering Research Board (SERB)-Department of Science and Technology (DST), Government of India (project reference number SERB-EMR/2016/005687). The funding source was not involved in the study design, writing of the report, or in the decision to submit this article for publication.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Appendix: Implementation Details
All our code was written using the Keras deep learning library. The architecture we used for MNIST was taken from a Kaggle Python notebook by Aditya Soni. For ResNets, we used the code from the Examples section of the Keras documentation. The DenseNet implementation we used was from a GitHub repository by Somshubra Majumdar. Finally, our implementation of SGD with momentum is a modified version of the Adam implementation in Keras. All our code, saved models, training logs, and images are available on GitHub at https://github.com/yrahul3910/adaptive-lr-dnn.
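The momentum variant can be sketched framework-independently. The following NumPy sketch is hypothetical and not the authors' Keras code: `sgd_momentum_step` and `lipschitz_lr` are illustrative names, with `lipschitz_lr` standing in for the per-loss formulas derived in the paper, and the toy objective f(w) = ½‖w‖² (whose gradient is 1-Lipschitz, so lr = 1) chosen only to make the example self-contained.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lipschitz_lr, beta=0.9):
    """One SGD-with-momentum update using a Lipschitz-based learning rate.

    Illustrative sketch only: the paper's actual implementation modifies
    the Keras Adam optimizer and recomputes the rate from the loss.
    """
    velocity = beta * velocity + grad   # accumulate momentum
    w = w - lipschitz_lr * velocity     # step with lr = 1/L
    return w, velocity

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient (w itself)
# is 1-Lipschitz, giving lipschitz_lr = 1.0.
w = np.array([3.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lipschitz_lr=1.0)
print(np.allclose(w, 0.0, atol=1e-6))
```

With momentum the effective step is amplified by roughly 1/(1 − β), which is why the paper treats the momentum, RMSprop, and Adam cases separately rather than reusing the plain-SGD rate unchanged.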
Cite this article
Yedida, R., Saha, S. & Prashanth, T. LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51, 1460–1478 (2021). https://doi.org/10.1007/s10489-020-01892-0