LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence


Abstract

We present a novel theoretical framework for computing large, adaptive learning rates. Our framework makes minimal assumptions about the activations used and exploits the functional properties of the loss function. Specifically, we show that the inverse of the Lipschitz constant of the loss function is an ideal learning rate. We analytically compute formulas for the Lipschitz constant of several loss functions and, through extensive experimentation, demonstrate the strength of our approach using several architectures and datasets. In addition, we detail the computation of learning rates when other optimizers, namely SGD with momentum, RMSprop, and Adam, are used. Compared to standard choices of learning rate, our approach converges faster and yields better results.
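
To make the idea concrete, consider the simplest case of logistic regression with a sigmoid output and binary cross-entropy averaged over m examples. The gradient with respect to the weights is (1/m) Σᵢ (σ(w·xᵢ) − yᵢ) xᵢ; since each residual has magnitude at most 1, the gradient norm is bounded by the mean feature-vector norm, and the reciprocal of that bound gives a Lipschitz-based learning rate. The sketch below is a minimal illustration under exactly these assumptions (the function name and this particular bound are illustrative, not the formulas derived in the paper for every loss):

    import numpy as np

    def lipschitz_lr_logistic(X):
        """Reciprocal of an upper bound on the gradient norm of the averaged
        binary cross-entropy loss for logistic regression.

        grad = (1/m) * sum_i (sigmoid(w . x_i) - y_i) * x_i, and since
        |sigmoid(z) - y_i| <= 1, we have ||grad|| <= (1/m) * sum_i ||x_i||.
        """
        L = np.mean(np.linalg.norm(X, axis=1))  # (1/m) * sum_i ||x_i||
        return 1.0 / L

    # Example: 1000 standardized samples with 20 features.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    print(f"Lipschitz-based learning rate: {lipschitz_lr_logistic(X):.4f}")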


Notes

  1. Note that this is a weaker condition than assuming that the gradient of the function is Lipschitz continuous; we exploit merely the boundedness of the gradient (both conditions are written out after these notes).

  2. Since f ∈ C² is a condition for obtaining a lower bound on the number of iterations to convergence for the choice of the Lipschitz learning rate, the Mean Absolute Error cannot be used as the loss function.

  3. https://www.kaggle.com/adityaecdrid/mnist-with-keras-for-beginners-99457

  4. https://keras.io/examples/cifar10_resnet/

  5. https://github.com/titu1994/DenseNet

  6. https://github.com/keras-team/keras/blob/master/keras/optimizers.py#L436
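
For reference, the two conditions contrasted in Note 1 can be written out explicitly (standard definitions, with K and L denoting the respective constants):

    % f is Lipschitz continuous; for differentiable f, equivalent to a bounded gradient, sup ||grad f|| <= K:
    \lvert f(x) - f(y) \rvert \;\le\; K \,\lVert x - y \rVert \qquad \text{for all } x, y,

    % the stronger smoothness assumption: the gradient of f is Lipschitz continuous:
    \lVert \nabla f(x) - \nabla f(y) \rVert \;\le\; L \,\lVert x - y \rVert \qquad \text{for all } x, y.

Our framework requires only the first, weaker condition. The additional f ∈ C² requirement in Note 2 excludes the Mean Absolute Error because the absolute value function is not twice differentiable at zero.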


Funding

This work was supported by the Science and Engineering Research Board (SERB)-Department of Science and Technology (DST), Government of India (project reference number SERB-EMR/2016/005687). The funding source was not involved in the study design, writing of the report, or in the decision to submit this article for publication.

Author information

Corresponding author

Correspondence to Rahul Yedida.

Ethics declarations

Conflict of interest

The authors declare that they have no other conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Implementation Details


All our code was written using the Keras deep learning library. The architecture we used for MNIST was taken from a Kaggle Python notebook by Aditya Soni.Footnote 3 For ResNets, we used the code from the Examples section of the Keras documentation.Footnote 4 The DenseNet implementation we used was from a GitHub repository by Somshubra Majumdar.Footnote 5 Finally, our implementation of SGD with momentum is a modified version of the Adam implementation in Keras.Footnote 6 All our code, saved models, training logs, and images are available on GitHub at https://github.com/yrahul3910/adaptive-lr-dnn.
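
For readers who want a feel for how such a rate plugs into a Keras training loop, the fragment below is a minimal, hypothetical sketch rather than our released code (which is at the repository above): it computes the reciprocal of a simple data-dependent gradient-norm bound once from the flattened MNIST inputs and hands it to a plain SGD optimizer; the helper name and the particular bound are illustrative.

    import numpy as np
    import tensorflow as tf

    def lipschitz_lr(features):
        """Reciprocal of an upper bound on the gradient norm: (1/m) * sum_i ||x_i||."""
        return 1.0 / np.mean(np.linalg.norm(features, axis=1))

    # Flattened, rescaled MNIST inputs stand in for the quantities the bound
    # is computed from in the full method.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    lr = lipschitz_lr(x_train)  # computed once, before training
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=128, epochs=5)

If the bound needs to be refreshed during training, a standard tf.keras.callbacks.LearningRateScheduler can apply a newly computed value at the start of each epoch.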


Cite this article

Yedida, R., Saha, S. & Prashanth, T. LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51, 1460–1478 (2021). https://doi.org/10.1007/s10489-020-01892-0

