Abstract
We present a novel theoretical framework for computing large, adaptive learning rates. Our framework makes minimal assumptions about the activation functions used and exploits the functional properties of the loss function. Specifically, we show that the inverse of the Lipschitz constant of the loss function is an ideal learning rate. We analytically derive formulas for the Lipschitz constants of several loss functions and, through extensive experimentation, demonstrate the strength of our approach across several architectures and datasets. In addition, we detail the computation of learning rates when other optimizers, namely SGD with momentum, RMSprop, and Adam, are used. Compared to standard choices of learning rates, our approach converges faster and yields better results.
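As a minimal illustration of the core idea (not the paper's exact derivation), consider the least-squares loss, whose gradient is Lipschitz continuous with constant L equal to the largest eigenvalue of XᵀX/n; gradient descent with the Lipschitz-based step size 1/L then converges without any learning-rate tuning. The data and dimensions below are arbitrary choices for the sketch.

```python
import numpy as np

# Sketch: for the least-squares loss f(w) = ||Xw - y||^2 / (2n), the
# gradient is L-Lipschitz with L = lambda_max(X^T X / n), and the
# Lipschitz-based learning rate is simply 1/L.

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

# Lipschitz constant of the gradient, and the resulting learning rate
L = np.linalg.eigvalsh(X.T @ X / n).max()
lr = 1.0 / L

w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n  # gradient of the least-squares loss
    w -= lr * grad

print(np.allclose(w, w_true, atol=1e-3))
```

No learning-rate search is involved: the step size is computed directly from the data, which is the spirit of the framework.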
Notes
Note that this is a weaker condition than assuming that the gradient of the function is Lipschitz continuous; we exploit only the boundedness of the gradient.
Since f ∈ C² is required to obtain a lower bound on the number of iterations to convergence under the Lipschitz learning rate, the mean absolute error cannot be used as the loss function.
Funding
This work was supported by the Science and Engineering Research Board (SERB)-Department of Science and Technology (DST), Government of India (project reference number SERB-EMR/2016/005687). The funding source was not involved in the study design, writing of the report, or in the decision to submit this article for publication.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Appendix: Implementation Details
All our code was written using the Keras deep learning library. The architecture we used for MNIST was taken from a Kaggle Python notebook by Aditya Soni. For ResNets, we used the code from the Examples section of the Keras documentation. The DenseNet implementation we used was from a GitHub repository by Somshubra Majumdar. Finally, our implementation of SGD with momentum is a modified version of the Adam implementation in Keras. All our code, saved models, training logs, and images are available on GitHub at https://github.com/yrahul3910/adaptive-lr-dnn.
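The momentum variant can be sketched framework-independently. The following NumPy sketch is hypothetical and not the authors' Keras code: `sgd_momentum_step` and `lipschitz_lr` are illustrative names, with `lipschitz_lr` standing in for the per-loss formulas derived in the paper, and the toy objective f(w) = ½‖w‖² (whose gradient is 1-Lipschitz, so lr = 1) chosen only to make the example self-contained.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lipschitz_lr, beta=0.9):
    """One SGD-with-momentum update using a Lipschitz-based learning rate.

    Illustrative sketch only: the paper's actual implementation modifies
    the Keras Adam optimizer and recomputes the rate from the loss.
    """
    velocity = beta * velocity + grad   # accumulate momentum
    w = w - lipschitz_lr * velocity     # step with lr = 1/L
    return w, velocity

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient (w itself)
# is 1-Lipschitz, giving lipschitz_lr = 1.0.
w = np.array([3.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lipschitz_lr=1.0)
print(np.allclose(w, 0.0, atol=1e-6))
```

With momentum the effective step is amplified by roughly 1/(1 − β), which is why the paper treats the momentum, RMSprop, and Adam cases separately rather than reusing the plain-SGD rate unchanged.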
Cite this article
Yedida, R., Saha, S. & Prashanth, T. LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51, 1460–1478 (2021). https://doi.org/10.1007/s10489-020-01892-0