Abstract
RMSProp is one of the most popular stochastic optimization algorithms in deep learning applications. However, recent work has shown that it may fail to converge to the optimal solution even in simple convex settings. To address this issue, we propose a time-varying version of RMSProp that fixes the non-convergence problem. Specifically, the hyperparameter \(\beta _t\) is treated as a time-varying sequence rather than a fine-tuned constant. We also provide a rigorous proof that this variant of RMSProp converges to critical points even for smooth, non-convex objectives, with a convergence rate of order \(\mathcal {O}(\log T/\sqrt{T})\). This provides a new understanding of the divergence of RMSProp, a common issue in practical applications. Finally, numerical experiments show that time-varying RMSProp exhibits advantages over standard RMSProp on benchmark datasets and support the theoretical results.
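The modification described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact algorithm: the schedule \(\beta _t = 1 - 1/t\) and the step size \(\alpha /\sqrt{t}\) are assumed choices for a coefficient that approaches 1 over time, consistent with the idea of a time-varying sequence but not necessarily the schedule analysed in the paper.

```python
import numpy as np

def time_varying_rmsprop(grad, x0, steps=2000, alpha=0.1, eps=1e-8):
    """RMSProp in which the second-moment coefficient beta_t varies with t.

    Illustrative schedules (assumptions, not taken from the paper):
    beta_t = 1 - 1/t, so beta_t -> 1, and step size alpha / sqrt(t).
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # running second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        beta_t = 1.0 - 1.0 / t                # time-varying, approaches 1
        v = beta_t * v + (1.0 - beta_t) * g * g
        x = x - (alpha / np.sqrt(t)) * g / (np.sqrt(v) + eps)
    return x

# Example: minimise the smooth test function f(x) = ||x||^2 / 2, whose
# gradient is simply x; the iterates should approach the minimiser 0.
x_star = time_varying_rmsprop(lambda x: x, x0=[3.0, -2.0])
```

The key difference from standard RMSProp is that `beta_t` is recomputed each iteration instead of being a fixed constant such as 0.9, which is the property the convergence analysis exploits.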
Data availability
The datasets analysed during the current study are available in the following public domain resources: http://yann.lecun.com/exdb/mnist/; http://www.cs.toronto.edu/~kriz/cifar.html.
Funding
This work was funded in part by the National Natural Science Foundation of China (Nos. 62176051, 61671099), in part by the National Key R&D Program of China (No. 2020YFA0714102), and in part by the Fundamental Research Funds for the Central Universities of China (No. 2412020FZ024).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Ethics approval is not applicable for this research.
Cite this article
Liu, J., Xu, D., Zhang, H. et al. On hyper-parameter selection for guaranteed convergence of RMSProp. Cogn Neurodyn (2022). https://doi.org/10.1007/s11571-022-09845-8