
On hyper-parameter selection for guaranteed convergence of RMSProp

  • Research Article
  • Published in Cognitive Neurodynamics

Abstract

RMSProp is one of the most popular stochastic optimization algorithms in deep learning applications. However, recent work has pointed out that this method may fail to converge to the optimal solution even in simple convex settings. To address this issue, we propose a time-varying version of RMSProp that resolves the non-convergence problem. Specifically, the hyper-parameter \(\beta _t\) is treated as a time-varying sequence rather than a fine-tuned constant. We also provide a rigorous proof that the proposed RMSProp converges to critical points even for smooth and non-convex objectives, with a convergence rate of order \(\mathcal {O}(\log T/\sqrt{T})\). This provides a new understanding of the divergence of RMSProp, a common issue in practical applications. Finally, numerical experiments on benchmark datasets show that time-varying RMSProp outperforms standard RMSProp and support the theoretical results.
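
For concreteness, the following sketch implements the update rule with a time-varying averaging parameter. It is a minimal illustration only: the schedule \(\beta _t = 1-1/(t+1)\), the decaying step size, and all function names are assumptions made for this example, not the exact choices analysed in the paper.

import numpy as np

def time_varying_rmsprop(grad_fn, x0, T=1000, alpha=0.01, eps=1e-8):
    """Sketch of RMSProp whose averaging parameter beta_t varies over time.

    grad_fn(x, t) returns a (possibly stochastic) gradient at the iterate x.
    The schedule beta_t = 1 - 1/(t + 1) below is an illustrative assumption.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # running average of squared gradients
    for t in range(1, T + 1):
        g = grad_fn(x, t)
        beta_t = 1.0 - 1.0 / (t + 1)          # time-varying hyper-parameter, tends to 1
        v = beta_t * v + (1.0 - beta_t) * g ** 2
        x = x - (alpha / np.sqrt(t)) * g / (np.sqrt(v) + eps)  # decaying step size (assumed)
    return x

# Toy usage: minimise f(x) = ||x||^2 / 2 from noisy gradient estimates.
rng = np.random.default_rng(0)
x_final = time_varying_rmsprop(lambda x, t: x + 0.1 * rng.standard_normal(x.shape),
                               x0=np.ones(5))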


Data availability

The datasets analysed during the current study are available in the following public domain resources: http://yann.lecun.com/exdb/mnist/; http://www.cs.toronto.edu/~kriz/cifar.html.
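
As a convenience, the sketch below shows one way to obtain these two datasets programmatically with torchvision (the paper's notes point to PyTorch). The storage path, batch size, and plain ToTensor preprocessing are assumptions for illustration, not the exact pipeline used in the study.

import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# MNIST (http://yann.lecun.com/exdb/mnist/) and CIFAR-10
# (http://www.cs.toronto.edu/~kriz/cifar.html) as served by torchvision.
mnist_train = datasets.MNIST(root="./data", train=True, download=True,
                             transform=to_tensor)
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=to_tensor)

mnist_loader = torch.utils.data.DataLoader(mnist_train, batch_size=128, shuffle=True)
cifar_loader = torch.utils.data.DataLoader(cifar_train, batch_size=128, shuffle=True)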

Notes

  1. https://pytorch.org

  2. https://www.tensorflow.org

  3. https://mxnet.apache.org

  4. https://github.com/kuangliu/pytorch-cifar

  5. https://github.com/soundsinteresting/RMSprop

  6. https://ax.dev/versions/0.1.1/tutorials


Funding

This work was funded in part by the National Natural Science Foundation of China (Nos. 62176051, 61671099), in part by the National Key R&D Program of China (No. 2020YFA0714102), and in part by the Fundamental Research Funds for the Central Universities of China (No. 2412020FZ024).

Author information

Corresponding author

Correspondence to Dongpo Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

Ethics approval is not applicable for this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, J., Xu, D., Zhang, H. et al. On hyper-parameter selection for guaranteed convergence of RMSProp. Cogn Neurodyn (2022). https://doi.org/10.1007/s11571-022-09845-8

