Abstract
RMSProp is one of the most popular stochastic optimization algorithms in deep learning applications. However, recent work has shown that it may fail to converge to the optimal solution even in simple convex settings. To address this issue, we propose a time-varying version of RMSProp that fixes the non-convergence problem. Specifically, the hyperparameter \(\beta _t\) is treated as a time-varying sequence rather than a fine-tuned constant. We also provide a rigorous proof that this variant of RMSProp converges to critical points even for smooth, non-convex objectives, with a convergence rate of order \(\mathcal {O}(\log T/\sqrt{T})\). This provides a new understanding of the divergence of RMSProp, a common issue in practical applications. Finally, numerical experiments show that time-varying RMSProp exhibits advantages over standard RMSProp on benchmark datasets and support the theoretical results.
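The modification described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact algorithm: the schedule \(\beta _t = 1 - 1/t\) and the step size \(\alpha /\sqrt{t}\) are assumed choices for a coefficient that approaches 1 over time, consistent with the idea of a time-varying sequence but not necessarily the schedule analysed in the paper.

```python
import numpy as np

def time_varying_rmsprop(grad, x0, steps=2000, alpha=0.1, eps=1e-8):
    """RMSProp in which the second-moment coefficient beta_t varies with t.

    Illustrative schedules (assumptions, not taken from the paper):
    beta_t = 1 - 1/t, so beta_t -> 1, and step size alpha / sqrt(t).
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # running second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        beta_t = 1.0 - 1.0 / t                # time-varying, approaches 1
        v = beta_t * v + (1.0 - beta_t) * g * g
        x = x - (alpha / np.sqrt(t)) * g / (np.sqrt(v) + eps)
    return x

# Example: minimise the smooth test function f(x) = ||x||^2 / 2, whose
# gradient is simply x; the iterates should approach the minimiser 0.
x_star = time_varying_rmsprop(lambda x: x, x0=[3.0, -2.0])
```

The key difference from standard RMSProp is that `beta_t` is recomputed each iteration instead of being a fixed constant such as 0.9, which is the property the convergence analysis exploits.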
Data availability
The datasets analysed during the current study are available in the following public domain resources: http://yann.lecun.com/exdb/mnist/; http://www.cs.toronto.edu/~kriz/cifar.html.
Funding
This work was funded in part by the National Natural Science Foundation of China (Nos. 62176051, 61671099), in part by the National Key R&D Program of China (No. 2020YFA0714102), and in part by the Fundamental Research Funds for the Central Universities of China (No. 2412020FZ024).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Ethics approval is not applicable for this research.
Cite this article
Liu, J., Xu, D., Zhang, H. et al. On hyper-parameter selection for guaranteed convergence of RMSProp. Cogn Neurodyn (2022). https://doi.org/10.1007/s11571-022-09845-8