Adaptive Learning Rate and Momentum for Training Deep Neural Networks

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12977)

Abstract

Recent progress in deep learning relies heavily on the quality and efficiency of training algorithms. In this paper, we develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework: the Conjugate Gradient with Quadratic line-search (CGQ) method. On the one hand, a quadratic line-search determines the step size according to the current loss landscape. On the other hand, the momentum factor is dynamically updated by computing the conjugate gradient parameter (as in Polak-Ribière). Theoretical results ensuring the convergence of our method in strongly convex settings are developed, and experiments on image classification datasets show that our method converges faster than other local solvers and generalizes better (higher test-set accuracy). A major advantage of the proposed method is that it avoids tedious hand-tuning of hyperparameters such as the learning rate and momentum.
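The abstract names two adaptive ingredients: a quadratic (parabolic) line search that picks the step size from a few loss evaluations along the current search direction, and a Polak-Ribière-style conjugate gradient coefficient that takes the place of a hand-tuned momentum factor. The sketch below illustrates that general recipe on a toy strongly convex problem; it is not the authors' CGQ implementation, and the function names (quadratic_step_size, cgq_like_optimize), the probe points, and the fallback rules are illustrative assumptions.

```python
# Minimal sketch of a CG-style update with a parabolic line search and a
# Polak-Ribiere (PR+) momentum coefficient, on a toy strongly convex quadratic.
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite -> strongly convex
b = np.array([1.0, -2.0])

def loss(w):
    # Toy quadratic objective standing in for a training loss.
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

def quadratic_step_size(w, d, t=1.0):
    # Fit a parabola to the loss along direction d through steps 0, t/2, t
    # and return the parabola's minimizer as the step size.
    f0, f1, f2 = loss(w), loss(w + 0.5 * t * d), loss(w + t * d)
    curvature = f0 - 2.0 * f1 + f2          # > 0 iff the fitted parabola opens upward
    if curvature <= 1e-12:                  # flat or concave fit: fall back to the probe step
        return t
    return t * (3.0 * f0 - 4.0 * f1 + f2) / (4.0 * curvature)

def cgq_like_optimize(w, steps=50):
    g_prev = grad(w)
    d = -g_prev                             # first direction: steepest descent
    for _ in range(steps):
        alpha = quadratic_step_size(w, d)   # adaptive "learning rate"
        w = w + alpha * d
        g = grad(w)
        # Polak-Ribiere coefficient, clipped at zero (PR+): adaptive "momentum".
        beta = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev + 1e-12))
        d = -g + beta * d                   # conjugate-gradient-style direction
        g_prev = g
    return w

print(cgq_like_optimize(np.zeros(2)))       # approaches the exact minimizer below
print(np.linalg.solve(A, b))
```

In a neural-network setting the loss and gradient would come from a mini-batch and the directions would live in parameter space, but the shape of the update is the same: fit a parabola along the direction to choose the step size, then blend the new negative gradient with the previous direction using the Polak-Ribière coefficient.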

References

  1. Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J. Math. 16(1), 1–3 (1966)

  2. Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: ICLR (2018)

  3. Berrada, L., Zisserman, A., Kumar, M.P.: Training neural networks for and by interpolation. In: International Conference on Machine Learning (2020)

  4. Bhaya, A., Kaszkurewicz, E.: Steepest descent with momentum for quadratic functions is a version of the conjugate gradient method. Neural Netw. 17, 65–71 (2004). https://doi.org/10.1016/S0893-6080(03)00170-9

  5. Dai, Y.H., Yuan, Y.X.: A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 10(1), 177–182 (1999)

  6. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654. Curran Associates, Inc. (2014)

  7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(61), 2121–2159 (2011)

  8. Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. Comput. J. 7(2), 149–154 (1964). https://doi.org/10.1093/comjnl/7.2.149

  9. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952)

  10. Jin, X.B., Zhang, X.Y., Huang, K., Geng, G.G.: Stochastic conjugate gradient algorithm with variance reduction. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1360–1369 (2019). https://doi.org/10.1109/TNNLS.2018.2868835

  11. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates, Inc. (2013)

  12. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

  13. Kobayashi, Y., Iiduka, H.: Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning (2020). http://arxiv.org/abs/2003.00231

  14. Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: ICML (2011)

  15. Loizou, N., Vaswani, S., Laradji, I., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–33 (2021)

  16. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv:1711.05101 (2017)

  17. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J. 7(2), 155–162 (1964)

  18. Mutschler, M., Zell, A.: Parabolic approximation line search for DNNs. In: NeurIPS (2020)

  19. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math. Dokl. 27, 372–376 (1983)

  20. Orabona, F., Pal, D.: Coin betting and parameter-free online learning. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 577–585. Curran Associates, Inc. (2016)

  21. Pesme, S., Dieuleveut, A., Flammarion, N.: On convergence-diagnostic based step sizes for stochastic gradient descent. In: Proceedings of the International Conference on Machine Learning (ICML 2020) (2020)

  22. Polak, E., Ribière, G.: Note sur la convergence de méthodes de directions conjuguées. ESAIM: Math. Model. Numer. Anal. 3(R1), 35–43 (1969)

  23. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  24. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: The 35th International Conference on Machine Learning (ICML) (2018)

  25. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)

  26. Schmidt, M., Roux, N.L.: Fast convergence of stochastic gradient descent under a strong growth condition (2013)

  27. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  28. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain, August 1994. http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf

  29. Smith, L.N.: Cyclical learning rates for training neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)

  30. Vandebogert, K.: Method of quadratic interpolation, September 2017. https://people.math.sc.edu/kellerlv/Quadratic_Interpolation.pdf

  31. Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Advances in Neural Information Processing Systems, pp. 3727–3740 (2019)

  32. Wolfe, P.: Convergence conditions for ascent methods. SIAM Rev. 11(2), 226–235 (1969)

  33. Wolfe, P.: Convergence conditions for ascent methods. II: some corrections. SIAM Rev. 13(2), 185–188 (1971)

  34. Zhang, M., Lucas, J., Ba, J., Hinton, G.E.: Lookahead optimizer: K steps forward, 1 step back. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 9597–9608 (2019)

Author information

Corresponding author

Correspondence to Zhiyong Hao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Hao, Z., Jiang, Y., Yu, H., Chiang, H.D. (2021). Adaptive Learning Rate and Momentum for Training Deep Neural Networks. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86523-8_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

  • eBook Packages: Computer Science, Computer Science (R0)
