Adaptive Learning Rate and Momentum for Training Deep Neural Networks

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12977)

Abstract

Recent progress in deep learning relies heavily on the quality and efficiency of training algorithms. In this paper, we develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework: the Conjugate Gradient with Quadratic line-search (CGQ) method. On the one hand, a quadratic line-search determines the step size according to the current loss landscape. On the other hand, the momentum factor is dynamically updated by computing the conjugate gradient parameter (as in Polak-Ribière). Theoretical results ensuring the convergence of our method in strongly convex settings are developed, and experiments on image classification datasets show that our method converges faster than other local solvers and generalizes better (higher test-set accuracy). A major advantage of the proposed method is that it avoids tedious hand-tuning of hyperparameters such as the learning rate and momentum.
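The abstract names two adaptive ingredients: a quadratic (parabolic) line search that picks the step size from a few loss evaluations along the current search direction, and a Polak-Ribière-style conjugate gradient coefficient that takes the place of a hand-tuned momentum factor. The sketch below illustrates that general recipe on a toy strongly convex problem; it is not the authors' CGQ implementation, and the function names (quadratic_step_size, cgq_like_optimize), the probe points, and the fallback rules are illustrative assumptions.

```python
# Minimal sketch of a CG-style update with a parabolic line search and a
# Polak-Ribiere (PR+) momentum coefficient, on a toy strongly convex quadratic.
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite -> strongly convex
b = np.array([1.0, -2.0])

def loss(w):
    # Toy quadratic objective standing in for a training loss.
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

def quadratic_step_size(w, d, t=1.0):
    # Fit a parabola to the loss along direction d through steps 0, t/2, t
    # and return the parabola's minimizer as the step size.
    f0, f1, f2 = loss(w), loss(w + 0.5 * t * d), loss(w + t * d)
    curvature = f0 - 2.0 * f1 + f2          # > 0 iff the fitted parabola opens upward
    if curvature <= 1e-12:                  # flat or concave fit: fall back to the probe step
        return t
    return t * (3.0 * f0 - 4.0 * f1 + f2) / (4.0 * curvature)

def cgq_like_optimize(w, steps=50):
    g_prev = grad(w)
    d = -g_prev                             # first direction: steepest descent
    for _ in range(steps):
        alpha = quadratic_step_size(w, d)   # adaptive "learning rate"
        w = w + alpha * d
        g = grad(w)
        # Polak-Ribiere coefficient, clipped at zero (PR+): adaptive "momentum".
        beta = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev + 1e-12))
        d = -g + beta * d                   # conjugate-gradient-style direction
        g_prev = g
    return w

print(cgq_like_optimize(np.zeros(2)))       # approaches the exact minimizer below
print(np.linalg.solve(A, b))
```

In a neural-network setting the loss and gradient would come from a mini-batch and the directions would live in parameter space, but the shape of the update is the same: fit a parabola along the direction to choose the step size, then blend the new negative gradient with the previous direction using the Polak-Ribière coefficient.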

References

  1. Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J. Math. 16(1), 1–3 (1966)

  2. Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: ICLR (2018)

  3. Berrada, L., Zisserman, A., Kumar, M.P.: Training neural networks for and by interpolation. In: International Conference on Machine Learning (2020)

  4. Bhaya, A., Kaszkurewicz, E.: Steepest descent with momentum for quadratic functions is a version of the conjugate gradient method. Neural Netw. 17, 65–71 (2004). https://doi.org/10.1016/S0893-6080(03)00170-9

  5. Dai, Y.H., Yuan, Y.X.: A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 10(1), 177–182 (1999)

  6. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654. Curran Associates, Inc. (2014)

  7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(61), 2121–2159 (2011)

  8. Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. Comput. J. 7(2), 149–154 (1964). https://doi.org/10.1093/comjnl/7.2.149

  9. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952)

  10. Jin, X.B., Zhang, X.Y., Huang, K., Geng, G.G.: Stochastic conjugate gradient algorithm with variance reduction. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1360–1369 (2019). https://doi.org/10.1109/TNNLS.2018.2868835

  11. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates, Inc. (2013)

  12. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

  13. Kobayashi, Y., Iiduka, H.: Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning (2020). http://arxiv.org/abs/2003.00231

  14. Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: ICML (2011)

  15. Loizou, N., Vaswani, S., Laradji, I., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–33 (2021)

  16. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv:1711.05101 (2017)

  17. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J. 7(2), 155–162 (1964)

  18. Mutschler, M., Zell, A.: Parabolic approximation line search for DNNs. In: NeurIPS (2020)

  19. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math. Dokl. 27, 372–376 (1983)

  20. Orabona, F., Pal, D.: Coin betting and parameter-free online learning. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 577–585. Curran Associates, Inc. (2016)

  21. Pesme, S., Dieuleveut, A., Flammarion, N.: On convergence-diagnostic based step sizes for stochastic gradient descent. In: Proceedings of the International Conference on Machine Learning (ICML 2020) (2020)

  22. Polak, E., Ribière, G.: Note sur la convergence de méthodes de directions conjuguées. ESAIM: Math. Model. Numer. Anal. 3(R1), 35–43 (1969)

  23. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  24. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: The 35th International Conference on Machine Learning (ICML) (2018)

  25. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)

  26. Schmidt, M., Roux, N.L.: Fast convergence of stochastic gradient descent under a strong growth condition (2013)

  27. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  28. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain, August 1994. http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf

  29. Smith, L.N.: Cyclical learning rates for training neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)

  30. Vandebogert, K.: Method of quadratic interpolation, September 2017. https://people.math.sc.edu/kellerlv/Quadratic_Interpolation.pdf

  31. Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Advances in Neural Information Processing Systems, pp. 3727–3740 (2019)

  32. Wolfe, P.: Convergence conditions for ascent methods. SIAM Rev. 11(2), 226–235 (1969)

  33. Wolfe, P.: Convergence conditions for ascent methods. II: some corrections. SIAM Rev. 13(2), 185–188 (1971)

  34. Zhang, M., Lucas, J., Ba, J., Hinton, G.E.: Lookahead optimizer: K steps forward, 1 step back. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 9597–9608 (2019)

Author information

Corresponding author

Correspondence to Zhiyong Hao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Hao, Z., Jiang, Y., Yu, H., Chiang, H.D. (2021). Adaptive Learning Rate and Momentum for Training Deep Neural Networks. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86523-8_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

  • eBook Packages: Computer Science, Computer Science (R0)
