
A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks

  • Conference paper

Intelligent Computing (SAI 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 507)


Abstract

In this work, we study stochastic quasi-Newton methods for solving the non-linear, non-convex optimization problems arising in the training of deep neural networks. We consider the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) update within a trust-region framework. We give a broad overview of recent improvements in quasi-Newton-based training algorithms, such as careful selection of the initial Hessian approximation, efficient high-accuracy solution of the trust-region subproblem with a direct method, and an overlap sampling strategy that ensures stable quasi-Newton updating by computing gradient differences on the overlap between consecutive batches. We compare the standard L-BFGS method with a variant based on a modified secant condition, which is theoretically shown to approximate the curvature of the Hessian with an increased order of accuracy. In our experiments, both quasi-Newton updates exhibit comparable performance. Our results show that, within a fixed computational time budget, the proposed quasi-Newton methods achieve comparable or better testing accuracy than the state-of-the-art first-order Adam optimizer.
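
To make the ingredients above concrete, the following is a minimal Python sketch written under our own assumptions rather than the authors' implementation: an overlap-based curvature pair in which both gradients are evaluated on the samples shared by two consecutive mini-batches (so that y reflects curvature rather than sampling noise), a modified secant correction of y using function values (one common variant from the literature; the exact modification used in the paper may differ), and the standard L-BFGS two-loop recursion with the usual scaled-identity initial Hessian. The helper grad_fn and the index set overlap_idx are hypothetical, and the paper itself employs the Hessian approximation inside a trust-region subproblem rather than the line-search direction computed here.

    import numpy as np

    def overlap_curvature_pair(grad_fn, w_prev, w_new, overlap_idx):
        # Multi-batch-style curvature pair: both gradients are evaluated on the
        # SAME overlap samples shared by two consecutive mini-batches, so that
        # y measures curvature rather than sampling noise.  grad_fn(w, idx) is a
        # hypothetical helper returning the mini-batch gradient of the training
        # loss at the weights w over the samples indexed by idx.
        s = w_new - w_prev
        y = grad_fn(w_new, overlap_idx) - grad_fn(w_prev, overlap_idx)
        return s, y

    def modified_y(s, y, f_prev, f_new, g_prev, g_new):
        # One modified secant condition proposed in the literature: y is
        # corrected with function-value information to capture curvature more
        # accurately.  Illustrative only; the exact form used in the paper may differ.
        theta = 2.0 * (f_prev - f_new) + np.dot(g_prev + g_new, s)
        return y + (theta / np.dot(s, s)) * s

    def lbfgs_direction(g, S, Y):
        # Standard L-BFGS two-loop recursion returning -H_k g, with the common
        # initialization H_0 = (s^T y / y^T y) I built from the most recent pair.
        # S and Y hold the stored curvature pairs, oldest first.  (The paper
        # works with the Hessian approximation B_k inside a trust region
        # instead of the line-search direction computed here.)
        if not S:
            return -g
        q = g.copy()
        rhos = [1.0 / np.dot(y, s) for s, y in zip(S, Y)]
        alphas = []
        for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):
            a = rho * np.dot(s, q)
            alphas.append(a)
            q = q - a * y
        gamma = np.dot(S[-1], Y[-1]) / np.dot(Y[-1], Y[-1])  # initial scaling
        r = gamma * q
        for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
            b = rho * np.dot(y, r)
            r = r + (a - b) * s
        return -r

In a full training loop one would keep only a bounded history of (s, y) pairs, discard pairs violating the curvature condition s^T y > 0, and embed the resulting quasi-Newton model in the trust-region acceptance test.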


Notes

  1. Z-score normalization produces a dataset whose mean and standard deviation are zero and one, respectively.
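
     As a minimal illustration (our own sketch, not code from the paper), per-feature z-score normalization of a data matrix X with samples in rows can be written as:

         import numpy as np

         def z_score(X, eps=1e-12):
             # Normalize each feature (column) of X to zero mean and unit standard
             # deviation; eps guards against division by zero for constant features.
             mu = X.mean(axis=0)
             sigma = X.std(axis=0)
             return (X - mu) / (sigma + eps)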


Author information


Corresponding author

Correspondence to Ángeles Martínez Calomardo.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yousefi, M., Martínez Calomardo, Á. (2022). A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 507. Springer, Cham. https://doi.org/10.1007/978-3-031-10464-0_2
