A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

In this paper, a novel stochastic extra-step quasi-Newton method is developed to solve a class of nonsmooth nonconvex composite optimization problems. We assume that the gradient of the smooth part of the objective function can only be approximated by stochastic oracles. The proposed method combines general stochastic higher-order steps, derived from an underlying proximal-type fixed-point equation, with additional stochastic proximal gradient steps to guarantee convergence. Based on suitable bounds on the step sizes, we establish global convergence to stationary points in expectation, and an extension of the approach using variance reduction techniques is discussed. Motivated by large-scale and big data applications, we investigate a stochastic coordinate-type quasi-Newton scheme that allows us to generate cheap and tractable stochastic higher-order directions. Finally, numerical results on large-scale logistic regression and deep learning problems show that our proposed algorithm compares favorably with other state-of-the-art methods.
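
The following sketch is not taken from the paper; it is a minimal, schematic illustration, with placeholder helpers (stoch_grad, direction, prox) and hypothetical parameter names, of the extra-step structure described above and mirrored by the notation in the appendix: a higher-order trial step along a direction d, followed by an additional stochastic proximal gradient step of the form \({\mathrm {prox}}^{}_{\lambda _+\varphi }(x+\alpha d - \lambda _+v_+)\).

```python
# Schematic sketch only -- not the authors' implementation. The helpers
# stoch_grad(x, rng), direction(x, v, lam) and prox(v, t) are assumed to be
# supplied by the user (e.g., prox could be soft-thresholding for an
# l1-regularizer); alpha, beta, lam, lam_plus are step-size parameters.
def extra_step_iteration(x, stoch_grad, direction, prox,
                         alpha, beta, lam, lam_plus, rng):
    """One extra-step iteration: a higher-order trial step followed by an
    additional stochastic proximal gradient step."""
    v = stoch_grad(x, rng)            # stochastic gradient estimate at x
    d = direction(x, v, lam)          # e.g., a quasi-Newton-type step for the
                                      # proximal fixed-point residual at x
    z = x + beta * d                  # trial point of the higher-order step
    v_plus = stoch_grad(z, rng)       # fresh stochastic gradient estimate at z
    # Extra step: stochastic proximal gradient update anchored at x + alpha*d,
    # matching the expression prox_{lam_plus*phi}(x + alpha*d - lam_plus*v_plus)
    # appearing in the appendix.
    return prox(x + alpha * d - lam_plus * v_plus, lam_plus)
```

In this reading, the additional proximal gradient step is the ingredient that the abstract credits with guaranteeing convergence, regardless of the quality of the higher-order direction d.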

References

  1. Agarwal, N., Bullins, B., Hazan, E.: Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(116), 1–40 (2017)

  2. Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes (2017). http://arxiv.org/abs/1711.04325

  3. Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205 (2017)

  4. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, 699–707 (2016)

  5. Andrew, G., Gao, J.: Scalable training of \(\ell _1\)-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning, 33–40 (2007)

  6. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2011)

  7. Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces. CMS books in mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York (2011)

  8. Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optim. Methods Softw. 35(4), 661–680 (2020). https://doi.org/10.1080/10556788.2020.1725751

  9. Berahas, A.S., Nocedal, J., Takác, M.: A multi-batch L-BFGS method for machine learning. In: Advances in Neural Information Processing Systems, pp. 1063–1071 (2016)

  10. Bishop, C.M.: Pattern recognition and machine learning. Information Science and Statistics. Springer, New York (2006)

  11. Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39, 1–34 (2018)

  12. Botev, A., Ritter, H., Barber, D.: Practical Gauss-Newton optimization for deep learning. In: Proceedings of the 34th International Conference on Machine Learning, 557–565 (2017)

  13. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  14. Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.: On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977–995 (2011)

  15. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  16. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  17. Censor, Y., Gibali, A., Reich, S.: The subgradient extragradient method for solving variational inequalities in Hilbert space. J. Optim. Theory Appl. 148(2), 318–335 (2011)

  18. Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Sparse and low-rank matrix decompositions. In: 27th Annual Allerton Conference on Communication, Control and Computing, 42: 1493–1498 (2009)

  19. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM. Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  20. Chen, X., Qi, L.: A parameterized Newton method and a quasi-Newton method for nonsmooth equations. Comput. Optim. Appl. 3(2), 157–179 (1994)

  21. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, New York (2011)

  22. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (2005)

  23. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-region methods. MPS/SIAM Series on Optimization. SIAM/MPS, Philadelphia (2000)

  24. Davis, D., Drusvyatskiy, D.: Stochastic subgradient method converges at the rate O\((k^{-1/4})\) on weakly convex functions (2018). http://arxiv.org/abs/1802.02988

  25. Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019). https://doi.org/10.1137/18M1178244

  26. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 1646–1654 (2014)

  27. Deng, L., Yu, D.: Deep learning: methods and applications. Found. Trends Signal Process. 7, 197–387 (2014)

  28. Dong, Y.: An extension of Luque’s growth condition. Appl. Math. Lett. 22(9), 1390–1393 (2009)

  29. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)

  30. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  31. Durrett, R.: Probability: theory and examples, vol. 49. Cambridge University Press, Cambridge (2019)

  32. Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  33. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  34. Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687–697 (2018)

  35. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  36. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

  37. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  38. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: Squeezing more curvature out of data. In: Proceedings of the 33rd International Conference on Machine Learning, 1869–1878 (2016)

  39. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour (2017). http://arxiv.org/abs/1706.02677

  40. Grosse, R., Martens, J.: A Kronecker-factored approximate Fisher matrix for convolution layers. In: Proceedings of the 33rd International Conference on Machine Learning, 573–582 (2016)

  41. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics. Springer-Verlag, New York (2001)

  42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  43. Hsieh, C.J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.: QUIC: quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res. 15(1), 2911–2947 (2014)

  44. Iusem, A.N., Jofré, A., Oliveira, R.I., Thompson, P.: Extragradient method with variance reduction for stochastic variational inequalities. SIAM J. Optim. 27(2), 686–724 (2017)

  45. Janka, D., Kirches, C., Sager, S., Wächter, A.: An SR1/BFGS SQP algorithm for nonconvex nonlinear programs with block-diagonal Hessian matrix. Math. Program. Comput. 8(4), 435–459 (2016)

  46. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. in Neural Inf. Process. Syst. 26, 315–323 (2013)

  47. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980 (2014)

  48. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: Proceedings of the 34th International Conference on Machine Learning, 70: 1895–1904 (2017)

  49. Konečnỳ, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Topics in Signal Process. 10(2), 242–255 (2016)

  50. Korpelevich, G.: The extragradient method for finding saddle points and other problems. Matecon 12, 747–756 (1976)

  51. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  52. Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)

  53. Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2345–2355 (2017)

  54. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)

  55. Lin, T., Ma, S., Zhang, S.: An extragradient-based alternating direction method for convex minimization. Found. Comput. Math. 17(1), 35–59 (2017)

  56. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(3), 503–528 (1989)

  57. Liu, H., So, A.M.C., Wu, W.: Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Math. Program. 178, 215–262 (2018)

  58. Liu, X., Hsieh, C.J.: Fast variance reduction method with stochastic batch size. In: Proceedings of the 35th International Conference on Machine Learning, 3185–3194 (2018)

  59. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1–4), 157–178 (1993)

  60. LIBLINEAR: A library for large linear classification. http://www.csie.ntu.edu.tw/~cjlin/liblinear

  61. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: Proceedings of the 26th International Conference on Machine Learning, 689–696 (2009)

  62. Mannel, F., Rund, A.: A hybrid semismooth quasi-Newton method for structured nonsmooth operator equations in Banach spaces (2018). https://imsc.uni-graz.at/mannel/sqn1.pdf

  63. Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning, 27: 735–742 (2010)

  64. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 2408–2417 (2015)

  65. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent in function space. In: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 512–518 (1999)

  66. Milzarek, A., Xiao, X., Cen, S., Wen, Z., Ulbrich, M.: A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J. Optim. 29(4), 2916–2948 (2019)

  67. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)

  68. Mokhtari, A., Ribeiro, A.: RES: regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 62(23), 6089–6104 (2014)

  69. Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16, 3151–3181 (2015)

  70. Monteiro, R.D., Svaiter, B.F.: Complexity of variants of Tseng’s modified FB splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM J. Optim. 21(4), 1688–1720 (2011)

  71. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. FR. 93, 273–299 (1965)

  72. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Proceedings of the 19th Conference on Artificial Intelligence and Statistics, pp. 249–258 (2016)

  73. Mutný, M.: Stochastic second-order optimization via Neumann series (2016). http://arxiv.org/abs/1612.04694

  74. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  75. Nguyen, L.M., van Dijk, M., Phan, D.T., Nguyen, P.H., Weng, T.W., Kalagnanam, J.R.: Finite-sum smooth optimization with SARAH (2019). http://arxiv.org/abs/1901.07648v2

  76. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, 2613–2621 (2017)

  77. Nguyen, T.P., Pauwels, E., Richard, E., Suter, B.W.: Extragradient method in optimization: convergence and complexity. J. Optim. Theory Appl. 176(1), 137–162 (2018)

  78. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comp. 35(151), 773–782 (1980)

  79. Osawa, K., Tsuji, Y., Ueno, Y., Naruse, A., Yokota, R., Matsuoka, S.: Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks (2018). http://arxiv.org/abs/1811.12019

  80. Pang, J.S., Qi, L.: Nonsmooth equations: motivation and algorithms. SIAM J. Optim. 3(3), 443–465 (1993)

  81. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

  82. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (2017)

  83. Patrinos, P., Stella, L., Bemporad, A.: Forward-backward truncated Newton methods for convex composite optimization (2014). http://arxiv.org/abs/1402.6655

  84. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)

  85. Pilanci, M., Wainwright, M.J.: Newton sketch: a near linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optim. 27(1), 205–245 (2017)

  86. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Proceedings of the 35th International Conference on Machine Learning, 80: 4124–4132 (2018)

  87. Qi, L.: Convergence analysis of some algorithms for solving nonsmooth equations. Math. Oper. Res. 18(1), 227–244 (1993)

  88. Qi, L.: On superlinear convergence of quasi-Newton methods for nonsmooth equations. Oper. Res. Lett. 20(5), 223–228 (1997)

  89. Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58(3), 353–367 (1993)

  90. Reddi, S.J., Hefny, A., Sra, S., Póczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, 314–323 (2016)

  91. Reddi, S.J., Sra, S., Póczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, 1145–1153 (2016)

  92. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  93. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Academic Press (1971)

  94. Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: Proceedings of the 33rd International Conference on Machine Learning, 2597–2605 (2016)

  95. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 76, 1–34 (2018)

  96. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)

  97. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  98. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 436–443 (2007)

  99. Shalev-Shwartz, S., Ben-David, S.: Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge (2014)

  100. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  101. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  102. Shalev-Shwartz, S., Zhang, T.: Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155(1–2), 105–145 (2016)

  103. Shi, J., Yin, W., Osher, S., Sajda, P.: A fast hybrid algorithm for large-scale \(\ell _1\)-regularized logistic regression. J. Mach. Learn. Res. 11, 713–741 (2010)

  104. Shi, Z., Liu, R.: Large scale optimization with proximal stochastic Newton-type gradient descent. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 691–704. Springer, Cham (2015)

  105. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. http://arxiv.org/abs/1409.1556 (2014)

  106. Stella, L., Themelis, A., Patrinos, P.: Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)

  107. Sun, D., Han, J.: Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J. Optim. 7(2), 463–480 (1997)

  108. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th International Conference on Machine Learning, 1139–1147 (2013)

  109. Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)

  110. Vapnik, V.: The nature of statistical learning theory. Springer Science and Business Media, New York (2013)

  111. Wang, J., Zhang, T.: Utilizing second order information in minibatch stochastic variance reduced proximal iterations. J. Mach. Learn. Res. 20(42), 1–56 (2019)

  112. Wang, X., Ma, C., Li, M.: A globally and superlinearly convergent quasi-Newton method for general box constrained variational inequalities without smoothing approximation. J. Global Optim. 50(4), 675–694 (2011)

  113. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)

  114. Wang, X., Yuan, Y.X.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34, 922–948 (2019)

  115. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster stochastic variance reduction algorithms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)

  116. Wen, Z., Yin, W., Goldfarb, D., Zhang, Y.: A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM J. Sci. Comput. 32(4), 1832–1857 (2010)

  117. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  118. Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput. 76(1), 364–389 (2018)

  119. Xu, P., Roosta, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program. 184, 35–70 (2019)

  120. Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3008–3016 (2016)

  121. Xu, Y., Yin, W.: Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J. Optim. 25(3), 1686–1716 (2015)

  122. Ye, H., Luo, L., Zhang, Z.: Approximate Newton methods and their local convergence. In: Proceedings of the 34th International Conference on Machine Learning, 70: 3931–3939 (2017)

  123. You, Y., Zhang, Z., Hsieh, C.J., Demmel, J., Keutzer, K.: ImageNet training in minutes. In: Proceedings of the 47th International Conference on Parallel Processing, 1–10 (2018)

  124. Yuan, G.X., Ho, C.H., Lin, C.J.: An improved GLMNET for \(\ell _1\)-regularized logistic regression. J. Mach. Learn. Res. 13, 1999–2030 (2012)

  125. Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, 4592–4600 (2016)

  126. Zhao, R., Haskell, W.B., Tan, V.Y.: Stochastic L-BFGS: improved convergence rates and practical acceleration strategies. IEEE Trans. Signal Process 66, 1155–1169 (2017)

  127. Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res. 21, 1–63 (2018)

Acknowledgements

The authors are grateful to the associate editor and two anonymous referees for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Andre Milzarek.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

M. Yang and Z. Wen are partly supported by the Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008), the NSFC grant 11831002, and the Beijing Academy of Artificial Intelligence.

A. Milzarek is partly supported by the Fundamental Research Fund – Shenzhen Research Institute of Big Data (SRIBD) Startup Fund JCYJ-AM20190601 and by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS).

9 Appendix: proofs of auxiliary results

9.1 Proof of Lemma 3.2

Proof

As in [25, Lemma 3.2], applying the definition and the characterization (2.3) of the proximal operator, it follows that

$$\begin{aligned} {\bar{x}} = {\mathrm {prox}}^{}_{\theta \psi }(x) \,&\iff \, {\bar{x}} \in x - \theta [\nabla f({\bar{x}}) + \partial \varphi ({\bar{x}})] \\&\iff \, {\bar{x}}\in {\bar{x}} - \lambda _{+}[ \nabla f({\bar{x}}) + \theta ^{-1} ({\bar{x}} - x)] - \lambda _{+} \partial \varphi ({\bar{x}}) \\&\iff \, {\bar{x}} = {\mathrm {prox}}^{}_{\lambda _{+}\varphi }({\bar{x}} - \lambda _{+}\nabla f({\bar{x}}) - \lambda _{+}\theta ^{-1}[{\bar{x}} - x]). \end{aligned}$$

Now, setting \(z := x + \beta d\) and \(q := \alpha d + \lambda _{+}(\nabla f(x) - \nabla f(z))\), we have \(p = \lambda _{+}[\nabla f({\bar{x}}) - \nabla f(x)] + q\) and \(\Vert q\Vert \le (\alpha + L_f \beta \lambda _{+}) \Vert d\Vert \). Furthermore, using Young’s inequality, the nonexpansiveness of the proximal operator, and the Lipschitz continuity of \(\nabla f\), we obtain

$$\begin{aligned} \Vert p_+ - {\bar{x}} \Vert ^2&=\Vert {\mathrm {prox}}^{}_{\lambda _{+}\varphi }({\bar{x}} - \lambda _{+}\nabla f({\bar{x}}) - \lambda _{+}\theta ^{-1}[{\bar{x}} - x])- {\mathrm {prox}}^{}_{\lambda _+\varphi }(x+\alpha d - \lambda _+v_+)\Vert ^2\\&\le \Vert (1 - \lambda _{+}\theta ^{-1})[{\bar{x}} - x] - p + \lambda _{+}(v_+ - \nabla f(z))\Vert ^2 \\&= \Vert (1- \lambda _{+}\theta ^{-1})[{\bar{x}} - x] - \lambda _{+} [\nabla f({\bar{x}}) - \nabla f(x)]\Vert ^2 \\&\quad - 2\langle (1-\lambda _{+}\theta ^{-1})[{\bar{x}} - x], q \rangle - 2 \lambda _{+} \langle \nabla f({\bar{x}}) - \nabla f(x), q \rangle + \Vert q\Vert ^2 \\&\quad + 2 \lambda _{+} \langle (1-\lambda _{+}\theta ^{-1})[{\bar{x}} - x] -p, v_+ - \nabla f(z) \rangle + \lambda _{+}^2 \Vert \nabla f(z) - v_+\Vert ^2 \\&\le \left[ (1+\rho _1)(1 - {\lambda _{+}}{\theta }^{-1})^2 + 2\lambda _{+}(1 - {\lambda _{+}}{\theta }^{-1})L_f +(1+\rho _2)L_f^2\lambda _{+}^2 \right] \Vert {\bar{x}} - x\Vert ^2 \\&\quad + \left[ 1+\frac{1}{\rho _{1}}+\frac{1}{\rho _{2}} \right] \mu ^2\Vert d\Vert ^2+ \lambda _{+}^2 \Vert \nabla f(z) - v_+\Vert ^2 \\&\quad +2 \lambda _{+} \langle (1-\lambda _+\theta ^{-1})[{\bar{x}} - x] -p, v_+ - \nabla f(z) \rangle , \end{aligned}$$

where \(\mu = \alpha + L_f \beta \lambda _+\). This establishes the statement in Lemma 3.2. \(\square \)
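
As a purely illustrative sanity check, not part of the paper, the fixed-point characterization in the first display of this proof can be verified numerically in one dimension for \(f(u) = \tfrac{a}{2}u^2\) and \(\varphi = |\cdot |\), where both proximal operators have closed forms via soft-thresholding; all parameter values below are arbitrary.

```python
# One-dimensional numerical check of the equivalence
#   xbar = prox_{theta*psi}(x)
#     <=> xbar = prox_{lam_plus*phi}(xbar - lam_plus*f'(xbar) - (lam_plus/theta)*(xbar - x))
# for psi = f + phi with f(u) = 0.5*a*u**2 and phi(u) = |u|. Illustration only.
import numpy as np

def soft(v, t):
    """Proximal operator of t*|.| (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

a, theta, lam_plus, x = 2.0, 0.7, 0.3, 1.9

# xbar = prox_{theta*psi}(x), obtained from the optimality condition
#   0 in a*u + (u - x)/theta + d|u|(u).
xbar = soft(x / theta, 1.0) / (a + 1.0 / theta)

# Right-hand side of the last equivalence in the display above.
rhs = soft(xbar - lam_plus * (a * xbar) - (lam_plus / theta) * (xbar - x), lam_plus)
print(np.isclose(xbar, rhs))   # expected output: True
```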

9.2 Proof of Corollary 4.1

Proof

We need to verify that the choice of \(\lambda \) and \(\lambda _+\) satisfies the constraints derived in Theorem 4.1 and in (4.4), respectively. Notice that we can set \(\bar{\rho }= 0\) and that we can work with \((\lambda _+^m)^{-1}\) instead of \((2\lambda _+^m)^{-1}\) in (4.4). Due to the definition of \(b\) and \(b_+\), we have \(\tau _m = \sqrt{2}\) for all \(m\), and thus it follows that

$$\begin{aligned} L_f - \frac{1}{\lambda _+} + \frac{K(K-1)}{2} \theta (\lambda _+)&\le L - \frac{1}{\lambda _+} + \frac{L^2 K^2}{2} \left[ \frac{1}{b_+} + \frac{1}{b} \right] \lambda _+ = 0. \end{aligned}$$

Furthermore, with the choice \(\lambda _+ = \gamma L^{-1}\) it holds that

$$\begin{aligned} {\mathcal {L}}_k^m&= \left[ \frac{(\alpha _k^m)^2}{\lambda _+} + 2L_f \alpha _k^m \beta _k^m + L_f^2 (\beta ^m_k)^2 \lambda _+ + \frac{L^2(\beta _k^m)^2}{K} \lambda _+ \right] (\nu ^m_k)^2 + \frac{1}{\lambda _+} \\&\le \left[ \frac{1}{\gamma } + 2 + \gamma + \frac{\gamma }{K}\right] L \bar{\nu }^2 + \frac{L}{\gamma } \le \frac{L}{\gamma }(1+3\bar{\nu }^2) \end{aligned}$$

and we have \({\mathcal {L}}^m_k \le \frac{1}{\lambda }\) for all k and m. Hence, the estimate (4.7) follows from (4.5).

Since each iteration in the inner loop of Algorithm 2 requires \(b+b_+\) gradient component evaluations (IFO calls), the total number of evaluations in a single outer iteration is given by \(N + K(b + b_+)\), and the corresponding total number of gradient component evaluations after \(M\) outer iterations is

$$\begin{aligned} MK \cdot \left[ \frac{N}{K} + b+b_+ \right] = MK \cdot \left[ \frac{N}{K} + 2K^2 \right] . \end{aligned}$$

Moreover, since the bound for \(\mathbb {E}[\Vert F^1(\mathsf{X})\Vert ^2]\) in (4.7) is proportional to \((MK)^{-1}\), the IFO complexity of Algorithm 2 for reaching an \(\varepsilon \)-accurate stationary point with \(\mathbb {E}[\Vert F^1(\mathsf{X})\Vert ^2] \le \varepsilon \) is \({\mathcal {O}}((N/K + 2K^2)/\varepsilon )\). Minimizing this expression with respect to K yields \(K \sim N^{1/3}\) and the IFO complexity \({\mathcal {O}}(N^{2/3}/\varepsilon )\). \(\square \)
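
For completeness, the minimization over K in the last step can be made explicit; the following short calculation is not spelled out in the paper and treats K as a continuous variable:

$$\begin{aligned} \frac{d}{dK}\left( \frac{N}{K} + 2K^2 \right) = -\frac{N}{K^2} + 4K = 0 \; \iff \; K = \left( \frac{N}{4}\right) ^{1/3}, \end{aligned}$$

and at this value both terms are of order \(N^{2/3}\), namely \(N/K = 4^{1/3}N^{2/3}\) and \(2K^2 = 2^{-1/3}N^{2/3}\), which yields the stated IFO complexity \({\mathcal {O}}(N^{2/3}/\varepsilon )\).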

About this article

Cite this article

Yang, M., Milzarek, A., Wen, Z. et al. A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Program. 194, 257–303 (2022). https://doi.org/10.1007/s10107-021-01629-y
