A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

In this paper, a novel stochastic extra-step quasi-Newton method is developed to solve a class of nonsmooth nonconvex composite optimization problems. We assume that the gradient of the smooth part of the objective function can only be approximated by stochastic oracles. The proposed method combines general stochastic higher-order steps, derived from an underlying proximal-type fixed-point equation, with additional stochastic proximal gradient steps to guarantee convergence. Based on suitable bounds on the step sizes, we establish global convergence to stationary points in expectation, and an extension of the approach using variance reduction techniques is discussed. Motivated by large-scale and big data applications, we investigate a stochastic coordinate-type quasi-Newton scheme that allows us to generate cheap and tractable stochastic higher-order directions. Finally, numerical results on large-scale logistic regression and deep learning problems show that our proposed algorithm compares favorably with other state-of-the-art methods.
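
The following sketch is not taken from the paper; it is a minimal, schematic illustration, with placeholder helpers (stoch_grad, direction, prox) and hypothetical parameter names, of the extra-step structure described above and mirrored by the notation in the appendix: a higher-order trial step along a direction d, followed by an additional stochastic proximal gradient step of the form \({\mathrm {prox}}^{}_{\lambda _+\varphi }(x+\alpha d - \lambda _+v_+)\).

```python
# Schematic sketch only -- not the authors' implementation. The helpers
# stoch_grad(x, rng), direction(x, v, lam) and prox(v, t) are assumed to be
# supplied by the user (e.g., prox could be soft-thresholding for an
# l1-regularizer); alpha, beta, lam, lam_plus are step-size parameters.
def extra_step_iteration(x, stoch_grad, direction, prox,
                         alpha, beta, lam, lam_plus, rng):
    """One extra-step iteration: a higher-order trial step followed by an
    additional stochastic proximal gradient step."""
    v = stoch_grad(x, rng)            # stochastic gradient estimate at x
    d = direction(x, v, lam)          # e.g., a quasi-Newton-type step for the
                                      # proximal fixed-point residual at x
    z = x + beta * d                  # trial point of the higher-order step
    v_plus = stoch_grad(z, rng)       # fresh stochastic gradient estimate at z
    # Extra step: stochastic proximal gradient update anchored at x + alpha*d,
    # matching the expression prox_{lam_plus*phi}(x + alpha*d - lam_plus*v_plus)
    # appearing in the appendix.
    return prox(x + alpha * d - lam_plus * v_plus, lam_plus)
```

In this reading, the additional proximal gradient step is the ingredient that the abstract credits with guaranteeing convergence, regardless of the quality of the higher-order direction d.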

References

  1. Agarwal, N., Bullins, B., Hazan, E.: Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(116), 1–40 (2017)

  2. Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes (2017). http://arxiv.org/abs/1711.04325

  3. Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205 (2017)

  4. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, 699–707 (2016)

  5. Andrew, G., Gao, J.: Scalable training of \(\ell _1\)-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning, 33–40 (2007)

  6. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2011)

  7. Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces. CMS books in mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York (2011)

  8. Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optim. Methods Softw. 35(4), 661–680 (2020). https://doi.org/10.1080/10556788.2020.1725751

  9. Berahas, A.S., Nocedal, J., Takác, M.: A multi-batch L-BFGS method for machine learning. In: Advances in Neural Information Processing Systems, pp. 1063–1071 (2016)

  10. Bishop, C.M.: Pattern recognition and machine learning. Information Science and Statistics. Springer, New York (2006)

  11. Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39, 1–34 (2018)

  12. Botev, A., Ritter, H., Barber, D.: Practical Gauss-Newton optimization for deep learning. In: Proceedings of the 34th International Conference on Machine Learning, 557–565 (2017)

  13. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  14. Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.: On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977–995 (2011)

  15. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  16. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  17. Censor, Y., Gibali, A., Reich, S.: The subgradient extragradient method for solving variational inequalities in Hilbert space. J. Optim. Theory Appl. 148(2), 318–335 (2011)

  18. Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Sparse and low-rank matrix decompositions. In: 27th Annual Allerton Conference on Communication, Control and Computing, 42: 1493–1498 (2009)

  19. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM. Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  20. Chen, X., Qi, L.: A parameterized Newton method and a quasi-Newton method for nonsmooth equations. Comput. Optim. Appl. 3(2), 157–179 (1994)

  21. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, New York (2011)

  22. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (2005)

  23. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-region methods. MPS/SIAM Series on Optimization. SIAM/MPS, Philadelphia (2000)

  24. Davis, D., Drusvyatskiy, D.: Stochastic subgradient method converges at the rate O\((k^{-1/4})\) on weakly convex functions (2018). http://arxiv.org/abs/1802.02988

  25. Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019). https://doi.org/10.1137/18M1178244

  26. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 1646–1654 (2014)

  27. Deng, L., Yu, D.: Deep learning: methods and applications. Found. Trends Signal Process. 7, 197–387 (2014)

  28. Dong, Y.: An extension of Luque’s growth condition. Appl. Math. Lett. 22(9), 1390–1393 (2009)

  29. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)

  30. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  31. Durrett, R.: Probability: theory and examples, vol. 49. Cambridge University Press, Cambridge (2019)

  32. Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  33. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  34. Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687–697 (2018)

  35. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  36. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

  37. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  38. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: Squeezing more curvature out of data. In: Proceedings of the 33rd International Conference on Machine Learning, 1869–1878 (2016)

  39. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour (2017). http://arxiv.org/abs/1706.02677

  40. Grosse, R., Martens, J.: A Kronecker-factored approximate Fisher matrix for convolution layers. In: Proceedings of the 33rd International Conference on Machine Learning, 573–582 (2016)

  41. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics. Springer-Verlag, New York (2001)

  42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  43. Hsieh, C.J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.: QUIC: quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res. 15(1), 2911–2947 (2014)

  44. Iusem, A.N., Jofré, A., Oliveira, R.I., Thompson, P.: Extragradient method with variance reduction for stochastic variational inequalities. SIAM J. Optim. 27(2), 686–724 (2017)

  45. Janka, D., Kirches, C., Sager, S., Wächter, A.: An SR1/BFGS SQP algorithm for nonconvex nonlinear programs with block-diagonal Hessian matrix. Math. Program. Comput. 8(4), 435–459 (2016)

  46. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. in Neural Inf. Process. Syst. 26, 315–323 (2013)

  47. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980 (2014)

  48. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: Proceedings of the 34th International Conference on Machine Learning, 70: 1895–1904 (2017)

  49. Konečnỳ, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Topics in Signal Process. 10(2), 242–255 (2016)

  50. Korpelevich, G.: The extragradient method for finding saddle points and other problems. Matecon 12, 747–756 (1976)

  51. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  52. Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)

  53. Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2345–2355 (2017)

  54. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)

  55. Lin, T., Ma, S., Zhang, S.: An extragradient-based alternating direction method for convex minimization. Found. Comput. Math. 17(1), 35–59 (2017)

  56. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(3), 503–528 (1989)

  57. Liu, H., So, A.M.C., Wu, W.: Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Math. Program. 178, 215–262 (2018)

  58. Liu, X., Hsieh, C.J.: Fast variance reduction method with stochastic batch size. In: Proceedings of the 35th International Conference on Machine Learning, 3185–3194 (2018)

  59. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1–4), 157–178 (1993)

  60. LIBLINEAR: A library for large linear classification. http://www.csie.ntu.edu.tw/~cjlin/liblinear

  61. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: Proceedings of the 26th International Conference on Machine Learning, 689–696 (2009)

  62. Mannel, F., Rund, A.: A hybrid semismooth quasi-Newton method for structured nonsmooth operator equations in Banach spaces (2018). https://imsc.uni-graz.at/mannel/sqn1.pdf

  63. Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning, 27: 735–742 (2010)

  64. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 2408–2417 (2015)

  65. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent in function space. In: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 512–518 (1999)

  66. Milzarek, A., Xiao, X., Cen, S., Wen, Z., Ulbrich, M.: A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J. Optim. 29(4), 2916–2948 (2019)

  67. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)

  68. Mokhtari, A., Ribeiro, A.: RES: regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 62(23), 6089–6104 (2014)

  69. Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16, 3151–3181 (2015)

  70. Monteiro, R.D., Svaiter, B.F.: Complexity of variants of Tseng’s modified FB splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM J. Optim. 21(4), 1688–1720 (2011)

  71. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. FR. 93, 273–299 (1965)

  72. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Proceedings of the 19th Conference on Artificial Intelligence and Statistics, pp. 249–258 (2016)

  73. Mutný, M.: Stochastic second-order optimization via Neumann series (2016). http://arxiv.org/abs/1612.04694

  74. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  75. Nguyen, L.M., van Dijk, M., Phan, D.T., Nguyen, P.H., Weng, T.W., Kalagnanam, J.R.: Finite-sum smooth optimization with SARAH (2019). http://arxiv.org/abs/1901.07648v2

  76. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, 2613–2621 (2017)

  77. Nguyen, T.P., Pauwels, E., Richard, E., Suter, B.W.: Extragradient method in optimization: convergence and complexity. J. Optim. Theory Appl. 176(1), 137–162 (2018)

  78. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comp. 35(151), 773–782 (1980)

  79. Osawa, K., Tsuji, Y., Ueno, Y., Naruse, A., Yokota, R., Matsuoka, S.: Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks (2018). http://arxiv.org/abs/1811.12019

  80. Pang, J.S., Qi, L.: Nonsmooth equations: motivation and algorithms. SIAM J. Optim. 3(3), 443–465 (1993)

  81. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

  82. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (2017)

  83. Patrinos, P., Stella, L., Bemporad, A.: Forward-backward truncated Newton methods for convex composite optimization (2014). http://arxiv.org/abs/1402.6655

  84. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)

  85. Pilanci, M., Wainwright, M.J.: Newton sketch: a near linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optim. 27(1), 205–245 (2017)

  86. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Proceedings of the 35th International Conference on Machine Learning, 80: 4124–4132 (2018)

  87. Qi, L.: Convergence analysis of some algorithms for solving nonsmooth equations. Math. Oper. Res. 18(1), 227–244 (1993)

  88. Qi, L.: On superlinear convergence of quasi-Newton methods for nonsmooth equations. Oper. Res. Lett. 20(5), 223–228 (1997)

  89. Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58(3), 353–367 (1993)

  90. Reddi, S.J., Hefny, A., Sra, S., Póczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, 314–323 (2016)

  91. Reddi, S.J., Sra, S., Póczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, 1145–1153 (2016)

  92. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  93. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Academic Press (1971)

  94. Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: Proceedings of the 33rd International Conference on Machine Learning, 2597–2605 (2016)

  95. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 76, 1–34 (2018)

  96. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)

  97. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  98. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 436–443 (2007)

  99. Shalev-Shwartz, S., Ben-David, S.: Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge (2014)

  100. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  101. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  102. Shalev-Shwartz, S., Zhang, T.: Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155(1–2), 105–145 (2016)

  103. Shi, J., Yin, W., Osher, S., Sajda, P.: A fast hybrid algorithm for large-scale \(\ell _1\)-regularized logistic regression. J. Mach. Learn. Res. 11, 713–741 (2010)

  104. Shi, Z., Liu, R.: Large scale optimization with proximal stochastic Newton-type gradient descent. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 691–704. Springer, Cham (2015)

  105. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. http://arxiv.org/abs/1409.1556 (2014)

  106. Stella, L., Themelis, A., Patrinos, P.: Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)

  107. Sun, D., Han, J.: Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J. Optim. 7(2), 463–480 (1997)

  108. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th International Conference on Machine Learning, 1139–1147 (2013)

  109. Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)

  110. Vapnik, V.: The nature of statistical learning theory. Springer Science and Business Media, New York (2013)

  111. Wang, J., Zhang, T.: Utilizing second order information in minibatch stochastic variance reduced proximal iterations. J. Mach. Learn. Res. 20(42), 1–56 (2019)

  112. Wang, X., Ma, C., Li, M.: A globally and superlinearly convergent quasi-Newton method for general box constrained variational inequalities without smoothing approximation. J. Global Optim. 50(4), 675–694 (2011)

  113. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)

  114. Wang, X., Yuan, Y.X.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34, 922–948 (2019)

  115. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster stochastic variance reduction algorithms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)

  116. Wen, Z., Yin, W., Goldfarb, D., Zhang, Y.: A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM J. Sci. Comput. 32(4), 1832–1857 (2010)

  117. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  118. Xiao, X., Li, Y., Wen, Z., Zhang, L.: A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput. 76(1), 364–389 (2018)

  119. Xu, P., Roosta, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program. 184, 35–70 (2019)

  120. Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3008–3016 (2016)

  121. Xu, Y., Yin, W.: Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J. Optim. 25(3), 1686–1716 (2015)

  122. Ye, H., Luo, L., Zhang, Z.: Approximate Newton methods and their local convergence. In: Proceedings of the 34th International Conference on Machine Learning, 70: 3931–3939 (2017)

  123. You, Y., Zhang, Z., Hsieh, C.J., Demmel, J., Keutzer, K.: ImageNet training in minutes. In: Proceedings of the 47th International Conference on Parallel Processing, 1–10 (2018)

  124. Yuan, G.X., Ho, C.H., Lin, C.J.: An improved GLMNET for \(\ell _1\)-regularized logistic regression. J. Mach. Learn. Res. 13, 1999–2030 (2012)

  125. Zhang, H., Reddi, S.J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, 4592–4600 (2016)

  126. Zhao, R., Haskell, W.B., Tan, V.Y.: Stochastic L-BFGS: improved convergence rates and practical acceleration strategies. IEEE Trans. Signal Process 66, 1155–1169 (2017)

  127. Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res. 21, 1–63 (2018)

Acknowledgements

The authors are grateful to the associate editor and two anonymous referees for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Andre Milzarek.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

M. Yang and Z. Wen are partly supported by the Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008), the NSFC grant 11831002, and the Beijing Academy of Artificial Intelligence.

A. Milzarek is partly supported by the Fundamental Research Fund – Shenzhen Research Institute of Big Data (SRIBD) Startup Fund JCYJ-AM20190601 and by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS).

9 Appendix: proofs of auxiliary results

9.1 Proof of Lemma 3.2

Proof

As in [25, Lemma 3.2], applying the definition and the characterization (2.3) of the proximal operator, it follows that

$$\begin{aligned} {\bar{x}} = {\mathrm {prox}}^{}_{\theta \psi }(x) \,&\iff \, {\bar{x}} \in x - \theta [\nabla f({\bar{x}}) + \partial \varphi ({\bar{x}})] \\&\iff \, {\bar{x}}\in {\bar{x}} - \lambda _{+}[ \nabla f({\bar{x}}) + \theta ^{-1} ({\bar{x}} - x)] - \lambda _{+} \partial \varphi ({\bar{x}}) \\&\iff \, {\bar{x}} = {\mathrm {prox}}^{}_{\lambda _{+}\varphi }({\bar{x}} - \lambda _{+}\nabla f({\bar{x}}) - \lambda _{+}\theta ^{-1}[{\bar{x}} - x]). \end{aligned}$$

Now, setting \(z := x + \beta d\) and \(q := \alpha d + \lambda _{+}(\nabla f(x) - \nabla f(z))\), we have \(p = \lambda _{+}[\nabla f({\bar{x}}) - \nabla f(x)] + q\) and \(\Vert q\Vert \le (\alpha + L_f \beta \lambda _{+}) \Vert d\Vert \). Furthermore, using Young’s inequality, the nonexpansiveness of the proximal operator, and the Lipschitz continuity of \(\nabla f\), we obtain

$$\begin{aligned} \Vert p_+ - {\bar{x}} \Vert ^2&=\Vert {\mathrm {prox}}^{}_{\lambda _{+}\varphi }({\bar{x}} - \lambda _{+}\nabla f({\bar{x}}) - \lambda _{+}\theta ^{-1}[{\bar{x}} - x])- {\mathrm {prox}}^{}_{\lambda _+\varphi }(x+\alpha d - \lambda _+v_+)\Vert ^2\\&\le \Vert (1 - \lambda _{+}\theta ^{-1})[{\bar{x}} - x] - p + \lambda _{+}(v_+ - \nabla f(z))\Vert ^2 \\&= \Vert (1- \lambda _{+}\theta ^{-1})[{\bar{x}} - x] - \lambda _{+} [\nabla f({\bar{x}}) - \nabla f(x)]\Vert ^2 \\&\quad - 2\langle (1-\lambda _{+}\theta ^{-1})[{\bar{x}} - x], q \rangle - 2 \lambda _{+} \langle \nabla f({\bar{x}}) - \nabla f(x), q \rangle + \Vert q\Vert ^2 \\&\quad + 2 \lambda _{+} \langle (1-\lambda _{+}\theta ^{-1})[{\bar{x}} - x] -p, v_+ - \nabla f(z) \rangle + \lambda _{+}^2 \Vert \nabla f(z) - v_+\Vert ^2 \\&\le \left[ (1+\rho _1)(1 - {\lambda _{+}}{\theta }^{-1})^2 + 2\lambda _{+}(1 - {\lambda _{+}}{\theta }^{-1})L_f +(1+\rho _2)L_f^2\lambda _{+}^2 \right] \Vert {\bar{x}} - x\Vert ^2 \\&\quad + \left[ 1+\frac{1}{\rho _{1}}+\frac{1}{\rho _{2}} \right] \mu ^2\Vert d\Vert ^2+ \lambda _{+}^2 \Vert \nabla f(z) - v_+\Vert ^2 \\&\quad +2 \lambda _{+} \langle (1-\lambda _+\theta ^{-1})[{\bar{x}} - x] -p, v_+ - \nabla f(z) \rangle , \end{aligned}$$

where \(\mu = \alpha + L_f \beta \lambda _+\). This establishes the statement in Lemma 3.2. \(\square \)
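
As a purely illustrative sanity check, not part of the paper, the fixed-point characterization in the first display of this proof can be verified numerically in one dimension for \(f(u) = \tfrac{a}{2}u^2\) and \(\varphi = |\cdot |\), where both proximal operators have closed forms via soft-thresholding; all parameter values below are arbitrary.

```python
# One-dimensional numerical check of the equivalence
#   xbar = prox_{theta*psi}(x)
#     <=> xbar = prox_{lam_plus*phi}(xbar - lam_plus*f'(xbar) - (lam_plus/theta)*(xbar - x))
# for psi = f + phi with f(u) = 0.5*a*u**2 and phi(u) = |u|. Illustration only.
import numpy as np

def soft(v, t):
    """Proximal operator of t*|.| (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

a, theta, lam_plus, x = 2.0, 0.7, 0.3, 1.9

# xbar = prox_{theta*psi}(x), obtained from the optimality condition
#   0 in a*u + (u - x)/theta + d|u|(u).
xbar = soft(x / theta, 1.0) / (a + 1.0 / theta)

# Right-hand side of the last equivalence in the display above.
rhs = soft(xbar - lam_plus * (a * xbar) - (lam_plus / theta) * (xbar - x), lam_plus)
print(np.isclose(xbar, rhs))   # expected output: True
```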

9.2 Proof of Corollary 4.1

Proof

We need to verify that the choice of \(\lambda \) and \(\lambda _+\) satisfies the constraints derived in Theorem 4.1 and in (4.4), respectively. Notice that we can set \(\bar{\rho }= 0\) and that we can work with \((\lambda _+^m)^{-1}\) instead of \((2\lambda _+^m)^{-1}\) in (4.4). Due to the definition of \(b\) and \(b_+\), we have \(\tau _m = \sqrt{2}\) for all \(m\), and thus it follows that

$$\begin{aligned} L_f - \frac{1}{\lambda _+} + \frac{K(K-1)}{2} \theta (\lambda _+)&\le L - \frac{1}{\lambda _+} + \frac{L^2 K^2}{2} \left[ \frac{1}{b_+} + \frac{1}{b} \right] \lambda _+ = 0. \end{aligned}$$

Furthermore, with the choice \(\lambda _+ = \gamma L^{-1}\) it holds that

$$\begin{aligned} {\mathcal {L}}_k^m&= \left[ \frac{(\alpha _k^m)^2}{\lambda _+} + 2L_f \alpha _k^m \beta _k^m + L_f^2 (\beta ^m_k)^2 \lambda _+ + \frac{L^2(\beta _k^m)^2}{K} \lambda _+ \right] (\nu ^m_k)^2 + \frac{1}{\lambda _+} \\&\le \left[ \frac{1}{\gamma } + 2 + \gamma + \frac{\gamma }{K}\right] L \bar{\nu }^2 + \frac{L}{\gamma } \le \frac{L}{\gamma }(1+3\bar{\nu }^2) \end{aligned}$$

and we have \({\mathcal {L}}^m_k \le \frac{1}{\lambda }\) for all k and m. Hence, the estimate (4.7) follows from (4.5).

Since each iteration in the inner loop of Algorithm 2 requires \(b+b_+\) gradient component evaluations (IFO calls), the total number of evaluations in a single outer iteration is given by \(N + K(b + b_+)\), and the corresponding total number of gradient component evaluations after \(M\) outer iterations is

$$\begin{aligned} MK \cdot \left[ \frac{N}{K} + b+b_+ \right] = MK \cdot \left[ \frac{N}{K} + 2K^2 \right] . \end{aligned}$$

Moreover, since the bound for \(\mathbb {E}[\Vert F^1(\mathsf{X})\Vert ^2]\) in (4.7) is proportional to \((MK)^{-1}\), the IFO complexity of Algorithm 2 for reaching an \(\varepsilon \)-accurate stationary point with \(\mathbb {E}[\Vert F^1(\mathsf{X})\Vert ^2] \le \varepsilon \) is \({\mathcal {O}}((N/K + 2K^2)/\varepsilon )\). Minimizing this expression with respect to K yields \(K \sim N^{1/3}\) and the IFO complexity \({\mathcal {O}}(N^{2/3}/\varepsilon )\). \(\square \)
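
For completeness, the minimization over K in the last step can be made explicit; the following short calculation is not spelled out in the paper and treats K as a continuous variable:

$$\begin{aligned} \frac{d}{dK}\left( \frac{N}{K} + 2K^2 \right) = -\frac{N}{K^2} + 4K = 0 \; \iff \; K = \left( \frac{N}{4}\right) ^{1/3}, \end{aligned}$$

and at this value both terms are of order \(N^{2/3}\), namely \(N/K = 4^{1/3}N^{2/3}\) and \(2K^2 = 2^{-1/3}N^{2/3}\), which yields the stated IFO complexity \({\mathcal {O}}(N^{2/3}/\varepsilon )\).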

About this article

Cite this article

Yang, M., Milzarek, A., Wen, Z. et al. A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Program. 194, 257–303 (2022). https://doi.org/10.1007/s10107-021-01629-y
