Abstract
In this work, we present probabilistic local convergence results for a stochastic semismooth Newton method for a class of stochastic composite optimization problems whose objective is the sum of a smooth nonconvex term and a nonsmooth convex term. We assume that gradient and Hessian information of the smooth part of the objective function can only be approximated via calls to stochastic first- and second-order oracles. The approach combines stochastic semismooth Newton steps, stochastic proximal gradient steps, and a globalization strategy based on growth conditions. We present tail bounds and matrix concentration inequalities for the stochastic oracles that can be used to control the approximation errors by appropriately adjusting or increasing the sampling rates. Under standard local assumptions, we prove that the proposed algorithm locally turns into a pure stochastic semismooth Newton method and converges r-linearly or r-superlinearly with high probability.
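To make the hybrid step described above concrete, here is a minimal Python sketch for the special case of an ℓ1 regularizer, φ(x) = λ‖x‖₁ (a hypothetical choice for illustration; the paper treats general nonsmooth convex terms). The oracle names `grad_batch` and `hess_batch` stand in for subsampled gradient and Hessian estimates and are assumptions, not the authors' API. The iteration attempts a semismooth Newton step on the natural residual F(x) = x − prox(x − t∇f(x)) and falls back to a proximal gradient step when the Newton system cannot be solved.

```python
import numpy as np

def prox_l1(z, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def hybrid_step(x, grad_batch, hess_batch, lam=0.1, t=1.0):
    """One hybrid iteration: try a semismooth Newton step on the
    natural residual F(x) = x - prox(x - t*g(x)); fall back to a
    proximal gradient step if the Newton system is singular."""
    g = grad_batch(x)             # subsampled gradient estimate
    H = hess_batch(x)             # subsampled Hessian estimate
    u = x - t * g
    r = x - prox_l1(u, t * lam)   # natural residual F(x)
    n = len(x)
    # An element of the generalized Jacobian of F at x:
    # M = I - D (I - t*H), where D is the diagonal Clarke Jacobian
    # of the soft-thresholding operator evaluated at u.
    d = (np.abs(u) > t * lam).astype(float)
    M = np.eye(n) - np.diag(d) @ (np.eye(n) - t * H)
    try:
        x_new = x - np.linalg.solve(M, r)   # semismooth Newton step
    except np.linalg.LinAlgError:
        x_new = prox_l1(u, t * lam)         # proximal gradient fallback
    return x_new, np.linalg.norm(r)
```

The sketch omits the growth-condition test that governs acceptance of the Newton step and the schedule for increasing the oracle sample sizes, both of which are central to the probabilistic convergence analysis in the paper.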
Acknowledgements
This work was supported by the Fundamental Research Fund—Shenzhen Research Institute for Big Data Startup Fund (Grant No. JCYJ-AM20190601), the Shenzhen Institute of Artificial Intelligence and Robotics for Society, the National Natural Science Foundation of China (Grant Nos. 11831002 and 11871135), the Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B121204008) and the Beijing Academy of Artificial Intelligence. The authors are grateful to the anonymous referees for their helpful comments and suggestions.
Cite this article
Milzarek, A., Xiao, X., Wen, Z. et al. On the local convergence of a stochastic semismooth Newton method for nonsmooth nonconvex optimization. Sci. China Math. 65, 2151–2170 (2022). https://doi.org/10.1007/s11425-020-1865-1
Keywords
- nonsmooth stochastic optimization
- stochastic approximation
- semismooth Newton method
- stochastic second-order information
- local convergence