Newton-type methods for non-convex optimization under inexact Hessian information

Abstract

We consider variants of trust-region and adaptive cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under certain conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, we establish iteration complexity bounds for achieving \(\varepsilon \)-approximate second-order optimality that are shown to be tight. Our Hessian approximation condition offers a range of advantages compared with prior works and allows for the direct construction of the approximate Hessian with a priori guarantees through various techniques, including randomized sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods.

References

  1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent (2016). ArXiv preprint arXiv:1611.01146

  2. Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)

  3. Beaton, A.E., Tukey, J.W.: The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16(2), 147–185 (1974)

  4. Bianconcini, T., Liuzzi, G., Morini, B., Sciandrone, M.: On the use of iterative methods in cubic regularization for unconstrained optimization. Comput. Optim. Appl. 60(1), 35–57 (2015)

  5. Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method for nonconvex optimization (2018). ArXiv preprint arXiv:1609.07428v3

  6. Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). ArXiv preprint arXiv:1609.08502

  7. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). ArXiv preprint arXiv:1606.04838

  8. Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step (2016). ArXiv preprint arXiv:1612.00547

  9. Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)

  10. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)

  11. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)

  12. Cartis, C., Gould, N.I., Toint, P.L.: Optimal Newton-type methods for nonconvex smooth optimization problems. Tech. rep., ERGO technical report 11-009, School of Mathematics, University of Edinburgh (2011)

  13. Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)

  14. Cartis, C., Gould, N.I., Toint, P.L.: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012)

  15. Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018)

  16. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models (2015). ArXiv preprint arXiv:1504.04231

  17. Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)

  18. Conn, A.R., Scheinberg, K., Vicente, L.N.: Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM J. Optim. 20(1), 387–415 (2009)

  19. Curtis, F.E., Robinson, D.P., Samadi, M.: An inexact regularized Newton framework with a worst-case iteration complexity of \({\mathscr {O}}(\varepsilon ^{-3/2}) \) for nonconvex optimization (2017). ArXiv preprint arXiv:1708.00475

  20. Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \( {\mathscr {O}}(\varepsilon ^{-3/2}) \) for nonconvex optimization. Math. Program. 162(1–2), 1–32 (2017)

  21. Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)

  22. Drineas, P., Mahoney, M.W.: RandNLA: randomized numerical linear algebra. Commun. ACM 59(6), 80–90 (2016)

  23. Erway, J.B., Gill, P.E., Griffin, J.D.: Iterative methods for finding a trust-region step. SIAM J. Optim. 20(2), 1110–1131 (2009)

  24. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, Berlin (2001)

  25. Garber, D., Hazan, E., Jin, C., Musco, C., Netrapalli, P., Sidford, A., et al.: Faster eigenvector computation via shift-and-invert preconditioning. In: International Conference on Machine Learning, pp. 2626–2634 (2016)

  26. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: online stochastic gradient for tensor decomposition. In: COLT, pp. 797–842 (2015)

  27. Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information (2017). ArXiv preprint arXiv:1710.05782

  28. Gould, N.I., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the Lanczos method. SIAM J. Optim. 9(2), 504–525 (1999)

  29. Gould, N.I., Robinson, D.P., Thorne, H.S.: On solving trust-region and other regularised subproblems in optimization. Math. Program. Comput. 2(1), 21–57 (2010)

  30. Grapiglia, G.N., Yuan, J., Yuan, Y.: On the worst-case complexity of nonlinear stepsize control algorithms for convex unconstrained optimization. Optim. Methods Softw. 31(3), 591–604 (2016)

  31. Gratton, S., Mouffe, M., Toint, P.L., Weber-Mendonça, M.: A recursive-trust-region method for bound-constrained nonlinear optimization. IMA J. Numer. Anal. 28(4), 827–861 (2008)

  32. Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2018)

  33. Gratton, S., Sartenaer, A., Toint, P.L.: Recursive trust-region methods for multiscale nonlinear optimization. SIAM J. Optim. 19(1), 414–444 (2008)

  34. Griewank, A.: The modification of Newton’s method for unconstrained optimization by bounding cubic terms. Technical Report NA/12. Department of Applied Mathematics and Theoretical Physics, University of Cambridge (1981)

  35. Gross, D., Nesme, V.: Note on sampling without replacing from a finite collection of matrices (2010). ArXiv preprint arXiv:1001.2738

  36. Hazan, E., Koren, T.: A linear-time algorithm for trust region problems. Math. Program. 158(1–2), 363–381 (2016)

  37. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization (2017). ArXiv preprint arXiv:1705.05933

  38. Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)

  39. Larson, J., Billups, S.C.: Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 64(3), 619–645 (2016)

  40. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)

  41. Lenders, F., Kirches, C., Potschka, A.: trlib: a vector-free implementation of the GLTR method for iterative solution of the trust region problem (2016). ArXiv preprint arXiv:1611.04718

  42. Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)

  43. Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)

  44. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)

  45. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  46. Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)

  47. Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electron. Trans. Numer. Anal. 42, 177–196 (2014)

  48. Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Sci. Comput. 36(5), S3–S22 (2014)

  49. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1), 293–326 (2019)

  50. Roosta-Khorasani, F., Székely, G.J., Ascher, U.: Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA J. Uncertain. Quantif. 3(1), 61–90 (2015)

  51. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)

  52. Shashaani, S., Hashemi, F., Pasupathy, R.: ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization (2016). ArXiv preprint arXiv:1610.06506

  53. Sorensen, D.: Minimization of a large-scale quadratic function subject to a spherical constraint. SIAM J. Optim. 7(1), 141–161 (1997)

  54. Sorensen, D.C.: Newton’s method with a model trust region modification. SIAM J. Numer. Anal. 19(2), 409–426 (1982)

  55. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)

  56. Steihaug, T.: The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal. 20(3), 626–637 (1983)

  57. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization (2017). ArXiv preprint arXiv:1711.02838

  58. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)

  59. Tropp, J.A.: An introduction to matrix concentration inequalities (2015). ArXiv preprint arXiv:1501.01571

  60. Woodruff, D.P.: Sketching as a tool for numerical linear algebra (2014). ArXiv preprint arXiv:1411.4357

  61. Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: an empirical study (2017). ArXiv preprint arXiv:1708.07827

  62. Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. In: Advances in Neural Information Processing Systems, pp. 3000–3008 (2016)

Acknowledgements

Fred Roosta gratefully acknowledges the generous support by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). He was also partially supported by the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923). Michael W. Mahoney would like to acknowledge ARO, DARPA, and NSF for providing partial support of this work.

Author information


Corresponding author

Correspondence to Fred Roosta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Intrinsic dimension and improving the sampling complexity (24)

We can further improve the sampling complexity (24) by considering the intrinsic dimension of the matrix \(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). Recall that for an SPSD matrix \( \mathbf {A}\in {\mathbb {R}}^{d \times d}\), the intrinsic dimension is defined as \( t(\mathbf {A}) = \text {tr}(\mathbf {A})/\Vert \mathbf {A}\Vert \), where \( \text {tr}(\mathbf {A}) \) is the trace of \( \mathbf {A}\). The intrinsic dimension can be regarded as a measure of the number of dimensions in which \(\mathbf {A}\) has a significant spectrum. It is easy to see that \( 1 \le t(\mathbf {A}) \le \text {rank}(\mathbf {A}) \le d \); see [59] for more details. Now let \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \) be the intrinsic dimension of the SPSD matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). We have the following improved sampling complexity result:
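As a quick numerical illustration (not part of the paper's analysis), the following NumPy sketch computes the intrinsic dimension of a toy SPSD matrix with a fast-decaying spectrum; all variable names here are ours.

```python
import numpy as np

def intrinsic_dimension(A):
    """Intrinsic dimension t(A) = tr(A) / ||A||_2 of an SPSD matrix A."""
    return np.trace(A) / np.linalg.norm(A, 2)

# Toy SPSD matrix with geometrically decaying eigenvalues 1, 1/2, 1/4, ...
d = 100
A = np.diag(2.0 ** -np.arange(d))

# t(A) = sum_k 2^{-k} / 1 ≈ 2, far below the nominal rank, reflecting that
# only a couple of dimensions carry significant spectrum.
t = intrinsic_dimension(A)
assert 1.0 <= t <= np.linalg.matrix_rank(A) <= d
```

In bounds like (26), it is this (often much smaller) quantity \(t\), rather than the ambient dimension \(d\), that enters the logarithmic factor.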

Lemma 18

(Complexity of non-uniform sampling: intrinsic dimension) The result of Lemma 17 holds with (24) replaced with

$$\begin{aligned} |{\mathscr {S}}| \ge \frac{16 \widehat{K}^2}{3\varepsilon ^2}\log \frac{8t}{\delta }, \end{aligned}$$
(26)

where \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \le d\) is the intrinsic dimension of the matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\).

Proof

It is easy to see that \(\text {Var}(\mathbf {X}) = {\mathbb {E}}(\mathbf {X}^{2}) = \sum _{j=1}^{|{\mathscr {S}}|} {\mathbb {E}}(\mathbf {X}_{j}^{2}) \preceq |{\mathscr {S}}| c \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\), where \( \mathbf {X}\) and c are as given in the proof of Lemma 17. For \(\mathbf {H}_{j} = \frac{b_{i}}{p_{i}} \mathbf {a}_{i} \mathbf {a}_{i}^{T}\), so that \(\mathbf {X}_{j} = \mathbf {H}_{j} - \mathbf {A}^{T} \mathbf {B}\mathbf {A}\), we have

$$\begin{aligned} \lambda _{\max }(\mathbf {X}_{j})&\le \Vert \mathbf {X}_{j}\Vert = \left\| \frac{b_{i}}{p_{i}} \mathbf {a}_{i} \mathbf {a}_{i}^{T} - \mathbf {A}^{T} \mathbf {B}\mathbf {A}\right\| = \left\| \left( \frac{1-p_{i}}{p_{i}}\right) b_{i} \mathbf {a}_{i} \mathbf {a}_{i}^{T} - \sum _{j \ne i} b_{j} \mathbf {a}_{j} \mathbf {a}_{j}^{T} \right\| \\&\le \left( \frac{1-p_{i}}{p_{i}}\right) |b_{i}| \Vert \mathbf {a}_{i}\Vert ^{2} + \sum _{j \ne i} |b_{j}| \Vert \mathbf {a}_{j}\Vert ^{2} \\&= \left( 1-p_{i}\right) \sum _{j=1}^{n} |b_{j}| \Vert \mathbf {a}_{j}\Vert ^{2} + \sum _{j \ne i} |b_{j}| \Vert \mathbf {a}_{j}\Vert ^{2} \\&= 2 \sum _{j \ne i} |b_{j}| \Vert \mathbf {a}_{j}\Vert ^{2} \le 2 \sum _{j = 1}^{n} |b_{j}| \Vert \mathbf {a}_{j}\Vert ^{2} = 2 c. \end{aligned}$$

Hence, if \(\varepsilon |{\mathscr {S}}| \ge \sqrt{|{\mathscr {S}}| c \Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert } + 2 c/3\), we can apply the matrix Bernstein inequality with intrinsic dimension [59, Theorem 7.7.1] to get, for \(\varepsilon \le 1/2\),

$$\begin{aligned} \Pr \left( \lambda _{\max }(\mathbf {X}) \ge \varepsilon |{\mathscr {S}}| \right)&\le 4 t \exp \left\{ \frac{-\varepsilon ^2 |{\mathscr {S}}|}{2 c \Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert + 4 c \varepsilon /3}\right\} \le 4 t \exp \left\{ \frac{-3 \varepsilon ^2 |{\mathscr {S}}|}{16 c^{2}}\right\} . \end{aligned}$$

Applying the same bound for \(\mathbf {Y}_{j} = - \mathbf {X}_{j}\) and \(\mathbf {Y}= \sum _{j=1}^{s} \mathbf {Y}_{j}\), followed by the union bound, we get the desired result. \(\square \)
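The non-uniform sampling scheme analyzed in this appendix can be illustrated numerically. The sketch below (synthetic data, our own variable names, not the paper's implementation) samples rows with probabilities \(p_i \propto |b_i|\Vert \mathbf {a}_i\Vert ^2\) and forms the importance-weighted estimator of \(\mathbf {A}^{T}\mathbf {B}\mathbf {A}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.standard_normal((n, d))
b = rng.uniform(-1.0, 2.0, n)           # diagonal of B; signs may be mixed

H = A.T @ (b[:, None] * A)              # exact A^T B A

# Sampling probabilities p_i proportional to |b_i| * ||a_i||^2.
scores = np.abs(b) * np.sum(A * A, axis=1)
p = scores / scores.sum()

s = 2_000                               # |S|; in practice set via a bound like (26)
idx = rng.choice(n, size=s, p=p)        # sample row indices with replacement
w = b[idx] / (s * p[idx])               # importance weights b_i / (|S| p_i)
H_hat = A[idx].T @ (w[:, None] * A[idx])  # unbiased estimator of A^T B A

rel_err = np.linalg.norm(H_hat - H, 2) / np.linalg.norm(H, 2)
```

Since \({\mathbb {E}}[(b_i/p_i)\,\mathbf {a}_i \mathbf {a}_i^{T}] = \sum _j b_j \mathbf {a}_j \mathbf {a}_j^{T}\), the estimator is unbiased, and the spectral-norm error concentrates at the rate the matrix Bernstein argument above predicts as \(|{\mathscr {S}}|\) grows.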

Appendix B: Computation of approximate negative curvature direction

Throughout our analysis, we assume that, if sufficiently negative curvature exists, i.e., \( \lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\) for some \( \varepsilon _{H} \in (0,1) \), we can approximately compute a corresponding negative curvature direction \( \mathbf {u}\), i.e., a vector satisfying \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le - \nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\) for some \( \nu \in (0,1) \). This can be done efficiently by applying a variety of methods, such as Lanczos [38] or shift-and-invert [25], to the SPSD matrix \( {\tilde{\mathbf {H}}} = K_{H} \mathbf {I}- \mathbf {H}\). These methods employ only matrix-vector products and, hence, are suitable for large-scale problems. More specifically, for any \( \kappa \in (0,1) \), these methods, using \( {\mathscr {O}}(\log (d/\delta )\sqrt{K_{H}/\kappa }) \) matrix-vector products, yield with probability \( 1-\delta \) a vector \( \mathbf {u}\) satisfying \(K_{H}\Vert \mathbf {u}\Vert ^{2} - \langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle = \langle \mathbf {u}, {\tilde{\mathbf {H}}} \mathbf {u}\rangle \ge \kappa \lambda _{\max }({\tilde{\mathbf {H}}}) \Vert \mathbf {u}\Vert ^{2} = \kappa (K_{H} - \lambda _{\min }(\mathbf {H})) \Vert \mathbf {u}\Vert ^{2}\). Rearranging, we obtain \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le (1-\kappa ) K_{H} \Vert \mathbf {u}\Vert ^{2} + \kappa \lambda _{\min }(\mathbf {H}) \Vert \mathbf {u}\Vert ^{2}\). Setting \(\nu = \kappa /2\) with \(\kappa \ge (2 K_{H})/(2 K_{H} + \varepsilon _{H})\), and using \(\lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\), gives \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le (1-\kappa ) K_{H} \Vert \mathbf {u}\Vert ^{2} - \kappa \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2} \le -\nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\).
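The computation described above can be sketched with plain power iteration on the shifted matrix \({\tilde{\mathbf {H}}} = K_{H} \mathbf {I}- \mathbf {H}\). The cited solvers (Lanczos [38], shift-and-invert [25]) converge faster; this toy stand-in, with our own names and parameters, only illustrates that the top eigenvector of \({\tilde{\mathbf {H}}}\) recovers a negative curvature direction of \(\mathbf {H}\) using \(\mathbf {H}\)-vector products alone.

```python
import numpy as np

def negative_curvature_direction(H, K_H, iters=500, seed=0):
    """Approximate the most negative eigendirection of H via power
    iteration on the SPSD shift H_tilde = K_H * I - H (a minimal,
    slow stand-in for Lanczos / shift-and-invert)."""
    d = H.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = K_H * u - H @ u          # apply H_tilde using only H-products
        u /= np.linalg.norm(u)
    return u

# Toy example: H has smallest eigenvalue -1, so negative curvature
# exists for any eps_H <= 1.
H = np.diag(np.linspace(-1.0, 3.0, 50))
K_H = np.linalg.norm(H, 2)           # here K_H = ||H|| = 3
u = negative_curvature_direction(H, K_H)
curv = u @ H @ u                     # Rayleigh quotient, close to -1
```

Because each step only needs a product with \(\mathbf {H}\), the same sketch applies when \(\mathbf {H}\) is available only implicitly, e.g., as a sub-sampled Hessian-vector oracle.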


About this article

Cite this article

Xu, P., Roosta, F. & Mahoney, M.W. Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program. 184, 35–70 (2020). https://doi.org/10.1007/s10107-019-01405-z

Keywords

  • Non-convex optimization
  • Inexact Hessian
  • Trust region
  • Cubic regularization
  • Randomized numerical linear algebra

Mathematics Subject Classification

  • 49M15
  • 65K05
  • 90C25
  • 90C06