Generalized self-concordant functions: a recipe for Newton-type methods

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

We study the smooth structure of convex functions by generalizing a powerful concept, self-concordance, introduced by Nesterov and Nemirovskii in the early 1990s, to a broader class of convex functions that we call generalized self-concordant functions. This notion allows us to develop a unified framework for designing Newton-type methods to solve convex optimization problems. The proposed theory provides a mathematical tool to analyze both local and global convergence of Newton-type methods without imposing unverifiable assumptions, as long as the underlying functionals fall into our class of generalized self-concordant functions. First, we introduce the class of generalized self-concordant functions, which covers the class of standard self-concordant functions as a special case. Next, we establish several properties and key estimates of this function class that can be used to design numerical methods. Then, we apply this theory to develop several Newton-type methods for solving a class of smooth convex optimization problems involving generalized self-concordant functions. We provide an explicit step-size for a damped-step Newton-type scheme that guarantees global convergence without any globalization strategy. We also prove local quadratic convergence of this method and of its full-step variant without requiring the Lipschitz continuity of the Hessian of the objective. We then extend our results to develop proximal Newton-type methods for a class of composite convex minimization problems involving generalized self-concordant functions, again obtaining both global and local convergence without additional assumptions. Finally, we verify our theoretical results via several numerical examples and compare them with existing methods.

References

  1. Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)

  2. Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15(1), 595–627 (2014)

  3. Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operators Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)

  6. Becker, S., Fadili, M.J.: A quasi-Newton proximal splitting method. In: Proceedings of Neural Information Processing Systems Foundation (NIPS) (2012)

  7. Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). arXiv preprint arXiv:1609.08502

  8. Bonnans, J.F.: Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim. 29, 161–186 (1994)

  9. Borodin, A., El-Yaniv, R., Gogan, V.: Can we learn to beat the best stock. J. Artif. Intell. Res. (JAIR) 21, 579–594 (2004)

  10. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  11. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  12. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  13. Chen, T.-Y., Demmel, J.W.: Balancing sparse matrices for computing eigenvalues. Linear Algebra Appl. 309(1–3), 261–287 (2000)

  14. Cohen, M., Madry, A., Tsipras, D., Vladu, A.: Matrix scaling and balancing via box constrained Newton’s method and interior-point methods. The 58th Annual IEEE Symposium on Foundations of Computer Science, pp. 902–913 (2017)

  15. Dennis, J.E., Moré, J.J.: A characterisation of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28, 549–560 (1974)

  16. Deuflhard, P.: Newton Methods for Nonlinear Problems—Affine Invariance and Adaptative Algorithms, volume 35 of Springer Series in Computational Mathematics, 2nd edn. Springer, Berlin (2006)

  17. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)

  18. Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, pp. 3052–3060 (2015)

  19. Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358

  20. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)

  21. Friedlander, M., Goh, G.: Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal. 46, 1–22 (2017)

  22. Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without line search for self-concordant functions (2016). arXiv preprint arXiv:1612.06965

  23. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: IEEE Conference on Decision and Control, Los Angeles, USA, December 2014, pp. 5058–5063. CDC

  24. Goel, V., Grossmann, I.E.: A class of stochastic programs with decision dependent uncertainty. Math. Program. 108, 355–394 (2006)

  25. Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization From Theory to Implementation, Nonconvex Optimization and its Applications, pp. 155–210. Springer, Berlin (2006)

  26. Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions (2009)

  27. Hazan, E., Arora, S.: Efficient Algorithms for Online Convex Optimization and their Applications. Princeton University, Princeton (2006)

  28. He, N., Harchaoui, Z., Wang, Y., Song, L.: Fast and simple optimization for Poisson likelihood models (2016). arXiv preprint arXiv:1608.01264

  29. Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, New York (2005)

  30. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)

  31. Krishnapuram, B., Figueiredo, M., Carin, L., Hartemink, H.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 27, 957–968 (2005)

  32. Kyrillidis, A., Karimi, R., Tran-Dinh, Q., Cevher, V.: Scalable sparse covariance estimation via self-concordance. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 1946–1952 (2014)

  33. Lebanon, G., Lafferty, J.: Boosting and maximum likelihood for exponential models. Adv. Neural Inf. Process. Syst. (NIPS) 14, 447 (2002)

  34. Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for convex optimization. SIAM J. Optim. 24(3), 1420–1443 (2014)

  35. Lu, Z.: Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2016)

  36. Marron, J.S., Todd, M.J., Ahn, J.: Distance-weighted discrimination. J. Am. Stat. Assoc. 102(480), 1267–1271 (2007)

  37. McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. CRC Press, Boca Raton (1989)

  38. Monteiro, R.D.C., Sicre, M.R., Svaiter, B.F.: A hybrid proximal extragradient self-concordant primal barrier method for monotone variational inequalities. SIAM J. Optim. 25(4), 1965–1996 (2015)

  39. Nelder, J.A., Baker, R.J.: Generalized Linear Models. Encyclopedia of Statistical Sciences. Wiley, New York (1972)

  40. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)

  41. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  42. Nesterov, Y.: Cubic regularization of Newton’s method for convex problems with constraints. CORE Discussion Paper 2006/39, Catholic University of Louvain (UCL) - Center for Operations Research and Econometrics (CORE) (2006)

  43. Nesterov, Y.: Accelerating the cubic regularization of Newton's method on convex problems. Math. Program. 112, 159–181 (2008)

  44. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)

  45. Nesterov, Y., Nemirovski, A.: Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, Philadelphia (1994)

  46. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 112(1), 177–205 (2006)

  47. Nocedal, J., Wright, S.J.: Numerical Optimization, Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, Berlin (2006)

  48. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)

  49. Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank–Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)

  50. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, Philadelphia (2000)

  51. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  52. Parlett, B.N., Landis, T.L.: Methods for scaling to doubly stochastic form. Linear Algebra Appl. 48, 53–79 (1982)

  53. Peng, J., Roos, C., Terlaky, T.: Self-Regularity. A New Paradigm for Primal-Dual Interior-Point Algorithms. Princeton University Press, Princeton (2009)

  54. Pilanci, M., Wainwright, M.J.: Newton sketch: a linear-time optimization algorithm with linear-quadratic convergence (2015). arXiv preprint arXiv:1505.02250

  55. Polyak, R.A.: Regularized Newton method for unconstrained convex optimization. Math. Program. 120(1), 125–145 (2009)

  56. Robinson, S.M.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)

  57. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods I: globally convergent algorithms (2016). arXiv preprint arXiv:1601.04737

  58. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods II: local convergence rates (2016). arXiv preprint arXiv:1601.04738

  59. Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)

  60. Toh, K.-Ch., Todd, M.J., Tütüncü, R.H.: On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming. Tech. Report 4, NUS Singapore (2010)

  61. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: A proximal Newton framework for composite minimization: graph learning without Cholesky decompositions and matrix inversions. JMLR W&CP 28(2), 271–279 (2013)

  62. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)

  63. Tran-Dinh, Q., Li, Y.-H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Pham Dinh, T., Le-Thi, H., Nguyen, N. (eds.) Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, New York (2015)

  64. Tran-Dinh, Q., Necoara, I., Diehl, M.: A dual decomposition algorithm for separable nonconvex optimization using the penalty function framework. In: Proceedings of the Conference on Decision and Control (CDC), Florence, Italy, December, pp. 2372–2377 (2013)

  65. Tran-Dinh, Q., Necoara, I., Diehl, M.: Path-following gradient-based decomposition algorithms for separable convex optimization. J. Global Optim. 59(1), 59–80 (2014)

  66. Tran-Dinh, Q., Necoara, I., Savorgnan, C., Diehl, M.: An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim. 23(1), 95–125 (2013)

  67. Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Technical Report (submitted) (2016)

  68. Vapnik, V.N., Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)

  69. Verscheure, D., Demeulenaere, B., Swevers, J., De Schutter, J., Diehl, M.: Time-optimal path tracking for robots: a convex optimization approach. IEEE Trans. Autom. Control 54, 2318–2327 (2009)

  70. Yamashita, M., Fujisawa, K., Kojima, M.: Implementation and evaluation of SDPA 6.0 (SemiDefinite Programming Algorithm 6.0). Optim. Method Softw. 18, 491–505 (2003)

  71. Yang, T., Lin, Q.: RSG: beating SGD without smoothness and/or strong convexity. CoRR abs/1512.03107 (2016)

  72. Zhang, Y., Lin, X.: DiSCO: Distributed optimization for self-concordant empirical loss. In: Proceedings of the 32th International Conference on Machine Learning, pp. 362–370 (2015)

Acknowledgements

This work is partially supported by the NSF-grant No. DMS-1619884, USA.

Author information

Corresponding author

Correspondence to Quoc Tran-Dinh.

Appendix: The proof of technical results

This appendix provides the full proofs of the technical results stated in this paper, including the proofs omitted from the main text, together with the complete convergence analysis of the Newton-type methods presented there.

1.1 The proof of Proposition 6: Fenchel’s conjugate

Let us consider the set \(\mathcal {X}:= \{x\in \mathbb {R}^p \mid f(u) - \langle x, u\rangle ~\text {is bounded from below on}~\mathrm {dom}(f)\}\). We first show that \(\mathrm {dom}(f^{*}) = \mathcal {X}\).

By the definition of \(\mathrm {dom}(f^{*})\), we have \(\mathrm {dom}(f^{*}) = \left\{ x\in \mathbb {R}^p \mid f^{*}(x) < +\,\infty \right\} \). Take any \(x\in \mathrm {dom}(f^{*})\), one has \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} <+\,\infty \). Hence, \(f(u) - \langle x,u\rangle \ge -f^{*}(x) > -\infty \) for all \(u\in \mathrm {dom}(f)\) which implies \(x\in \mathcal {X}\).

Conversely, assume that \(x\in \mathcal {X}\). By the definition of \(\mathcal {X}\), \(f(u)-\langle x,u\rangle \) is bounded from below for all \(u\in \mathrm {dom}(f)\). That is, there exists \(M \in [0, +\,\infty )\), such that \(f(u) - \langle x, u\rangle \ge -M\) for all \(u\in \mathrm {dom}(f)\). By the definition of the conjugate, \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} \le M < +\,\infty \). Hence, \(x\in \mathrm {dom}(f^{*})\).

For any \(x\in \mathrm {dom}(f^{*})\), the optimality condition of \(\max _{u}\left\{ \langle x, u\rangle - f(u)\right\} \) is \(x = \nabla {f}(u)\). Let us denote \(x(u) := \nabla {f}(u)\). Then, we have \(f^{*}(x(u)) = \langle x(u), u\rangle - f(u)\). Taking the derivative of \(f^{*}\) with respect to \(x\) on both sides, and using \(x(u)=\nabla f(u)\), we have

$$\begin{aligned} \nabla _x f^{*}(x(u)) = u + u'_xx(u) - u'_x\nabla f(u) = u. \end{aligned}$$

We further differentiate the above relation with respect to \(u\) to get

$$\begin{aligned} \nabla ^2f^{*}(x(u))x_u'(u) = \mathbb {I}. \end{aligned}$$

Using the two relations above and the fact that \(x_u'(u) = \nabla ^2{f}(u)\), we can derive

$$\begin{aligned} \langle \nabla f^{*}(x(u)),x_u'(u)v\rangle&= \langle u,x_u'(u)v\rangle = \langle \nabla ^2f(u)v, u\rangle \end{aligned}$$
(50)
$$\begin{aligned} \langle \nabla ^2{f^{*}}(x(u))x_u'(u)v, x_u'(u)w\rangle&= \langle v, x_u'(u)w\rangle = \langle \nabla ^2f(u)v, w\rangle , \end{aligned}$$
(51)

where \(u\in \mathrm {dom}(f)\), and \(v, w\in \mathbb {R}^p\). Using (50) and (51), we can compute the third-order derivative of \(f^{*}\) with respect to x(u) as

$$\begin{aligned} \langle \nabla ^3 f^{*}(x(u))[x_u'(u)w]x_u'(u)v,&x_u'(u)v\rangle = \langle \left( \langle \nabla ^2{f}^{*}(x(u))x_u'(u)v, x_u'(u)v\rangle \right) '_{u}, w\rangle \nonumber \\&- 2\langle \nabla ^2 f^{*}(x(u))x_u'(u)v, (x_u'(u)v)'_uw\rangle \nonumber \\&\overset{\tiny (50)}{=} \langle (\langle x_u'(u)v,v\rangle )'_u,w\rangle \nonumber \\&-2\langle \nabla ^2f^{*}(x(u))x_u'(u)v, (x_u'(u)v)'_uw\rangle \nonumber \\&\overset{\tiny (51)}{=} \langle \nabla ^3 f(u)[w]v,v\rangle -2\langle (x_u'(u)v)_u'w,v\rangle \nonumber \\&= -\langle \nabla ^3 f(u)[w]v,v\rangle . \end{aligned}$$
(52)

Denote \(\xi := x_u'(u)w\) and \(\eta := x_u'(u)v\). Since \(x_u'(u) = \nabla ^2{f}(u)\), we have \(\xi = \nabla ^2{f}(u)w\), \(\eta = \nabla ^2{f}(u)v\), and \(w = \nabla ^2{f}(u)^{-1}\xi \). Using these relations and \(\nabla ^2f^{*}(x(u))x_u'(u) = \mathbb {I}\), we can derive

$$\begin{aligned} \begin{array}{ll} \vert \langle \nabla ^3{f^{*}}(x(u))[\xi ]\eta , \eta \rangle \vert &{}\overset{\tiny (52)}{=} \vert \langle \nabla ^3{f}(u)[w]v, v\rangle \vert \overset{\tiny (5)}{\le } M_f\left\| v\right\| _u^2\left\| w\right\| _u^{\nu - 2}\left\| w\right\| _2^{3-\nu } \\ &{}= M_f\langle \nabla ^2f(u)v, v\rangle \langle \nabla ^2{f}(u)w, w\rangle ^{\frac{\nu -2}{2}}\Vert w\Vert ^{3-\nu }_2 \\ &{}= M_f\langle \eta , \nabla ^2f^{*}(x(u))x'(u)v\rangle \\ &{}\quad \langle \xi , \nabla ^2f^{*}(x(u))x'(u)w\rangle ^{\frac{\nu -2}{2}}\Vert \nabla ^2{f}(u)^{-1}\xi \Vert _2^{3-\nu } \\ &{}= M_f\langle \nabla ^2f^{*}(x(u))\eta , \eta \rangle \langle \nabla ^2f^{*}(x(u))\xi , \xi \rangle ^{\frac{\nu -2}{2}}\\ &{}\quad \Vert \nabla ^2f^{*}(x(u))\xi \Vert _2^{3-\nu }. \end{array} \end{aligned}$$

For any \(H\in \mathcal {S}^p_{++}\), we have \(\langle H\xi , \xi \rangle \le \left\| H\xi \right\| _2\left\| \xi \right\| _2\). For any \(\nu \ge 3\), this inequality leads to

$$\begin{aligned} \langle H\xi , \xi \rangle ^{\frac{\nu -2}{2}}\left\| H\xi \right\| ^{3-\nu } \le \langle H\xi ,\xi \rangle ^{\frac{4-\nu }{2}}\left\| \xi \right\| _2^{\nu -3}. \end{aligned}$$

Using this inequality with \(H = \nabla ^2f^{*}(x(u))\) into the last expression, we obtain

$$\begin{aligned} \begin{array}{ll} \vert \langle \nabla ^3{f^{*}}(x(u))[\xi ]\eta , \eta \rangle \vert &{}\le M_f\langle \nabla ^2f^{*}(x(u))\eta , \eta \rangle \langle \nabla ^2f^{*}(x(u))\xi , \xi \rangle ^{\frac{4 - \nu }{2}}\left\| \xi \right\| _2^{\nu -3}\\ &{}= M_f\Vert \eta \Vert _{x(u)}^2\left\| \xi \right\| _{x(u)}^{4-\nu }\Vert \xi \Vert _2^{\nu - 3}. \end{array} \end{aligned}$$

By Definition 2, we need \(\nu - 3 = 3 - \nu _{*}\) and \(4-\nu = \nu _{*} - 2\) which hold if \(\nu _{*} = 6 - \nu \). Under the choice of \(\nu _{*}\), the above inequality shows that \(f^{*}\) is \((M_{f^{*}}, \nu _{*})\)-generalized self-concordant with \(M_{f^{*}} = M_f\) and \(\nu _{*} = 6 - \nu \). However, to guarantee \(\nu - 3 \ge 0\) and \(6 - \nu > 0\), we require \(3 \le \nu < 6\).

Finally, we prove the case of univariate functions, i.e., \(p = 1\). Indeed, we have

$$\begin{aligned} x(u)=f'(u),~~ (f^{*})'(x(u))=u,~~\text {and}~~(f^{*})''(x(u))x'(u)=1. \end{aligned}$$
(53)

Here, \(f'\) is the derivative of f with respect to u. Taking the derivative of the last equation on both sides with respect to u, we obtain

$$\begin{aligned} (f^{*})'''(x(u))(x'(u))^2+(f^{*})''(x(u))x''(u)=0. \end{aligned}$$

Solving this equation for \((f^{*})'''(x(u))\) and then using (53) and \(x''(u) = f'''(u)\), we get

$$\begin{aligned} \begin{array}{rl} \left| (f^{*})'''(x(u))\right| =&{} \left| \frac{(f^{*})''(x(u))x''(u)}{(x'(u))^2}\right| = \left| ((f^{*})''(x(u)))^3f'''(u)\right| \\ \le &{} M_f\left| ((f^{*})''(x(u)))^3(f''(u))^{\frac{\nu }{2}}\right| = M_f((f^{*})''(x(u)))^{\frac{6-\nu }{2}}. \end{array} \end{aligned}$$

This inequality shows that \(f^{*}\) is generalized self-concordant with \(\nu _{*} = 6 - \nu \) for any \(\nu \in (0, 6)\). \(\square \)
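
For the univariate case, the identity \((f^{*})'''(x(u)) = -\big ((f^{*})''(x(u))\big )^3 f'''(u)\) derived above is easy to check symbolically. The following SymPy sketch (ours, illustrative only and not part of the proof) verifies it for the example \(f(u) = e^{u}\), whose conjugate is \(f^{*}(x) = x\ln x - x\):

```python
# Minimal SymPy check (illustrative only) of the univariate identity
#   (f*)'''(x(u)) = -((f*)''(x(u)))^3 * f'''(u),   x(u) = f'(u),
# for the example f(u) = exp(u), whose conjugate is f*(x) = x*log(x) - x.
import sympy as sp

u, x = sp.symbols('u x', positive=True)

f = sp.exp(u)                      # an illustrative smooth convex function
f_conj = x * sp.log(x) - x         # its Fenchel conjugate
x_of_u = sp.diff(f, u)             # x(u) = f'(u)

lhs = sp.diff(f_conj, x, 3).subs(x, x_of_u)                        # (f*)'''(x(u))
rhs = -(sp.diff(f_conj, x, 2).subs(x, x_of_u))**3 * sp.diff(f, u, 3)

print(sp.simplify(lhs - rhs))      # prints 0
```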

1.2 The proof of Corollary 2: bound on the mean of Hessian operator

Let \(y_{\tau } := x+ \tau (y- x)\). Then \(d_{\nu }(x, y_{\tau }) = \tau d_{\nu }(x, y)\). By (15), we have \(\nabla ^2{f}(x+ \tau (y- x)) \preceq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}\nabla ^2{f}(x)\) and \(\nabla ^2{f}(x+ \tau (y- x)) \succeq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}\nabla ^2{f}(x)\). Hence, we have

$$\begin{aligned} \underline{I}_{\nu }(x,y)\nabla ^2{f}(x) \preceq \int _0^1\nabla ^2{f}(x+ \tau (y- x))d\tau \preceq \overline{I}_{\nu }(x, y)\nabla ^2{f}(x), \end{aligned}$$

where \(\underline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}d\tau \) and \(\overline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}d\tau \) are the two integrals in the above inequality. Computing these integrals explicitly, we can show that

  • If \(\nu = 4\), then \(\underline{I}_{\nu }(x,y) = \frac{1 - (1 - d_4(x,y))^2}{2d_4(x,y)}\) and \(\overline{I}_{\nu }(x, y) = \frac{-\ln (1 - d_4(x,y))}{d_4(x,y)}\).

  • If \(\nu \ne 4\), then we can easily compute \(\underline{I}_{\nu }(x, y) = \frac{(\nu -2)}{\nu d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu }{\nu -2}}\right) \), and \(\overline{I}_{\nu }(x, y) = \frac{(\nu -2)}{(\nu -4)d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu -4}{\nu -2}}\right) \).

Hence, we obtain (18).

Finally, we prove for the case \(\nu = 2\). Indeed, by (16), we have \(e^{-d_2(x,y_{\tau })}\nabla ^2f(x) \preceq \nabla ^2f(y_{\tau }) \preceq e^{d_2(x,y_{\tau })}\nabla ^2f(x)\). Since \(d_2(x, y_{\tau }) = \tau d_2(x, y)\), the last estimate leads to

$$\begin{aligned} \left( \int _0^1e^{-d_2(x,y)\tau }d\tau \right) \nabla ^2f(x) \preceq \int _0^1\nabla ^2f(y_{\tau })d\tau \preceq \left( \int _0^1e^{d_2(x,y)\tau }d\tau \right) \nabla ^2f(x), \end{aligned}$$

which is exactly (18). \(\square \)
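
The closed-form expressions for \(\underline{I}_{\nu }(x,y)\) and \(\overline{I}_{\nu }(x,y)\) above are easy to confirm numerically. Below is a small sketch (ours, illustrative only; the helper name I_closed is an assumption) comparing them with direct quadrature of \((1-\tau d)^{\pm \frac{2}{\nu -2}}\) over \(\tau \in [0,1]\):

```python
# Numerical sanity check (not part of the proof): the closed-form values of the
# integrals I_lower and I_upper above should match direct quadrature of
# (1 - tau*d)^{+/- 2/(nu-2)} over tau in [0, 1], for nu > 2 and d in (0, 1).
import numpy as np
from scipy.integrate import quad

def I_closed(nu, d):
    if abs(nu - 4.0) < 1e-12:
        lower = (1.0 - (1.0 - d)**2) / (2.0 * d)
        upper = -np.log(1.0 - d) / d
    else:
        lower = (nu - 2.0) / (nu * d) * (1.0 - (1.0 - d)**(nu / (nu - 2.0)))
        upper = (nu - 2.0) / ((nu - 4.0) * d) * (1.0 - (1.0 - d)**((nu - 4.0) / (nu - 2.0)))
    return lower, upper

for nu in (2.5, 3.0, 4.0):
    d = 0.4
    low_q = quad(lambda t: (1.0 - t * d)**(2.0 / (nu - 2.0)), 0.0, 1.0)[0]
    up_q = quad(lambda t: (1.0 - t * d)**(-2.0 / (nu - 2.0)), 0.0, 1.0)[0]
    print(nu, np.allclose(I_closed(nu, d), (low_q, up_q)))
```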

1.3 Technical lemmas

The following lemmas will be used in our analysis. Lemma 1 is elementary, but we provide its proof for completeness.

Lemma 1

  1. (a)

    For a fixed \(r \ge 1\) and \(\bar{t} \in (0, 1)\), consider a function \(\psi _r(t) := \frac{1 - (1-t)^r - rt(1-t)^r}{rt^2(1-t)^r}\) on \(t\in (0, 1)\). Then, \(\psi _r\) is positive and increasing on \((0, \bar{t}]\) and

    $$\begin{aligned} \lim _{t\rightarrow 0^{+}}\psi _r(t) = \tfrac{r+1}{2},~~\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty ,~~\text {and} ~~~\sup _{0 \le t \le \bar{t}}\left| \psi _r(t)\right| \le \bar{C}_r(\bar{t}) < +\,\infty , \end{aligned}$$

    where \(\bar{C}_r(\bar{t}) := \frac{1 - (1-\bar{t})^r - r\bar{t}(1-\bar{t})^r}{r\bar{t}^2(1-\bar{t})^r} \in (0, +\,\infty )\).

  2. (b)

    For \(t > 0\), we also have \(\frac{e^{t} - 1 - t}{t} \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\).

Proof

The statement \(\mathrm {(b)}\) is rather elementary, so we only prove \(\mathrm {(a)}\). Since \(r \ge 1\), we have \(\lim _{t\rightarrow 0^{+}}(1 - (1-t)^r - rt(1-t)^r) = \lim _{t\rightarrow 0^{+}}rt^2(1-t)^r = 0\) and \(rt^2(1-t)^r > 0\) for \(t\in (0, 1)\). Applying L'Hôpital's rule, we have

$$\begin{aligned} \lim _{t\rightarrow 0^+}\psi _r(t)= & {} \frac{\lim _{t\rightarrow 0^+}r(r+1)t(1-t)^{r-1}}{\lim _{t\rightarrow 0^+}rt(2-(2+r)t)(1-t)^{r-1}}\\= & {} \frac{\lim _{t\rightarrow 0^+}(r+1)}{\lim _{t\rightarrow 0^+}(2-(2+r)t)}=\frac{r+1}{2}. \end{aligned}$$

The limit \(\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty \) is obvious.

Next, it is easy to compute \(\psi '_r(t) = \frac{(1-t)^{r+1}(rt+2)+(r+2)t-2}{rt^3(1-t)^{r+1}}\). Let \(m_r(t) := (1-t)^{r+1}(rt+2)+(r+2)t-2\) be the numerator of \(\psi '_r(t)\).

We have \(m_r'(t) = r+2 - (1-t)^r(r^2t+2rt+r+2)\), and \(m_r''(t) = r(r+1)(r+2)t(1-t)^{r-1}\). Clearly, since \(r \ge 1\), \(m_r''(t) \ge 0\) for \(t \in [0, 1]\). This implies that \(m_r'\) is nondecreasing on [0, 1]. Hence, \(m_r'(t) \ge m_r'(0) = 0\) for all \(t \in [0, 1]\). Consequently, \(m_r\) is nondecreasing on [0, 1]. Therefore, \(m_r(t) \ge m_r(0) = 0\) for all \(t\in [0, 1]\). Using the formula of \(\psi '_r\), we can see that \(\psi '_r(t) \ge 0\) for all \(t \in (0, 1)\). This implies that \(\psi _r\) is nondecreasing on (0, 1). Moreover, \(\lim _{t\rightarrow 0^+}\psi _r(t) = \frac{r+1}{2} > 0\). Hence, \(\psi _r(t) > 0\) for all \(t\in (0, 1)\). This implies that \(\psi _r\) is bounded on \((0, \bar{t}] \subset (0, 1)\) by \(\psi _r(\bar{t})\). \(\square \)
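
A quick numeric sanity check of Lemma 1(a) (ours, illustrative only and not part of the proof) confirms the limit \(\tfrac{r+1}{2}\) at \(0^{+}\) together with the positivity and monotonicity of \(\psi _r\) on a grid:

```python
# Quick numerical check (illustrative, not part of the proof) of Lemma 1(a):
# psi_r is positive and nondecreasing on (0, 1) and psi_r(t) -> (r+1)/2 as t -> 0+.
import numpy as np

def psi(r, t):
    return (1.0 - (1.0 - t)**r - r * t * (1.0 - t)**r) / (r * t**2 * (1.0 - t)**r)

for r in (1.0, 2.0, 5.0):
    t = np.linspace(1e-4, 0.99, 2000)
    vals = psi(r, t)
    print(r,
          np.isclose(vals[0], (r + 1.0) / 2.0, rtol=1e-3),  # limit at 0+
          bool(np.all(np.diff(vals) >= 0)),                 # nondecreasing
          bool(np.all(vals > 0)))                           # positivity
```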

Similar to Corollary 2, we can prove the following lemma on the bound of the Hessian difference.

Lemma 2

Given \(x, y\in \mathrm {dom}(f)\), the matrix \(H(x,y)\) defined by

$$\begin{aligned} H(x,y) := \nabla ^2f(x)^{-1/2}\left[ \int _0^1\big (\nabla ^2{f}(x+ \tau (y-x)) - \nabla ^2f(x)\big )d\tau \right] \nabla ^2f(x)^{-1/2},\qquad \end{aligned}$$
(54)

satisfies

$$\begin{aligned} \Vert H(x,y) \Vert \le R_{\nu }\left( d_{\nu }(x, y)\right) d_{\nu }(x, y), \end{aligned}$$
(55)

where \(R_{\nu }(t)\) is defined as follows for \(t \in [0, 1)\):

$$\begin{aligned} R_{\nu }(t) := {\left\{ \begin{array}{ll} \left( \frac{3}{2} + \frac{t}{3}\right) e^t &{}\quad \text {if}\,\, \nu = 2\\ \frac{1 - (1-t)^{\frac{4-\nu }{\nu -2}} - \left( \frac{4-\nu }{\nu -2}\right) t(1-t)^{\frac{4-\nu }{\nu -2}}}{\left( \frac{4-\nu }{\nu -2}\right) t^2(1-t)^{\frac{4-\nu }{\nu -2}}} &{}\quad \text {if}\, 2 < \nu \le 3. \end{array}\right. } \end{aligned}$$
(56)

Moreover, for a fixed \(\bar{t} \in (0, 1)\), we have \(\sup _{0 \le t \le \bar{t}}\left| R_{\nu }(t)\right| \le \bar{M}_{\nu }(\bar{t})\), where

$$\begin{aligned} \bar{M}_{\nu }(\bar{t}) := \max \left\{ \frac{1 - (1 - \bar{t})^{\frac{4 - \nu }{\nu - 2}} - \left( \frac{4-\nu }{\nu - 2}\right) \bar{t}(1 - \bar{t})^{\frac{4 - \nu }{\nu - 2}}}{\left( \frac{4 - \nu }{\nu - 2}\right) \bar{t}^2(1 - \bar{t})^{\frac{4 - \nu }{\nu -2}}}, \left( \frac{3}{2} + \frac{\bar{t}}{2}\right) e^{\bar{t}} \right\} \in (0, +\,\infty ). \end{aligned}$$

Proof

By Corollary 2, if we define \(G(x,y) := \int _0^1 \left[ \nabla ^2{f}(x+ \tau (y-x)) - \nabla ^2{f}(x)\right] d\tau \), then

$$\begin{aligned} \left[ \underline{\kappa }_{\nu }(d_{\nu }(x,y)) - 1\right] \nabla ^2{f}(x) \preceq G(x,y) \preceq \left[ \overline{\kappa }_{\nu }(d_{\nu }(x,y)) - 1\right] \nabla ^2{f}(x). \end{aligned}$$
(57)

Since \(H(x,y) = \nabla ^2f(x)^{-1/2}G(x,y)\nabla ^2f(x)^{-1/2}\), the last inequality implies

$$\begin{aligned} \Vert H(x,y) \Vert \le \max \big \{1 - \underline{\kappa }_{\nu }(d_{\nu }(x,y)), \overline{\kappa }_{\nu }(d_{\nu }(x,y)) - 1\big \}. \end{aligned}$$

Let \(C_{\max }(t) := \max \big \{1 - \underline{\kappa }_{\nu }(t), \overline{\kappa }_{\nu }(t) - 1 \big \}\) for \(t \in [0, 1)\). We consider two cases.

(a) For \(\nu = 2\), since \(e^{-t} + e^{t} \ge 2\), we have \(\frac{1-e^{-t}}{t}+ \frac{e^{t}-1}{t} \ge 2\) which implies \(C_{\max }(t) = \overline{\kappa }_{\nu }(t) - 1 = \frac{e^{t}-1-t}{t}\). Hence, by Lemma 1, we have \(C_{\max }(t) \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\) which leads to \(R_{\nu }(t) := \left( \frac{3}{2} + \frac{t}{3}\right) e^t\).

(b) For \(\nu \in (2, 3]\), we have

$$\begin{aligned} \begin{array}{ll} C_{\max }(t) &{}= \max \left\{ 1 - \frac{(\nu - 2)}{\nu t}\left[ 1 - (1 - t)^{\frac{\nu }{\nu - 2}}\right] , \frac{(\nu - 2)}{(4 - \nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4 - \nu }{\nu -2}}} - 1\Big ] - 1\right\} \\ &{}= \frac{(\nu - 2)}{(4 - \nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] - 1. \end{array} \end{aligned}$$

Indeed, we show that \(\frac{(\nu - 2)}{(4 -\nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] + \frac{(\nu - 2)}{\nu t}\left[ 1 - (1 - t)^{\frac{\nu }{\nu - 2}}\right] \ge 2\). Let \(u := \frac{4-\nu }{\nu -2} > 0\) and \(v := \frac{\nu }{\nu -2} > 0\). The last inequality is equivalent to \(\frac{1}{u}\left[ \frac{1}{(1 - t)^u}-1\right] + \frac{1}{v}\left[ 1 - (1-t)^v\right] \ge 2t\) which can be reformulated as \(\frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t \ge 0\). Consider \(s(t) := \frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t\). It is clear that \(s'(t) = \frac{1}{(1-t)^{u+1}} + (1-t)^{v-1} - 2 = (1-t)^{-\frac{2}{\nu -2}} + (1-t)^{\frac{2}{\nu -2}} - 2 \ge 0\) for all \(t\in [0, 1)\). We obtain \(s(t) \ge s(0) = 0\). Hence, \(C_{\max }(t) = \frac{(\nu - 2)}{(4 - \nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] - 1\).

Let us define \(r := \frac{4-\nu }{\nu -2} = \frac{2}{\nu -2} - 1\). Then, it is clear that \(\nu = 2 + \frac{2}{1+r}\), and \(\nu \in (2, 3]\) is equivalent to \(r \ge 1\). Now, using Lemma 1 with \(r = \frac{2}{\nu -2} - 1 \ge 1\), we obtain \(R_{\nu }(t) := \frac{1 - (1-t)^{\frac{4-\nu }{\nu -2}} - \left( \frac{4-\nu }{\nu -2}\right) t(1-t)^{\frac{4-\nu }{\nu -2}}}{\left( \frac{4-\nu }{\nu -2}\right) t^2(1-t)^{\frac{4-\nu }{\nu -2}}}\). Putting (a) and (b) together, we obtain (55) with \(R_{\nu }\) defined by (56). The boundedness of \(R_{\nu }\) follows from Lemma 1. \(\square \)
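
In case (b) the quantity \(C_{\max }(t)\) coincides with \(t\,\psi _r(t) = R_{\nu }(t)\,t\) for \(r = \frac{4-\nu }{\nu -2}\); this identity is easy to confirm numerically. A minimal sketch (ours, illustrative only):

```python
# Illustrative numerical check (not part of the proof) of the identity used in
# case (b): for nu in (2, 3], C_max(t) = kappa_upper(t) - 1 equals R_nu(t) * t.
import numpy as np

def C_max(nu, t):
    # kappa_upper(t) - 1, the larger of the two deviations for nu in (2, 3]
    return (nu - 2.0) / ((4.0 - nu) * t) * ((1.0 - t)**(-(4.0 - nu) / (nu - 2.0)) - 1.0) - 1.0

def R(nu, t):
    # R_nu(t) from (56) for nu in (2, 3], i.e. psi_r(t) with r = (4-nu)/(nu-2)
    r = (4.0 - nu) / (nu - 2.0)
    return (1.0 - (1.0 - t)**r - r * t * (1.0 - t)**r) / (r * t**2 * (1.0 - t)**r)

t = np.linspace(1e-3, 0.95, 1000)
for nu in (2.2, 2.5, 2.8, 3.0):
    print(nu, bool(np.allclose(C_max(nu, t), R(nu, t) * t)))
```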

1.4 The proof of Theorem 4: solution existence and uniqueness

Consider a sublevel set \(\mathcal {L}_F(x):=\left\{ y\in \mathrm {dom}(F) \mid F(y)\le F(x)\right\} \) of F in (32). For any \(y\in \mathcal {L}_F(x)\) and \(v\in \partial g(x)\), by (22) and the convexity of g, we have

$$\begin{aligned} F(x)\ge F(y)\ge F(x)+\left\langle \nabla f(x)+v,y- x\right\rangle +\omega _{\nu }\left( -d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}^2. \end{aligned}$$

By the Cauchy-Schwarz inequality, we have

$$\begin{aligned} \omega _{\nu }\left( -d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}\le \left\| \nabla f(x)+v\right\| _{x}^{*}. \end{aligned}$$
(58)

Now, using the assumption \(\nabla ^2{f}(x)\succ 0\) for some \(x\in \mathrm {dom}(F)\), we have \(\sigma _{\min }(x) := \lambda _{\min }(\nabla ^2{f}(x)) > 0\), the smallest eigenvalue of \(\nabla ^2{f}(x)\).

  1. (a)

    If \(\nu = 2\), then \(d_2(x,y)=M_f\left\| y-x\right\| _2\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| y-x\right\| _{x}\). This estimate together with (58) imply

    $$\begin{aligned} \omega _2\left( -d_2(x, y)\right) d_2(x,y)\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| \nabla f(x)+v\right\| _{x}^{*} = \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x). \end{aligned}$$
    (59)

    We consider the function \(s_2(t) := \omega _2(-t)t = 1 - \frac{1-e^{-t}}{t}\). Clearly, \(s_2'(t) = \frac{e^t - t - 1}{t^2e^t} > 0\) for all \(t \in \mathbb {R}_{+}\). Hence, \(s_2(t)\) is increasing on \(\mathbb {R}_{+}\); moreover, \(s_2(t) < 1\) and \(\lim \limits _{t\rightarrow +\,\infty }s_2(t) = 1\). Therefore, if \(\frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) < 1\), then the equation \(s_2(t) - \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) = 0\) has a unique solution \(t^{*} \in (0, +\,\infty )\). In this case, (59) and the monotonicity of \(s_2\) imply \(0 \le d_2(x, y) \le t^{*}\). This leads to \(M_f\left\| y-x\right\| _2 \le t^{*} <+\,\infty \), which implies that the sublevel set \(\mathcal {L}_F(x)\) is bounded. Consequently, a solution \(x^{\star }\) of (32) exists.

  2. (b)

    If \(2< \nu < 3\), then

    $$\begin{aligned} d_{\nu }(x,y)\le \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| y-x\right\| _{x}. \end{aligned}$$

    This inequality together with (58) imply

    $$\begin{aligned} \omega _{\nu }\left( -d_{\nu }(x, y)\right) d_{\nu }(x,y)\le & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| \nabla f(x)+v\right\| _{x}^{*}\\= & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x). \end{aligned}$$

    We consider \(s_{\nu }(t) := \omega _{\nu }(-t)t\). After a few elementary calculations, we can easily check that \(s_{\nu }\) is increasing on \(\mathbb {R}_{+}\) and \(s_{\nu }(t) < \frac{\nu -2}{4-\nu }\) for all \(t > 0\), and \(\lim \limits _{t\rightarrow +\,\infty }s_{\nu }(t) = \frac{\nu -2}{4-\nu }\). Hence, if \(\left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x) < \frac{\nu -2}{4-\nu }\), then, similar to Case (a), we can show that solution \(x^{\star }\) of (32) exists. This condition implies that \(\lambda (x) < \frac{2\sigma _{\min }(x)^{\frac{3-\nu }{2}}}{(4-\nu )M_f}\).

  3. (c)

    If \(\nu = 3\), then \(d_3(x,y) = \frac{M_f}{2}\left\| y-x\right\| _{x}\). Combining this estimate and (58) we get

    $$\begin{aligned} \omega _3\left( -d_3(x, y)\right) d_3(x,y)\le \frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*}. \end{aligned}$$

    With the same proof as in [40, Theorem 4.1.11], if \(\frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*} < 1\) which is equivalent to \(\lambda (x) < \frac{2}{M_f}\), then solution \(x^{\star }\) of (32) exists.

Note that the condition on \(\lambda (x)\) in three cases (a), (b), and (c) can be unified. The uniqueness of the solution \(x^{\star }\) in these three cases follows from the strict convexity of F. \(\square \)

1.5 The proof of Theorem 2: convergence of the damped-step Newton method

The proof of this theorem is divided into two parts: computing the step-size, and proving the local quadratic convergence.

Computing the step-size \(\tau _k\): From Proposition 10, for any \(x^k,x^{k+1}\in \mathrm {dom}(f)\), if \(d_{\nu }(x^k,x^{k+1}) < 1\), then we have

$$\begin{aligned} f(x^{k+1}) \le f(x^k) + \langle \nabla {f}(x^k), x^{k+1}-x^k\rangle + \omega _{\nu }\left( d_{\nu }(x^k, x^{k+1})\right) \left\| x^{k+1} - x^k\right\| _{x^k}^2. \end{aligned}$$

Now, using (25), we have \(\langle \nabla {f}(x^k), x^{k+1}-x^k\rangle = -\tau _k\left( \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}\right) ^2 = -\tau _k\lambda _k^2\). On the other hand, we have

$$\begin{aligned} \begin{array}{ll} &{}\Vert x^{k+1} - x^k\Vert _{x^k}^2 \overset{\tiny (25)}{=} \tau _k^2\langle \nabla ^2f(x^k)^{-1}\nabla {f}(x^k),\nabla {f}(x^k)\rangle \overset{\tiny (27)}{=} \tau _k^2\lambda _k^2, \\ &{}\Vert x^{k+1} - x^k\Vert _2^2 \overset{\tiny (25)}{=} \tau _k^2\langle \nabla ^2f(x^k)^{-1}\nabla {f}(x^k), \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\rangle \overset{\tiny (27)}{=} \frac{\tau _k^2\beta _k^2}{M_f^2}.\\ \end{array} \end{aligned}$$

Using the definition of \(d_{\nu }(\cdot )\) in (12), the two last equalities, and (28), we can easily show that \(d_{\nu }(x^k, x^{k+1}) = \tau _kd_k\). Substituting these relations into the first estimate, we obtain

$$\begin{aligned} f(x^{k+1}) \le f(x^k) - \left( \tau _k\lambda _k^2 - \omega _{\nu }\left( \tau _kd_k\right) \tau _k^2\lambda _k^2\right) . \end{aligned}$$

We consider the following cases:

(a) If \(\nu = 2\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau - \left( \frac{\lambda _k}{d_k}\right) ^2\left( e^{\tau d_k} - \tau d_k - 1\right) \) with \(d_k = \beta _k\). This function attains the maximum at \(\tau _k := \frac{\ln (1 + d_k)}{d_k} = \frac{\ln (1 + \beta _k)}{\beta _k} \in (0, 1)\) with

$$\begin{aligned} \eta _k(\tau _k) = \left( \frac{\lambda _k}{d_k}\right) ^2\Big [ (1 + d_k)\ln (1 + d_k) - d_k\Big ] = \left( \frac{\lambda _k}{\beta _k}\right) ^2\Big [ (1 + \beta _k)\ln (1 + \beta _k) - \beta _k\Big ]. \end{aligned}$$

It is easy to check from the right-most term of the last expression that \(\varDelta _k := \eta _k(\tau _k) > 0\) for \(\tau _k > 0\).

(b) If \(\nu = 3\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau + \left( \frac{\lambda _k}{d_k}\right) ^2\left[ \tau d_k + \ln (1 - \tau d_k)\right] \) with \(d_k = 0.5M_f\lambda _k\). We can show that \(\eta _k(\tau )\) achieves the maximum at \(\tau _k = \frac{1}{1 + d_k} = \frac{1}{1 + 0.5M_f\lambda _k}\in (0,1)\) with

$$\begin{aligned} \eta _k(\tau _k) = \frac{\lambda _k^2}{1 + 0.5M_f\lambda _k}+\left( \frac{2}{M_f}\right) ^2\left[ \frac{0.5M_f\lambda _k}{1 + 0.5M_f\lambda _k}+\ln \left( 1-\frac{0.5M_f\lambda _k}{1 + 0.5M_f\lambda _k}\right) \right] . \end{aligned}$$

We can also easily check that \(\varDelta _k := \eta _k(\tau _k)\) is positive for \(\lambda _k > 0\).

(c) If \(2< \nu < 3\), then we have \(d_k=M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \lambda _k^{\nu -2}\beta _k^{3-\nu }\). By (23), we have

$$\begin{aligned} \eta _k(\tau )= & {} \left( \lambda _k^2+\frac{\lambda _k^2}{d_k}\frac{\nu -2}{4-\nu }\right) \tau -\left( \frac{\lambda _k}{d_k}\right) ^2\frac{(\nu -2)^2}{2(4-\nu )(3-\nu )}\left( (1 - \tau d_k)^{\frac{2(3-\nu )}{2-\nu }} - 1\right) . \end{aligned}$$

Our aim is to find \(\tau ^{*} \in (0, 1]\) by solving \(\max _{\tau \in [0, 1]}\eta _k(\tau )\). This problem always has a global solution. First, we compute the first- and the second-order derivatives of \(\eta _k\) as follows:

$$\begin{aligned} \eta _k'(\tau ) = \lambda _k^2\left[ 1 - \frac{1}{d_k}\frac{\nu -2}{\nu - 4}\left( 1-(1-\tau d_k)^{\frac{\nu -4}{\nu -2}}\right) \right] \text { and }\eta _k''(\tau )=-\lambda _k^2(1-\tau d_k)^{\frac{-2}{\nu -2}}. \end{aligned}$$

Let us set \(\eta _k'(\tau _k) = 0\). Then, we get

$$\begin{aligned} \tau _k = \frac{1}{d_k}\left[ 1-\left( 1+\frac{4-\nu }{\nu -2}d_k\right) ^{-\frac{\nu -2}{4-\nu }}\right] \in (0,1)~~~~{\text {(by the Bernoulli inequality)}}, \end{aligned}$$

with

$$\begin{aligned} \eta _k(\tau _k)= & {} \frac{\lambda _k^2}{d_k}\left[ 1-\frac{4-\nu }{2(3-\nu )}\left( 1+\frac{4-\nu }{\nu -2}d_k\right) ^{2-\nu }\right] \\&+\left( \frac{\lambda _k}{d_k}\right) ^2 \frac{\nu -2}{2(3-\nu )}\left[ 1-\left( 1+\frac{4-\nu }{\nu -2}d_k\right) ^{2-\nu }\right] . \end{aligned}$$

In addition, we can check that \(\eta _k''(\tau _k) < 0\). Hence, the value of \(\tau _k\) above achieves the maximum of \(\eta _k(\cdot )\). Then, we have \(\varDelta _k := \eta _k(\tau _k) > \eta _k(0)=0\).
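
For concreteness, the explicit step-sizes derived in cases (a)–(c) can be collected into a single routine. The following sketch is ours, not from the paper's code; the argument names lam, beta, and M_f are assumptions on how \(\lambda _k\), \(\beta _k\), and \(M_f\) are supplied.

```python
# Illustrative sketch (ours) of the explicit damped-step Newton step-size tau_k
# from cases (a)-(c) above, given
#   lam  = lambda_k = ||grad f(x^k)||_{x^k}^*,
#   beta = beta_k   = M_f * ||hess f(x^k)^{-1} grad f(x^k)||_2,
# and the generalized self-concordance parameters (M_f, nu).
import math

def damped_newton_step_size(lam, beta, M_f, nu):
    if nu == 2.0:
        d = beta
        return math.log(1.0 + d) / d
    if nu == 3.0:
        d = 0.5 * M_f * lam
        return 1.0 / (1.0 + d)
    if 2.0 < nu < 3.0:
        d = (nu / 2.0 - 1.0) * M_f**(nu - 2.0) * lam**(nu - 2.0) * beta**(3.0 - nu)
        a = (4.0 - nu) / (nu - 2.0)
        return (1.0 - (1.0 + a * d)**(-1.0 / a)) / d
    raise ValueError("the explicit step-size above is derived for nu in [2, 3]")

# Example: a step-size in (0, 1) for each regime.
for nu in (2.0, 2.5, 3.0):
    print(nu, damped_newton_step_size(lam=0.8, beta=0.6, M_f=1.0, nu=nu))
```

In each regime the returned value lies in \((0, 1)\), matching the analysis above.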

The proof of local quadratic convergence: Let \(x^{\star }_f\) be the optimal solution of (24). We have

$$\begin{aligned} \begin{array}{ll} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k} &{}= \Vert x^k - \tau _k\nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k) - x^{\star }_f\Vert _{x^k} \\ &{}\le (1-\tau _k)\Vert x^k - x^{\star }_f\Vert _{x^k} + \tau _k\Vert x^k - x^{\star }_f - \nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k)\Vert _{x^k}. \end{array} \end{aligned}$$

Hence, we can estimate

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k}\le & {} (1 - \tau _k)\Vert x^k - x^{\star }_f\Vert _{x^k} + \tau _k\Vert \nabla ^2{f}(x^k)^{-1}\nonumber \\&\times \left[ \nabla {f}(x^{\star }_f) - \nabla {f}(x^k) - \nabla ^2{f}(x^k)(x^{\star }_f - x^k)\right] \Vert _{x^k}. \end{aligned}$$
(60)

Let us define \(T_k := \Big \Vert \nabla ^2{f}(x^k)^{-1}\left[ \nabla {f}(x^{\star }_f) - \nabla {f}(x^k) - \nabla ^2{f}(x^k)(x^{\star }_f - x^k)\right] \Big \Vert _{x^k}\) and consider three cases as follows:

\(\mathrm {(a)}\)  For \(\nu = 2\), using Corollary 2, we have \(\left( \frac{1-e^{-\bar{\beta }_k}}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \left( \frac{e^{\bar{\beta }_k}-1}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k)\), where \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\). Using the above inequality, we can show that

$$\begin{aligned} T_k\le & {} \max \left\{ 1 - \frac{1-e^{- \bar{\beta }_k}}{\bar{\beta }_k}, \frac{e^{\bar{\beta }_k}-1}{\bar{\beta }_k}-1\right\} \Vert x^k - x^{\star }_f\Vert _{x^k} \\= & {} \left( \frac{e^{\bar{\beta }_k} - 1 - \bar{\beta }_k}{\bar{\beta }_k^2}\right) \bar{\beta }_k\Vert x^k - x^{\star }_f\Vert _{x^k}. \end{aligned}$$

Let \(\underline{\sigma }_k := \lambda _{\min }(\nabla ^2{f}(x^k))\). We first derive

$$\begin{aligned} \begin{array}{ll} \Vert \nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k)\Vert _2 &{}= \Vert \nabla ^2{f}(x^k)^{-1}(\nabla {f}(x^k) - \nabla {f}(x^{\star }_f))\Vert _2 \\ &{}= \Vert \int _0^1\nabla ^2{f}(x^k)^{-1}\nabla ^2{f}(x^k + t(x^{\star }_f - x^k))(x^k - x^{\star }_f) dt\Vert _2 \\ &{}= \Vert \nabla ^2{f}(x^k)^{-1/2}K(x^k,x^{\star }_f)\nabla ^2{f}(x^k)^{1/2}(x^k - x^{\star }_f)\Vert _2 \\ &{}\le \frac{1}{\sqrt{\underline{\sigma }_k}}\Vert K(x^k,x^{\star }_f)\Vert \Vert x^k - x^{\star }_f\Vert _{x^k}. \end{array} \end{aligned}$$

where \(K(x^k,x^{\star }_f) :=\int _0^1 \nabla ^2{f}(x^k)^{-1/2}\nabla ^2{f}(x^k + t(x^{\star }_f - x^k))\nabla ^2{f}(x^k)^{-1/2}dt\). Using Corollary 2 and noting that \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\), we can estimate \(\Vert K(x^k,x^{\star }_f)\Vert \le \frac{e^{\bar{\beta }_k} - 1}{\bar{\beta }_k}\). Using the two last estimates, and the definition of \(\beta _k\), we can derive

$$\begin{aligned} \begin{array}{ll} \beta _k&= M_f\Vert \nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k)\Vert _2 \le \frac{M_f(e^{\bar{\beta }_k} - 1)}{\bar{\beta }_k\sqrt{\underline{\sigma }_k}}\Vert x^k - x^{\star }_f\Vert _{x^k} \le M_fe\frac{\Vert x^k - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}, \end{array} \end{aligned}$$

provided that \(\bar{\beta }_k \le 1\). Since the step-size is \(\tau _k = \frac{1}{\beta _k}\ln (1+\beta _k)\), we have \(1 - \tau _k \le \frac{\beta _k}{2} \le \frac{M_fe\Vert x^k - x^{\star }_f\Vert _{x^k}}{2\sqrt{\underline{\sigma }_k}}\). On the other hand, \(\frac{e^{\bar{\beta }_k}-1 - \bar{\beta }_k}{\bar{\beta }_k^2} \le \frac{e}{2}\) for all \(0\le \bar{\beta }_k \le 1\). Substituting \(T_k\) into (60) and using these relations, we have

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k} \le \tfrac{e}{2}\bar{\beta }_k\Vert x^k - x^{\star }_f\Vert _{x^k} + \tfrac{M_fe}{2}\tfrac{\Vert x^k - x^{\star }_f\Vert _{x^k}^2}{\sqrt{\underline{\sigma }_k}}, \end{aligned}$$

provided that \(\bar{\beta }_k \le 1\). On the other hand, by Proposition 8, we have \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le e^{\frac{\bar{\beta }_{k+1} + \bar{\beta }_k}{2}}\Vert x^{k+1} - x^{\star }_f\Vert _{x^k}\) and \(\underline{\sigma }_{k+1}^{-1} \le e^{\bar{\beta }_k + \bar{\beta }_{k+1}}\underline{\sigma }_k^{-1}\). In addition, \(\bar{\beta }_k \le \frac{M_f}{\sqrt{\underline{\sigma }_k}}\Vert x^{k} - x^{\star }_f\Vert _{x^k}\). Combining the above inequalities, we finally get

$$\begin{aligned} \frac{\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}}}{\sqrt{\underline{\sigma }_{k+1}}} \le M_fe^{1+\bar{\beta }_{k+1} + \bar{\beta }_k}\left( \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}} \right) ^2. \end{aligned}$$

Provided that \(\bar{\beta }_k\le 1\) and \(\bar{\beta }_{k+1} \le 1\), this estimate shows that \(\left\{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\right\} \) quadratically converges to zero. Since \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\), we can also conclude that \(\left\{ \Vert x^k - x^{\star }_f\Vert _2\right\} \) quadratically converges to zero.

\(\mathrm {(b)}\)  For \(\nu = 3\), we can follow [40]. However, for completeness, we give a short proof here. Using Corollary 2, we have \(\left( 1 - r_k + \frac{r_k^2}{3}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \frac{1}{1-r_k}\nabla ^2{f}(x^k)\), where \(r_k := 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Using the above inequality, we can show that

$$\begin{aligned} T_k \le \max \left\{ r_k - \frac{r_k^2}{3}, \frac{r_k}{1 - r_k}\right\} \Vert x^k - x^{\star }_f\Vert _{x^k} = \frac{0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2}{1 - 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k}}. \end{aligned}$$

Substituting \(T_k\) into (60) and using \(\tau _k = \frac{1}{1 + 0.5M_f\lambda _k}\), we have

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k}\le & {} \frac{0.5M_f\lambda _k}{1+0.5M_f\lambda _k}\Vert x^k - x^{\star }_f\Vert _{x^k} \nonumber \\&+ \frac{1}{1 + 0.5M_f\lambda _k}\left( \frac{0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2}{1 - 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k}}\right) . \end{aligned}$$

Next, we need to upper bound \(\lambda _k\). Since \(\nabla {f}(x^{\star }_f) = 0\), using Corollary 2, we can bound \(\lambda _k\) as

$$\begin{aligned} \begin{array}{ll} \lambda _k &{}= \Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = \Vert \nabla ^2{f}(x^k)^{-1/2}(\nabla {f}(x^k) - \nabla {f}(x^{\star }_f))\Vert _2 \\ &{}= \Vert \int _0^1\nabla ^2{f}(x^k)^{-1/2}\nabla ^2f(x^k + t(x^{\star }_f - x^k))(x^{\star }_f - x^k)dt\Vert _2 \\ &{}\le \Vert x^k - x^{\star }_f\Vert _{x^k}\Vert \int _0^1\nabla ^2{f}(x^k)^{-1/2}\nabla ^2f(x^k + t(x^{\star }_f - x^k))\nabla ^2{f}(x^k)^{-1/2}dt\Vert _2 \\ &{}\overset{\tiny \text {Corollary~2}}{\le } \frac{\Vert x^k - x^{\star }_f\Vert _{x^k} }{1 - 0.5M_f \Vert x^k - x^{\star }_f\Vert _{x^k} } \le 2\Vert x^k - x^{\star }_f\Vert _{x^k}, \end{array} \end{aligned}$$

provided that \(M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Substituting this bound into the previous estimate, we get

$$\begin{aligned} \begin{array}{ll} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k} &{} \le 0.5M_f\lambda _k\Vert x^k-x_f^{\star }\Vert _{x^k} + \frac{0.5M_f\Vert x^k-x_f^{\star }\Vert _{x^k}^2}{1-0.5M_f\Vert x^k-x_f^{\star }\Vert _{x^k}}\\ &{} \le M_f\Vert x^k-x_f^{\star }\Vert _{x^k}^2+M_f\Vert x^k-x_f^{\star }\Vert _{x^k}^2=2M_f\Vert x^k-x_f^{\star }\Vert _{x^k}^2, \end{array} \end{aligned}$$

provided that \(M_f\Vert x^k-x_f^{\star }\Vert _{x^k} < 1\). On the other hand, we can also estimate \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le \frac{\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}}}{1 - 0.5M_f\left( \Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}} + \Vert x^k - x^{\star }_f\Vert _{x^k}\right) }\). Combining the last two inequalities, we get

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le \frac{2M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2}{ 1 - M_f^2\Vert x^k - x^{\star }_f\Vert _{x^k}^2 - 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k}}. \end{aligned}$$

The right-hand side function \(\psi (t) := \frac{2M_f}{1 - M_f^2t^2 - 0.5M_ft}\) satisfies \(\psi (t) \le 4M_f\) on \(t \in \left[ 0, \frac{1}{2M_f} \right] \). Hence, if \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{1}{2M_f}\), then \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le 4M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2\). This shows that if \(x^0\in \mathrm {dom}(f)\) is chosen such that \(\Vert x^0 - x^{\star }_f\Vert _{x^0} \le \frac{1}{4M_f}\), then \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^k}\right\} \) quadratically converges to zero.

\(\mathrm {(c)}\)  For \(\nu \in (2, 3)\), with the same argument as in the proof of Theorem 3, we can show that

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^k} \le R_{\nu }(d^k_{\nu })d_{\nu }^k\Vert x^k - x^{\star }_f\Vert _{x^k}, \end{aligned}$$

where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := M_f\left( \frac{\nu }{2} - 1\right) \Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^k}^{\nu -2}\). Using the same argument as in the proof of Theorem 3, we have

$$\begin{aligned} \frac{\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}}}{\underline{\sigma }_{k+1}^{\frac{3-\nu }{2}}} \le C_{\nu }(d^k_{\nu },\Vert x^k - x^{\star }_f\Vert _{x^k})\left( \frac{\Vert x^k - x^{\star }_f\Vert _{x^k}}{ \underline{\sigma }_k^{\frac{3-\nu }{2}} }\right) ^2. \end{aligned}$$

Here, \(C_{\nu }(\cdot ,\cdot )\) is a given function deriving from \(R_{\nu }\). Under the condition that \(d^k_{\nu }\) and \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) are sufficiently small, we can show that \(C_{\nu }(d^k_{\nu },\Vert x^k - x^{\star }_f\Vert _{x^k}) \le \bar{C}_{\nu }\). Hence, the last inequality shows that \(\Big \{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} } \Big \}\) quadratically converges to zero. Since \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k}\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\), we have \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} }\). Hence, we can conclude that \(\left\{ \Vert x^k -x^{\star }_f\Vert _{H_k}\right\} \) also locally converges to zero at a quadratic rate. \(\square \)

1.6 The proof of Theorem 3: the convergence of the full-step Newton method

We divide this proof into two parts: the quadratic convergence of \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\), and the quadratic convergence of \(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\).

The quadratic convergence of \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\): Since the full-step Newton scheme updates \(x^{k+1} := x^k - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), if we denote \(n_{\mathrm {nt}}^k := x^{k+1} -x^k = - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), then the last expression leads to \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\). In addition, \(\Vert n_{\mathrm {nt}}^k\Vert _{x^k} = \Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = \lambda _k\). Using the definition of \(d_{\nu }(\cdot ,\cdot )\) in (12), we denote \(d^k_{\nu } := d_{\nu }(x^k, x^{k+1})\).

First, by \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\) and the mean-value theorem, we can show that

$$\begin{aligned} \nabla {f}(x^{k+1})= & {} \nabla {f}(x^{k+1}) - \nabla {f}(x^k) - \nabla ^2f(x^k)n_{\mathrm {nt}}^k\\= & {} \int _0^1\left[ \nabla ^2{f}(x^k + tn_{\mathrm {nt}}^k) - \nabla ^2{f}(x^k)\right] n_{\mathrm {nt}}^kdt. \end{aligned}$$

Let us define \(G_k := \int _0^1\left[ \nabla ^2{f}(x^k + tn_{\mathrm {nt}}^k) - \nabla ^2{f}(x^k)\right] dt\) and \(H_k := \nabla ^2{f}(x^k)^{-1/2}G_k\nabla ^2{f}(x^k)^{-1/2}\). Then, the above estimate implies \( \nabla {f}(x^{k+1}) = G_kn_{\mathrm {nt}}^k\). Hence, we can show that

$$\begin{aligned} \left[ \Vert \nabla {f}(x^{k+1})\Vert _{x^k}^{*}\right] ^2&= \langle \nabla ^2{f}(x^k)^{-1}G_kn_{\mathrm {nt}}^k, G_kn_{\mathrm {nt}}^k\rangle \\&= \langle H_k\nabla ^2{f}(x^k)^{1/2}n_{\mathrm {nt}}^k, H_k\nabla ^2{f}(x^k)^{1/2}n_{\mathrm {nt}}^k\rangle \nonumber \\&\le \Vert H_k\Vert ^2\Vert n_{\mathrm {nt}}^k \Vert _{x^k}^2 = \Vert H_k\Vert ^2\lambda _k^2. \end{aligned}$$

By Lemma 2, we can estimate

$$\begin{aligned} \Vert H_k\Vert&\le R_{\nu }( d_{\nu }^k )d_{\nu }^k, \end{aligned}$$

where \(R_{\nu }\) is defined by (56). Combining the two last inequalities and using Proposition 8, we consider the following cases:

(a) If \(\nu = 2\), then we have \(\lambda _{k+1}^2 \le e^{d_2^k}\left[ \left\| \nabla {f}(x^{k+1})\right\| _{x^k}^{*}\right] ^2\) which implies \(\lambda _{k+1} \le e^{\frac{d_2^k}{2}}R_2(d_2^k)d_2^k\lambda _k\). Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\frac{1}{\underline{\sigma }_{k+1}}\le \frac{e^{d_2^k}}{\underline{\sigma }_k}\). Based on the above inequality, we have

$$\begin{aligned} \frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}}\le M_f R_2(d_2^k)e^{d_2^k}\left( \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right) ^2. \end{aligned}$$

By a numerical calculation, we can easily check that if \(d_2^k < d_2^{\star }\approx 0.12964\), then

$$\begin{aligned} \frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}}\le 2M_f\left( \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right) ^2. \end{aligned}$$

Consequently, if \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove

$$\begin{aligned} d_2^{k+1} \le d_2^{k}\text { and }\frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}} \le \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}, \end{aligned}$$

by induction. Under the condition \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{d_2^{\star }}{M_f}\), the above inequality shows that the ratio \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) converges to zero at a quadratic rate.

Now, if \(\nu > 2\), then we consider different cases. Note that

$$\begin{aligned} \lambda _{k+1}^2 \le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\left[ \left\| \nabla {f}(x^{k+1})\right\| _{x^k}^{*}\right] ^2, \end{aligned}$$

which implies that

$$\begin{aligned} \lambda _{k+1}\le (1-d_{\nu }^k)^{\frac{-1}{\nu -2}}R_{\nu }(d_{\nu }^k)d_{\nu }^k\lambda _k. \end{aligned}$$
(61)

Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| n_{\mathrm {nt}}^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\) and \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\). Based on these relations and (61), we can argue as follows:

\(\mathrm {(b)}\) If \(2< \nu < 3\), then \(\lambda _k \ge \left\| n_{\mathrm {nt}}^k\right\| _2\sqrt{\underline{\sigma }_k}\), which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence,

$$\begin{aligned} \frac{\lambda _{k+1}}{\underline{\sigma }_{k+1}^{\frac{3-\nu }{2}}}\le (1-d_{\nu }^k)^{-\frac{4-\nu }{\nu -2}}R_{\nu }(d_{\nu }^k)\left( \frac{\nu }{2}-1\right) M_f\left( \frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\right) ^2. \end{aligned}$$

If \(d_{\nu }^k < d_{\nu }^{\star }\), where \(d_{\nu }^{\star }\) is the unique solution to the equation

$$\begin{aligned} \left( \frac{\nu }{2}-1\right) \frac{R_{\nu }(d_{\nu }^{\star })}{(1-d_{\nu }^{\star })^{\frac{4-\nu }{\nu -2}}}= 2, \end{aligned}$$

then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\). Note that it is straightforward to check that this equation always admits a positive solution. Hence, if we choose \(x^0\in \mathrm {dom}(f)\) such that \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 < \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:

$$\begin{aligned} d_{\nu }^{k+1} \le d_{\nu }^{k}\text { and }\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1} \le \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k. \end{aligned}$$

In addition, the above inequality also shows that \(\left\{ \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\right\} \) quadratically converges to zero.
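
For \(2< \nu < 3\), the threshold \(d_{\nu }^{\star }\) can be computed in the same way: by Lemma 1, the left-hand side of the defining equation increases from \(\tfrac{1}{2}\) at \(t = 0^{+}\) to \(+\infty \) as \(t\rightarrow 1^{-}\), so a simple bisection finds the unique positive root. A minimal sketch (ours, illustrative only):

```python
# Sketch (ours, illustrative only): for 2 < nu < 3, d_nu^* is the positive root
# of (nu/2 - 1) * R_nu(t) / (1 - t)^{(4-nu)/(nu-2)} = 2, with R_nu from (56).
# The left-hand side increases from 1/2 to +infinity on (0, 1), so bisection
# on a bracketing interval finds the unique root.
import math

def R(nu, t):
    r = (4.0 - nu) / (nu - 2.0)
    return (1.0 - (1.0 - t)**r - r * t * (1.0 - t)**r) / (r * t**2 * (1.0 - t)**r)

def threshold(nu, iters=80):
    h = lambda t: (nu / 2.0 - 1.0) * R(nu, t) / (1.0 - t)**((4.0 - nu) / (nu - 2.0)) - 2.0
    lo, hi = 1e-3, 0.999   # h(lo) ~ 1/2 - 2 < 0 and h(hi) > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

for nu in (2.2, 2.5, 2.8):
    print(nu, threshold(nu))
```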

\(\mathrm {(c)}\) If \(\nu = 3\), then \(d_3^k= \frac{M_f}{2}\lambda _k\), and

$$\begin{aligned} \lambda _{k+1}\le (1-d_3^k)^{-1}R_3(d_3^k)d_3^k\lambda _k=M_f\frac{R_3(d_3^k)}{2(1-d_3^k)}\lambda _k^2. \end{aligned}$$

Directly checking the right-hand side of the above estimate, one can show that if \(d_3^k < d_3^{\star }=0.5\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{1}{2M_f}\), then we can prove the following two inequalities together by induction:

$$\begin{aligned} d_3^{k+1} \le d_3^k\text { and }\lambda _{k+1} \le \lambda _k. \end{aligned}$$

Moreover, combined with the estimate \(\lambda _{k+1}\le 2M_f\lambda _k^2\), these inequalities show that \(\left\{ \lambda _k\right\} \) converges to zero at a quadratic rate.
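As a simple one-dimensional illustration (not part of the paper's development): the function \(f(x) = x - \ln x\) on \(x > 0\) satisfies \(|f'''(x)| = 2 f''(x)^{3/2}\), so it is generalized self-concordant with \(\nu = 3\) and \(M_f = 2\), and its minimizer is \(x^{\star }_f = 1\). For the full Newton step one computes

$$\begin{aligned} \lambda _k = \frac{|f'(x^k)|}{\sqrt{f''(x^k)}} = |x^k - 1|, \qquad x^{k+1} = x^k - \frac{f'(x^k)}{f''(x^k)} = 2x^k - (x^k)^2, \end{aligned}$$

so that \(\lambda _{k+1} = (x^k-1)^2 = \lambda _k^2 \le 2M_f\lambda _k^2\), in agreement with the bound above once \(\lambda _0 < \frac{1}{2M_f} = \frac{1}{4}\).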

The quadratic convergence of \(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\): First, using Proposition 9 with \(x:= x^k\) and \(y= x^{\star }_f\), and noting that \(\nabla {f}(x^{\star }_f) = 0\), we have

$$\begin{aligned} \bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f))\Vert x^k - x^{\star }_f\Vert _{x^k}^2 \le \langle \nabla {f}(x^k), x^k - x^{\star }_f\rangle \le \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}\Vert x^k - x^{\star }_f\Vert _{x^k}, \end{aligned}$$

where the last inequality follows from the Cauchy-Schwarz inequality. Hence, we obtain

$$\begin{aligned} \bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f))\Vert x^k - x^{\star }_f\Vert _{x^k} \le \Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = \lambda _k. \end{aligned}$$
(62)

We consider three cases:

(1) When \(\nu = 2\), we have \(\bar{\omega }_{\nu }(\tau ) = \frac{e^\tau -1}{\tau }\). Hence, \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1 - e^{-d_{\nu }(x^k, x^{\star }_f)}}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{d_{\nu }(x^k, x^{\star }_f)}{2} \ge \frac{1}{2}\) whenever \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Using this inequality in (62), we have \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = 2\lambda _k\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). On the other hand, by the definition of \(\underline{\sigma }_k\), we have \(\sqrt{\underline{\sigma }_k}\Vert x^k - x^{\star }_f\Vert _2 \le \Vert x^k - x^{\star }_f\Vert _{x^k}\). Combining the last two inequalities, we obtain \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{2\lambda _k}{\sqrt{\underline{\sigma }_k}}\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Since \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) locally converges to zero at a quadratic rate, the last relation shows that \(\big \{\Vert x^k - x^{\star }_f\Vert _2\big \}\) also locally converges to zero at a quadratic rate.

(2) For \(\nu = 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1}{1 + d_{\nu }(x^k, x^{\star }_f)}\) and \(d_{\nu }(x^k, x^{\star }_f) = \frac{M_f}{2}\Vert x^k - x^{\star }_f\Vert _{x^k}\). Hence, from (62), we obtain \(\frac{\Vert x^k - x^{\star }_f\Vert _{x^k} }{1 + 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} } \le \lambda _k\). This implies \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{\lambda _k}{1 - 0.5M_f\lambda _k}\) as long as \(0.5M_f\lambda _k < 1\). Clearly, since \(\lambda _k\) locally converges to zero at a quadratic rate, \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) also locally converges to zero at a quadratic rate.

(3) For \(2< \nu < 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \left( \frac{\nu -2}{\nu -4}\right) \frac{\left( 1 + d_{\nu }(x^k, x^{\star }_f) \right) ^{\frac{\nu -4}{\nu -2}} - 1}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{1}{\nu -2}d_{\nu }(x^k, x^{\star }_f) \ge \frac{1}{2}\) provided that \(d_{\nu }(x^k, x^{\star }_f) < \frac{\nu }{2}-1\). Similar to the case \(\nu = 2\), we have \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\lambda _k\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\). Hence, \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{2\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\). Since \(\big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\big \}\) locally converges to zero at a quadratic rate, \(\big \{\Vert x^k -x^{\star }_f\Vert _{H_k} \big \}\) also locally converges to zero at a quadratic rate. \(\square \)

1.7 The proof of Theorem 5: convergence of the damped-step PN method

Given \(H\in \mathcal {S}^p_{++}\) and a proper, closed, and convex function \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\), we define

$$\begin{aligned} \mathcal {P}_{H}^g(u):=(H+\partial g)^{-1}(u) = \mathrm {arg}\min _{x}\left\{ g(x) + \tfrac{1}{2}\left\langle Hx,x\right\rangle -\left\langle u,x\right\rangle \right\} . \end{aligned}$$

If \(H= \nabla ^2{f}(x)\) is the Hessian mapping of a strictly convex function \(f\), then we also write \(\mathcal {P}^g_{\nabla ^2 f(x)}(u)\) simply as \(\mathcal {P}^g_{x}(u)\) for notational convenience. The following lemma, whose proof can be found in [62], will be used in the sequel.

Lemma 3

Let \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\) be a proper, closed, and convex function, and \(H\in \mathcal {S}^p_{++}\). Then, the mapping \(\mathcal {P}_{H}^g\) defined above is non-expansive with respect to the weighted norm defined by \(H\), i.e., for any \(u,v\in \mathbb {R}^p\), we have

$$\begin{aligned} \left\| \mathcal {P}^g_{H}(u)-\mathcal {P}^g_{H}(v)\right\| _{H} \le \left\| u-v\right\| ^{*}_{H}. \end{aligned}$$
(63)
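As an illustration only (not the authors' implementation), the operator \(\mathcal {P}^g_{H}\) and the bound (63) can be checked numerically. The sketch below assumes \(g = \Vert \cdot \Vert _1\) and evaluates \(\mathcal {P}^g_{H}\) by a plain proximal-gradient inner loop; the solver, the random test data, and the iteration count are choices made purely for the example.

```python
import numpy as np

def prox_l1(x, t):
    # soft-thresholding: proximal operator of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def scaled_prox(H, u, n_iter=5000):
    # P_H^g(u) = argmin_x { ||x||_1 + 0.5*<Hx, x> - <u, x> },
    # computed here by proximal gradient on the quadratic part.
    L = np.linalg.norm(H, 2)                 # Lipschitz constant of x -> Hx - u
    x = np.zeros_like(u)
    for _ in range(n_iter):
        x = prox_l1(x - (H @ x - u) / L, 1.0 / L)
    return x

rng = np.random.default_rng(0)
p = 20
A = rng.standard_normal((p, p))
H = A @ A.T + np.eye(p)                      # H in S^p_++
u, v = rng.standard_normal(p), rng.standard_normal(p)

Pu, Pv = scaled_prox(H, u), scaled_prox(H, v)
lhs = np.sqrt((Pu - Pv) @ H @ (Pu - Pv))               # ||P(u) - P(v)||_H
rhs = np.sqrt((u - v) @ np.linalg.solve(H, u - v))     # ||u - v||_H^*
print(lhs, "<=", rhs)                        # (63) holds up to the inner-loop accuracy
```

Any other solver for this strongly convex subproblem could of course be substituted for the inner loop.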

Let us define

$$\begin{aligned} S_{x}(u):=\nabla ^2 f(x)u-\nabla f(u)~~~\text {and}~~~e_{x}(u,v):=[\nabla ^2 f(x)-\nabla ^2 f(u)](v-u), \end{aligned}$$
(64)

for any vectors \(x,u\in \mathrm {dom}(f)\) and \(v\in \mathbb {R}^p\). We now prove Theorem 5 in the main text.

The proof of Theorem 5

Computing the step-size \(\tau _k\): Since \(z^k\) satisfies the optimality condition (36), we have

$$\begin{aligned} -\nabla f(x^k) - \nabla ^2 f(x^k)n_{\mathrm {pnt}}^k \in \partial {g}(z^k). \end{aligned}$$

Using Proposition 10 we obtain

$$\begin{aligned} f(x^{k+1}) \le f(x^k) + \tau _k\left\langle \nabla f(x^k),n_{\mathrm {pnt}}^k\right\rangle + \omega _{\nu }(\tau _kd_k)\tau _k^2\lambda _k^2. \end{aligned}$$

Since \(x^{k+1}=(1-\tau _k)x^k+\tau _kz^k\), combining this optimality condition with the convexity of \(g\) yields

$$\begin{aligned} g(x^{k+1})\le g(x^k)-\tau _k\left\langle \nabla f(x^k)+\nabla ^2 f(x^k)n_{\mathrm {pnt}}^k, n_{\mathrm {pnt}}^k\right\rangle . \end{aligned}$$

Summing up the last two inequalities, we obtain the following estimate

$$\begin{aligned} F(x^{k+1}) \le F(x^k) - \eta _k(\tau _k). \end{aligned}$$
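For the reader's convenience, the summation step can be written out explicitly. Using \(\lambda _k = \Vert n_{\mathrm {pnt}}^k\Vert _{x^k}\), so that \(\lambda _k^2 = \langle \nabla ^2 f(x^k)n_{\mathrm {pnt}}^k,n_{\mathrm {pnt}}^k\rangle \), the sum of the two bounds above reads

$$\begin{aligned} F(x^{k+1}) \le F(x^k) - \tau _k\left\langle \nabla ^2 f(x^k)n_{\mathrm {pnt}}^k,n_{\mathrm {pnt}}^k\right\rangle + \omega _{\nu }(\tau _kd_k)\tau _k^2\lambda _k^2 = F(x^k) - \left[ \tau _k - \omega _{\nu }(\tau _kd_k)\tau _k^2\right] \lambda _k^2, \end{aligned}$$

which is one way to read off the decrease \(\eta _k(\tau _k)\); the precise definition of \(\eta _k\) is the one given in the main text.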

With the same argument as in the proof of Theorem 2, we obtain the conclusion of Theorem 5.

The proof of local quadratic convergence: We consider the distance between \(x^{k+1}\) and \(x^{\star }\) measured by \(\Vert x^{k+1}-x^{\star }\Vert _{x^{\star }}\). By the definition of \(x^{k+1}\), we have

$$\begin{aligned} \Vert x^{k+1} - x^{\star }\Vert _{x^{\star }}\le (1-\tau _k)\Vert x^k-x^{\star }\Vert _{x^{\star }}+\tau _k\Vert z^k-x^{\star }\Vert _{x^{\star }}. \end{aligned}$$
(65)

Using the notation in (64), it follows from the optimality conditions (33) and (36) that \(z^k = \mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^k)+e_{x^{\star }}(x^k,z^k))\) and \(x^{\star }=\mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^{\star }))\). By Lemma 3 and the triangle inequality, we can show that

$$\begin{aligned} \Vert z^k-x^{\star }\Vert _{x^{\star }}\le \Vert S_{x^{\star }}(x^k) - S_{x^{\star }}(x^{\star })\Vert ^{*}_{x^{\star }} + \Vert e_{x^{\star }}(x^k,z^k)\Vert ^{*}_{x^{\star }}. \end{aligned}$$
(66)

Following the same argument as in [62] and applying Lemma 2, we can derive

$$\begin{aligned} \Vert S_{x^{\star }}(x^k) - S_{x^{\star }}(x^{\star }) \Vert ^{*}_{x^{\star }} \le R_{\nu }(d_{\nu }(x^{\star },x^k))d_{\nu }(x^{\star },x^k)\Vert x^k-x^{\star }\Vert _{x^{\star }}, \end{aligned}$$
(67)

where \(R_{\nu }(\cdot )\) is defined by (56).

Next, using the same argument as the proof of (72) in Theorem 6 below, we can bound the second term \(\Vert e_{x^{\star }}(x^k,z^k) \Vert ^{*}_{x^{\star }}\) of (66) as

$$\begin{aligned} \Vert e_{x^{\star }}(x^k,z^k) \Vert _{x^{\star }}^{*} \le {\left\{ \begin{array}{ll} \big [(1-d_{\nu }(x^{\star },x^k))^{\frac{-2}{\nu -2}}-1 \big ] \Vert z^k - x^k \Vert _{x^{\star }}, ~~~&{}\quad \text {if}\, \nu > 2 \\ \big (e^{d_{\nu }(x^{\star },x^k)} - 1 \big ) \Vert z^k - x^k \Vert _{x^{\star }} &{}\quad \text {if}\,\, \nu = 2. \end{array}\right. } \end{aligned}$$

Combining this inequality, (66), (67), and the triangle inequality, we obtain

$$\begin{aligned} \left\{ \begin{array}{l} \Vert z^k - x^k \Vert _{x^{\star }} \le \hat{R}_{\nu }( d_{\nu }(x^{\star },x^k)) \Vert x^k-x^{\star } \Vert _{x^{\star }},\\ \Vert z^k - x^{\star } \Vert _{x^{\star }} \le \tilde{R}_{\nu }( d_{\nu }(x^{\star },x^k)) d_{\nu }(x^{\star },x^k) \Vert x^k-x^{\star } \Vert _{x^{\star }}, \end{array}\right. \end{aligned}$$
(68)

where \(\hat{R}_{\nu }\) and \(\tilde{R}_{\nu }\) are defined as

$$\begin{aligned} \hat{R}_{\nu }( t ) := {\left\{ \begin{array}{ll} \frac{tR_{\nu }(t)+1}{2-(1-t)^{\frac{-2}{\nu -2}}}, ~~~&{}\text {if }\,\,\nu> 2 \\ \frac{tR_{\nu }(t)+1}{2-e^{t}} &{}\text {if}\,\, \nu = 2 \end{array}\right. } \quad \text {and} \quad \tilde{R}_{\nu }(t) := {\left\{ \begin{array}{ll} \frac{tR_{\nu }(t)+(1-t)^{\frac{-2}{\nu -2}}-1}{t\left( 2-(1-t)^{\frac{-2}{\nu -2}}\right) }, ~~~&{}\text {if }\,\,\nu > 2 \\ \frac{tR_{\nu }(t)+e^t-1}{t(2-e^t)}&{}\text {if}\,\, \nu = 2,\end{array}\right. } \end{aligned}$$

respectively. After a few simple calculations, one can show that there exists a constant \(c_{\nu } \in (0, +\infty )\) such that \(\hat{R}_{\nu }(t)\in [0,c_{\nu }]\) and \(\tilde{R}_{\nu }(t)\in [0,c_{\nu }]\) for all \(t\in [0,\bar{d}_{\nu }]\) (at \(t = 0\) we take the limit as \(t \rightarrow 0^{+}\)), where \(\bar{d}_2:=\frac{3}{5}\) and \(\bar{d}_{\nu }:= 1-\left( \frac{2}{3}\right) ^{\frac{\nu -2}{2}}\) for \(\nu > 2\). Using this bound, (65), (68), and the fact that \(\tau _k \le 1\), we can bound

$$\begin{aligned} \Vert x^{k+1}-x^{\star } \Vert _{x^{\star }} \le \left[ (1 - \tau _k) + c_{\nu } d_{\nu }(x^{\star },x^k)\right] \Vert x^k-x^{\star }\Vert _{x^{\star }}. \end{aligned}$$
(69)

Let \(\underline{\sigma }^{\star } := \sigma _{\min }(\nabla ^2 f(x^{\star }))\) be the smallest eigenvalue of \(\nabla ^2 f(x^{\star })\). We consider the following cases:

(a) If \(\nu =2\), then, for \(0 \le d_{\nu }(x^{\star }, x^k) \le \bar{d}_{\nu }\), we can bound \(1-\tau _k\) as

$$\begin{aligned} \begin{array}{ll} 1-\tau _k &{} = 1-\frac{\ln (1+\beta _k)}{\beta _k} \le \frac{\beta _k}{2} = \frac{M_f}{2}\Vert z^k - x^k\Vert _2 \\ &{}\le \frac{M_f}{2}\frac{ \Vert z^k - x^k\Vert _{x^{\star }}}{\sqrt{\underline{\sigma }^{\star }}} \overset{\tiny (68)}{\le } \frac{c_{\nu } M_f}{2\sqrt{\underline{\sigma }^{\star }}} \Vert x^k-x^{\star }\Vert _{x^{\star }}. \end{array} \end{aligned}$$

On the other hand, we have \(d_{\nu }(x^{\star },x^k)=M_f\Vert x^k - x^{\star } \Vert _2 \le \frac{M_f}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k-x^{\star }\Vert _{x^{\star }}\). Substituting these estimates into (69), we get

$$\begin{aligned} \Vert x^{k+1} - x^{\star } \Vert _{x^{\star }}\le & {} \left( \frac{c_{\nu }M_f}{2\sqrt{\underline{\sigma }^{\star }}}\Vert x^k-x^{\star }\Vert _{x^{\star }} + \frac{c_{\nu }M_f}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k-x^{\star }\Vert _{x^{\star }}\right) \Vert x^k-x^{\star }\Vert _{x^{\star }}\\= & {} \frac{3c_{\nu }M_f}{2\sqrt{\underline{\sigma }^{\star }}} \Vert x^k-x^{\star }\Vert _{x^{\star }}^2. \end{aligned}$$

Let \(c^{\star }_{\nu } := \frac{3c_{\nu }M_f}{2\sqrt{\underline{\sigma }^{\star }}}\). The last estimate shows that if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ \frac{ \bar{d}_{\nu }\sqrt{\underline{\sigma }^{\star }}}{M_f}, \frac{1}{c^{\star }_{\nu }}\right\} \), then \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.

(b) If \(2 < \nu \le 3\), then we first show that

$$\begin{aligned} d_{\nu }(x^{\star },x^k)=\left( \tfrac{\nu }{2}-1\right) M_f\Vert x^k - x^{\star }\Vert _2^{3-\nu }\Vert x^k - x^{\star }\Vert _{x^{\star }}^{\nu -2} \le \left( \tfrac{\nu }{2}-1\right) \frac{M_f}{\left( \underline{\sigma }^{\star }\right) ^{\frac{3-\nu }{2}}}\Vert x^k-x^{\star }\Vert _{x^{\star }}. \end{aligned}$$

Hence, if \(\Vert x^k-x^{\star }\Vert _{x^{\star }}\le m_{\nu }\bar{d}_{\nu }\), where \(m_{\nu } := \tfrac{2}{\nu -2}\frac{\left( \underline{\sigma }^{\star }\right) ^{\frac{3-\nu }{2}}}{M_f}\), then \(d_{\nu }(x^{\star },x^k)\le \bar{d}_{\nu }\). Next, using the definition of \(d_k\) in (28), we can bound it as

$$\begin{aligned} \begin{array}{ll} d_k &{} = M_f\left( \frac{\nu }{2}-1\right) \Vert z^k-x^k \Vert _{x^k}^{\nu -2}\Vert z^k - x^k\Vert _2^{3-\nu }\\ &{}\overset{(15)}{\le } M_f\left( \frac{\nu }{2}-1\right) \left[ \frac{ \Vert z^k - x^k\Vert _{x^{\star }}}{(1-d_{\nu }(x^{\star },x^k))^{\frac{1}{\nu -2}}}\right] ^{\nu -2}\frac{\Vert z^k - x^k\Vert _{x^{\star }}^{3-\nu }}{(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}} \\ \nonumber &{} \le \frac{M_f}{(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}\left( \frac{\nu }{2}-1\right) \Vert z^k - x^k\Vert _{x^{\star }} \overset{\tiny (68)}{\le } \frac{M_f(\nu -2)}{2(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}c_{\nu }\Vert x^k - x^{\star } \Vert _{x^{\star }}. \end{array} \end{aligned}$$

Using this estimate, we can bound \(1-\tau _k\) as follows:

$$\begin{aligned} \begin{array}{ll} 1-\tau _k &{} = 1-\frac{1}{d_k}+\frac{1}{d_k}\left( 1-\frac{\frac{4-\nu }{\nu -2}d_k}{1+\frac{4-\nu }{\nu -2}d_k}\right) ^{\frac{\nu -2}{4-\nu }}\\ &{}\overset{\tiny \text {Bernoulli's inequality}}{\le } 1 - \frac{1}{d_k}+\frac{1}{d_k}\left( 1-\frac{\nu -2}{4-\nu }\frac{\frac{4-\nu }{\nu -2}d_k}{1+\frac{4-\nu }{\nu -2}d_k}\right) \\ \nonumber &{} = \frac{\frac{4-\nu }{\nu -2}d_k}{1+\frac{4-\nu }{\nu -2}d_k} \le \frac{4-\nu }{\nu -2}d_k \le \frac{M_f(4 -\nu )}{2(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}c_{\nu }\Vert x^k - x^{\star } \Vert _{x^{\star }} = n_{\nu }\Vert x^k - x^{\star }\Vert _{x^{\star }}, \end{array} \end{aligned}$$

where \(n_{\nu } := \frac{(4 -\nu )M_f}{2(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}c_{\nu } > 0\). Substituting this estimate into (69) and noting that \(d_{\nu }(x^{\star }, x^k) \le \frac{1}{m_{\nu }}\Vert x^k - x^{\star }\Vert _{x^{\star }}\), we get

$$\begin{aligned} \Vert x^{k+1} - x^{\star }\Vert _{x^{\star }} \le \left( n_{\nu } + \frac{c_{\nu }}{m_{\nu }}\right) \Vert x^k - x^{\star }\Vert _{x^{\star }}^2 := c^{\star }_{\nu }\Vert x^k - x^{\star }\Vert _{x^{\star }}^2. \end{aligned}$$

Hence, if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ m_{\nu }\bar{d}_{\nu }, \frac{1}{c^{\star }_{\nu }}\right\} \), then the last estimate shows that the sequence \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.

In summary, there exists a neighborhood \(\mathcal {N}(x^{\star })\) of \(x^{\star }\), such that if \(x^0\in \mathcal {N}(x^{\star })\cap \mathrm {dom}(F)\), then the whole sequence \(\left\{ \Vert x^k-x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero. \(\square \)
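As a minimal illustrative sketch (not the authors' code): for \(\nu = 2\) and \(g = \Vert \cdot \Vert _1\), one damped-step PN iteration can be written as follows, reusing the scaled proximal operator sketched after Lemma 3 and the step size \(\tau _k = \ln (1+\beta _k)/\beta _k\) with \(\beta _k = M_f\Vert z^k - x^k\Vert _2\) that appears in case (a) above; the callables grad_f and hess_f and the value of \(M_f\) are assumed to be supplied by the user.

```python
import numpy as np
# prox_l1 and scaled_prox as in the sketch after Lemma 3

def damped_pn_step(x, grad_f, hess_f, M_f):
    """One damped-step proximal Newton iteration for F = f + ||.||_1 with nu = 2.

    grad_f, hess_f: callables returning the gradient and Hessian of f at x;
    M_f: generalized self-concordance parameter of f (assumed known).
    """
    H = hess_f(x)
    # z^k solves the subproblem: H x^k - grad f(x^k) in (H + dg)(z^k)
    z = scaled_prox(H, H @ x - grad_f(x))
    beta = M_f * np.linalg.norm(z - x)
    tau = 1.0 if beta == 0.0 else np.log1p(beta) / beta   # damped step size (nu = 2)
    return (1.0 - tau) * x + tau * z                       # x^{k+1} = (1 - tau_k) x^k + tau_k z^k
```

Setting \(\tau _k = 1\) gives the full-step variant analyzed in the next subsection.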

1.8 The proof of Theorem 6: local quadratic convergence of the PN method

Since \(z^k\) is the optimal solution to (35) which satisfies (36), we have \(\nabla ^2 f(x^k)x^k-\nabla f(x^k)\in (\nabla ^2 f(x^k) + \partial g)(z^k)\). Using this optimality condition, we get

$$\begin{aligned} \begin{array}{lll} x^{k+1} &{}=z^k &{}= \mathcal {P}^g_{x^k}(S_{x^k}(x^k)+e_{x^k}(x^k,z^k))~~~~~\text { and }\\ x^{k+2} &{}=z^{k+1} &{}= \mathcal {P}^g_{x^k}(S_{x^k}(x^{k+1})+e_{x^k}(x^{k+1},z^{k+1})). \end{array} \end{aligned}$$

Let us define \(\tilde{\lambda }_{k+1}:=\Vert n_{\mathrm {pnt}}^{k+1}\Vert _{x^k}\). Then, by Lemma 3 and the triangle inequality, we have

$$\begin{aligned} \begin{array}{lll} \tilde{\lambda }_{k+1} &{} \le &{} \left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}+\left\| e_{x^k}(x^{k+1},z^{k+1})-e_{x^k}(x^k,z^k)\right\| _{x^k}^{*} \\ &{} = &{} \left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}+\left\| e_{x^k}(x^{k+1},z^{k+1})\right\| _{x^k}^{*}. \end{array} \end{aligned}$$
(70)

Let us first bound the term \(\left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}\) as follows:

$$\begin{aligned} \left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}\le R_{\nu }(d_{\nu }^k)d_{\nu }^k\lambda _k, \end{aligned}$$
(71)

where \(R_{\nu }(t)\) is defined by (56). Indeed, by the mean-value theorem, we have

$$\begin{aligned} \left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}= & {} \left\| \int _0^1 [\nabla ^2 f(x^k+tn_{\mathrm {pnt}}^k)-\nabla ^2 f(x^k)]n_{\mathrm {pnt}}^k\mathrm {d}t\right\| _{x^k}^{*} \\\le & {} \left\| H(x^k,x^{k+1})\right\| \lambda _k, \end{aligned}$$

where \(H\) is defined by (54). Combining the above inequality and (56) in Lemma 2, we get (71).

Next we bound the term \(\left\| e_{x^k}(x^{k+1},z^{k+1})\right\| _{x^k}^{*}\) as follows:

$$\begin{aligned} \left\| e_{x^k}(x^{k+1}, z^{k+1})\right\| _{x^k}^{*} \le {\left\{ \begin{array}{ll} \big [(1-d_{\nu }^k)^{\frac{-2}{\nu -2}}-1 \big ]\tilde{\lambda }_{k+1}, ~~~&{}\text {if}\, \nu > 2\\ (e^{d_{\nu }^k}-1)\tilde{\lambda }_{k+1} &{}\text {if}\,\, \nu = 2. \end{array}\right. } \end{aligned}$$
(72)

Note that

$$\begin{aligned} \left\| e_{x^k}(x^{k+1},z^{k+1})\right\| _{x^k}^{*}= & {} \left\| [\nabla ^2 f(x^k)-\nabla ^2 f(x^{k+1})](z^{k+1}-x^{k+1})\right\| _{x^k}^{*} \\\le & {} \Vert \widetilde{H}(x^k,x^{k+1})\Vert \tilde{\lambda }_{k+1}, \end{aligned}$$

where

$$\begin{aligned} \widetilde{H}(x,y):= & {} \nabla ^2 f(x)^{-1/2}\left( \nabla ^2 f(x)-\nabla ^2 f(y)\right) \nabla ^2 f(x)^{-1/2} \\= & {} \mathbb {I}- \nabla ^2{f}(x)^{-1/2}\nabla ^2{f}(y) \nabla ^2{f}(x)^{-1/2}. \end{aligned}$$

By Proposition 8, we have

$$\begin{aligned} \Vert \widetilde{H}(x,y)\Vert \le {\left\{ \begin{array}{ll} \max \left\{ 1-(1-d_{\nu }(x,y))^{\frac{2}{\nu -2}},(1-d_{\nu }(x,y))^{\frac{-2}{\nu -2}}-1\right\} , ~~~&{}\text {if}\, \nu > 2 \\ \max \left\{ 1-e^{-d_{\nu }(x,y)},e^{d_{\nu }(x,y)}-1\right\} &{}\text {if}\,\, \nu = 2. \end{array}\right. } \end{aligned}$$

Since \(1-e^{-t}\le e^{t}-1\) for \(t\ge 0\) and \(1-(1-t)^{\frac{2}{\nu -2}}\le (1-t)^{\frac{-2}{\nu -2}}-1\) for \(t\in [0,1)\) (both follow from \(u + u^{-1}\ge 2\) for \(u>0\)), this inequality simplifies to

$$\begin{aligned} \Vert \widetilde{H}(x,y)\Vert \le {\left\{ \begin{array}{ll} (1-d_{\nu }(x,y))^{\frac{-2}{\nu -2}}-1, ~~~&{}\text {if }\,\nu > 2 \\ e^{d_{\nu }(x,y)}-1 &{}\text {if}\,\, \nu = 2. \end{array}\right. } \end{aligned}$$
(73)

Hence, the inequality (72) holds.

Now, we combine (70), (71), and (72). If \(\nu = 2\) and \(d_2^k < \ln 2\), then we get

$$\begin{aligned} \tilde{\lambda }_{k+1}\le \frac{R_2(d_2^k)d_2^k}{2-e^{d_2^k}}\lambda _k. \end{aligned}$$

By Proposition 8, we have \(\lambda _{k+1}^2\le e^{d_{\nu }^k}\tilde{\lambda }_{k+1}^2\). Combining this estimate and the last inequality, we get

$$\begin{aligned} \lambda _{k+1}\le \frac{R_2(d_2^k)d_2^ke^{\frac{d_2^k}{2}}}{2-e^{d_2^k}}\lambda _k. \end{aligned}$$
(74)

Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\underline{\sigma }_{k+1}^{-1}\le e^{d_2^k}\underline{\sigma }_k^{-1}\). It follows from (74) that

$$\begin{aligned} \frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}}\le M_f\frac{R_2(d_2^k)e^{d_2^k}}{2-e^{d_2^k}}\left( \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right) ^2. \end{aligned}$$

By a numerical calculation, we can check that if \(d_2^k \le d_2^{\star }\approx 0.35482\), then

$$\begin{aligned} \frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}}\le 2M_f\left( \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right) ^2. \end{aligned}$$

Hence, if we choose \(x^0\in \mathrm {dom}(F)\) such that \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} \le \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove the following two inequalities together by induction:

$$\begin{aligned} d_2^{k+1} \le d_2^{k}~~~~\text {and}~~~~\frac{\lambda _{k+1}}{\sqrt{\underline{\sigma }_{k+1}}} \le \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}. \end{aligned}$$

These inequalities show that \(\left\{ d_2^k\right\} \) and \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) are nonincreasing. Combined with the contraction estimate above, they also show the local quadratic convergence of the sequence \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \).

Now, if \(\nu > 2\) and \(d_{\nu }^k < 1- \left( {\frac{1}{2}}\right) ^{\frac{\nu -2}{2}}\), then

$$\begin{aligned} \tilde{\lambda }_{k+1}\le \frac{R_{\nu }(d_{\nu }^k)d_{\nu }^k}{2-(1-d_{\nu }^k)^{\frac{-2}{\nu -2}}}\lambda _k. \end{aligned}$$

By Proposition 8, we have \(\lambda _{k+1}^2 \le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\tilde{\lambda }_{k+1}^2\). Hence, combining these inequalities, we get

$$\begin{aligned} \lambda _{k+1}\le \frac{R_{\nu }(d_{\nu }^k)d_{\nu }^k(1-d_{\nu }^k)^{\frac{-1}{\nu -2}}}{2-(1-d_{\nu }^k)^{\frac{-2}{\nu -2}}}\lambda _k. \end{aligned}$$
(75)

Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| p^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\), \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\) and \(\sigma _{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\sigma _k^{-1}\). Using these relations and (75), we consider two cases:

(a) If \(\nu = 3\), then \(d_3^k = \frac{M_f}{2}\lambda _k\), and

$$\begin{aligned} \lambda _{k+1}\le \frac{R_3(d_3^k)(1-d_3^k)^{-1}}{2-(1-d_3^k)^{-2}}d_3^k\lambda _k=M_f\frac{R_3(d_3^k)(1-d_3^k)^{-1}}{2\left( 2-(1-d_3^k)^{-2}\right) }\lambda _k^2. \end{aligned}$$

By a simple numerical calculation, we can show that if \(d_3^k \le d_3^{\star }\approx 0.20943\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{2}{M_f}d_3^{\star }\), then we can prove the following two inequalities together by induction:

$$\begin{aligned} d_3^{k+1} \le d_3^k\text { and }\lambda _{k+1} \le \lambda _k. \end{aligned}$$

These inequalities show the non-increasing monotonicity of \(\left\{ d_3^k\right\} \) and \(\left\{ \lambda _k\right\} \). Combined with the estimate \(\lambda _{k+1}\le 2M_f\lambda _k^2\), they also show the quadratic convergence of the sequence \(\left\{ \lambda _k\right\} \).

(b) If \(2< \nu < 3\), then \(\lambda _k \ge \Vert p^k\Vert _2\sqrt{\underline{\sigma }_k}\) which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence, we have

$$\begin{aligned} \frac{\lambda _{k+1}}{\underline{\sigma }_{k+1}^{\frac{3-\nu }{2}}}\le \frac{R_{\nu }(d_{\nu }^k)(1-d_{\nu }^k)^{-\frac{4-\nu }{\nu -2}}}{2-(1-d_{\nu }^k)^{\frac{-2}{\nu -2}}}\left( \frac{\nu }{2}-1\right) M_f\left( \frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\right) ^2. \end{aligned}$$

If \(d_{\nu }^k < d_{\nu }^{\star }\), then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\), where \(d_{\nu }^{\star }\) is the unique solution to the equation

$$\begin{aligned} \frac{R_{\nu }(d_{\nu }^k)(1-d_{\nu }^k)^{-\frac{4-\nu }{\nu -2}}}{2-(1-d_{\nu }^k)^{\frac{-2}{\nu -2}}}\left( \frac{\nu }{2}-1\right) = 2. \end{aligned}$$

Note that it is straightforward to check that this equation always admits a positive solution. Therefore, if \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 \le \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:

$$\begin{aligned} d_{\nu }^{k+1} \le d_{\nu }^{k}\text { and }\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1} \le \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k. \end{aligned}$$

These inequalities show the non-increasing monotonicity of \(\left\{ d_{\nu }^k\right\} \) and \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\). Combined with the contraction estimate above, they also show the quadratic convergence of the sequence \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\).

Finally, to prove the local quadratic convergence of \(\left\{ x^k\right\} \) to \(x^{\star }\), we use the same argument as in the proofs of Theorems 3 and 5; we omit the details here. \(\square \)

1.9 The proof of Theorem 7: convergence of the quasi-Newton method

The full-step quasi-Newton method for solving (24) can be written as \(x^{k+1} = x^k - B_k\nabla {f}(x^k)\). This is equivalent to \(H_k(x^{k+1} - x^k) + \nabla {f}(x^k) = 0\). Using this relation and \(\nabla {f}(x^{\star }_f) = 0\), we can write

$$\begin{aligned} x^{k+1} - x^{\star }_f= & {} \nabla ^2{f}(x^{\star }_f)^{-1}\left[ \nabla ^2{f}(x^{\star }_f)(x^k - x^{\star }_f) + \left( \nabla ^2{f}(x^{\star }_f) - H_k\right) (x^{k+1} - x^k)\right. \nonumber \\&\left. - \nabla {f}(x^k) + \nabla {f}(x^{\star }_f)\right] . \end{aligned}$$
(76)

We first consider \(T_k := \Vert \nabla ^2{f}(x^{\star }_f)^{-1}\left[ \nabla {f}(x^k) - \nabla {f}(x^{\star }_f) - \nabla ^2{f}(x^{\star }_f)(x^k - x^{\star }_f) \right] \Vert _{x^{\star }_f}\). Similar to the proof of Theorem 3, we can show that

$$\begin{aligned} T_k= & {} \Big \Vert \int _0^1\nabla ^2{f}(x^{\star }_f)^{-1}\left[ \nabla ^2{f}(x^{\star }_f + t(x^k - x^{\star }_f)) - \nabla ^2{f}(x^{\star }_f)\right] (x^k - x^{\star }_f)\Big \Vert _{x^{\star }_f} \nonumber \\\le & {} R_{\nu }( d_{\nu }^k )d_{\nu }^k\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \end{aligned}$$
(77)

where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := \left( \frac{\nu }{2} - 1\right) M_f\Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}^{\nu -2}\). Moreover, we note that

$$\begin{aligned} S_k:= & {} \Vert \nabla ^2{f}(x^{\star }_f)^{-1}\left( H_k - \nabla ^2{f}(x^{\star }_f)\right) (x^{k+1} - x^k)\Vert _{x^{\star }_f}\\= & {} \Vert \left( H_k - \nabla ^2{f}(x^{\star }_f)\right) (x^{k+1} - x^k)\Vert ^{*}_{x^{\star }_f}. \end{aligned}$$

Combining this estimate, (76), and (77), we can derive

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^{\star }_f} \le R_{\nu }( d_{\nu }^k )d_{\nu }^k\Vert x^k-x^{\star }_f\Vert _{x^{\star }_f} + \Vert \left( H_k - \nabla ^2{f}(x^{\star }_f)\right) (x^{k+1} - x^k)\Vert ^{*}_{x^{\star }_f}. \end{aligned}$$
(78)

First, we prove statement (a). Indeed, from the Dennis–Moré condition (41), we have

$$\begin{aligned} \Vert \left( H_k - \nabla ^2{f}(x^{\star }_f)\right) (x^{k+1} - x^k)\Vert ^{*}_{x^{\star }_f}\le & {} \gamma _k\Vert x^{k+1} -x^k\Vert _{x^{\star }_f}\\\le & {} \gamma _k\left( \Vert x^{k+1} - x^{\star }_f\Vert _{x^{\star }_f} + \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\right) , \end{aligned}$$

where \(\lim _{k\rightarrow \infty }\gamma _k = 0\). Substituting this estimate into (78), and noting that \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{1}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\), where \(\underline{\sigma }^{\star } := \lambda _{\min }(\nabla ^2{f}(x^{\star }_f)) > 0\), we can show that

$$\begin{aligned} \Vert x^{k+1} - x^{\star }_f\Vert _{x^{\star }_f} \le \frac{1}{1-\gamma _k}\left( R_{\nu }^{\star }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}^2 + \gamma _k\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\right) , \end{aligned}$$
(79)

provided that \(\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\) and \(R_{\nu }^{\star } := \max \left\{ R_{\nu }(d_{\nu }^k) \mid \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\right\} < +\infty \). Here, \(\bar{r} > 0\) is a given value such that \(R_{\nu }^{\star }\) is finite. The estimate (79) shows that if \(\bar{r}\) is sufficiently small, then \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\right\} \) converges superlinearly to zero. Finally, statement (b) is proved similarly by combining statement (a) and [62, Theorem 11]. \(\square \)
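For reference, a minimal sketch of the full-step quasi-Newton scheme \(x^{k+1} = x^k - B_k\nabla f(x^k)\) is given below. The BFGS update of the inverse-Hessian approximation \(B_k\) and the regularized logistic test function are assumptions made purely for illustration (the theorem itself only requires the Dennis–Moré condition (41)); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
A = rng.standard_normal((n, p)) / np.sqrt(n)
b = rng.integers(0, 2, n).astype(float)
mu = 0.1                                    # small quadratic term for strong convexity

def grad_f(x):
    # gradient of f(x) = sum_i log(1 + exp(a_i^T x)) - b^T A x + (mu/2)||x||^2
    sig = 1.0 / (1.0 + np.exp(-A @ x))
    return A.T @ (sig - b) + mu * x

x = np.zeros(p)
B = np.eye(p)                               # inverse-Hessian approximation B_k
for k in range(30):
    g = grad_f(x)
    x_new = x - B @ g                       # full quasi-Newton step (no line search)
    s, y = x_new - x, grad_f(x_new) - g
    if s @ y > 1e-12:                       # standard BFGS update of the inverse
        rho = 1.0 / (s @ y)
        V = np.eye(p) - rho * np.outer(s, y)
        B = V @ B @ V.T + rho * np.outer(s, s)
    x = x_new
    print(k, np.linalg.norm(grad_f(x)))     # gradient norm along the iterations
```

Under classical assumptions (convergence of the iterates and sufficient smoothness near \(x^{\star }_f\)), the BFGS update is known to satisfy the Dennis–Moré condition, which is what statement (a) requires.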
