Abstract
We study the smooth structure of convex functions by generalizing the powerful concept of self-concordance, introduced by Nesterov and Nemirovskii in the early 1990s, to a broader class of convex functions, which we call generalized self-concordant functions. This notion allows us to develop a unified framework for designing Newton-type methods to solve convex optimization problems. The proposed theory provides a mathematical tool for analyzing both the local and global convergence of Newton-type methods without imposing unverifiable assumptions, as long as the underlying functionals fall into our class of generalized self-concordant functions. First, we introduce the class of generalized self-concordant functions, which covers the class of standard self-concordant functions as a special case. Next, we establish several properties and key estimates of this function class that can be used to design numerical methods. We then apply this theory to develop several Newton-type methods for solving a class of smooth convex optimization problems involving generalized self-concordant functions. We provide an explicit step-size for a damped-step Newton-type scheme that guarantees global convergence without any globalization strategy. We also prove local quadratic convergence of this method and of its full-step variant without requiring Lipschitz continuity of the Hessian of the objective. We then extend our results to develop proximal Newton-type methods for a class of composite convex minimization problems involving generalized self-concordant functions, again achieving both global and local convergence without additional assumptions. Finally, we verify our theoretical results on several numerical examples and compare them with existing methods.
References
Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15(1), 595–627 (2014)
Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operators Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)
Becker, S., Fadili, M.J.: A quasi-Newton proximal splitting method. In: Proceedings of Neural Information Processing Systems Foundation (NIPS) (2012)
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). arXiv preprint arXiv:1609.08502
Bonnans, J.F.: Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim. 29, 161–186 (1994)
Borodin, A., El-Yaniv, R., Gogan, V.: Can we learn to beat the best stock. J. Artif. Intell. Res. (JAIR) 21, 579–594 (2004)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chen, T.-Y., Demmel, J.W.: Balancing sparse matrices for computing eigenvalues. Linear Algebra Appl. 309(1–3), 261–287 (2000)
Cohen, M., Madry, A., Tsipras, D., Vladu, A.: Matrix scaling and balancing via box constrained Newton’s method and interior-point methods. The 58th Annual IEEE Symposium on Foundations of Computer Science, pp. 902–913 (2017)
Dennis, J.E., Moré, J.J.: A characterisation of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28, 549–560 (1974)
Deuflhard, P.: Newton Methods for Nonlinear Problems—Affine Invariance and Adaptative Algorithms, volume 35 of Springer Series in Computational Mathematics, 2nd edn. Springer, Berlin (2006)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)
Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, pp. 3052–3060 (2015)
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)
Friedlander, M., Goh, G.: Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal. 46, 1–22 (2017)
Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without line search for self-concordant functions (2016). arXiv preprint arXiv:1612.06965
Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014, pp. 5058–5063
Goel, V., Grossmann, I.E.: A class of stochastic programs with decision dependent uncertainty. Math. Program. 108, 355–394 (2006)
Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization From Theory to Implementation, Nonconvex Optimization and its Applications, pp. 155–210. Springer, Berlin (2006)
Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions (2009)
Hazan, E., Arora, S.: Efficient Algorithms for Online Convex Optimization and their Applications. Princeton University, Princeton (2006)
He, N., Harchaoui, Z., Wang, Y., Song, L.: Fast and simple optimization for Poisson likelihood models (2016). arXiv preprint arXiv:1608.01264
Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, New York (2005)
Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)
Krishnapuram, B., Figueiredo, M., Carin, L., Hartemink, H.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 27, 957–968 (2005)
Kyrillidis, A., Karimi, R., Tran-Dinh, Q., Cevher, V.: Scalable sparse covariance estimation via self-concordance. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 1946–1952 (2014)
Lebanon, G., Lafferty, J.: Boosting and maximum likelihood for exponential models. Adv. Neural Inf. Process. Syst. (NIPS) 14, 447 (2002)
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for convex optimization. SIAM J. Optim. 24(3), 1420–1443 (2014)
Lu, Z.: Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2016)
Marron, J.S., Todd, M.J., Ahn, J.: Distance-weighted discrimination. J. Am. Stat. Assoc. 102(480), 1267–1271 (2007)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. CRC Press, Boca Raton (1989)
Monteiro, R.D.C., Sicre, M.R., Svaiter, B.F.: A hybrid proximal extragradient self-concordant primal barrier method for monotone variational inequalities. SIAM J. Optim. 25(4), 1965–1996 (2015)
Nelder, J.A., Baker, R.J.: Generalized Linear Models. Encyclopedia of Statistical Sciences. Wiley, New York (1972)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Cubic regularization of Newton’s method for convex problems with constraints. CORE Discussion Paper 2006/39, Catholic University of Louvain (UCL) - Center for Operations Research and Econometrics (CORE) (2006)
Nesterov, Y.: Accelerating the cubic regularization of Newton's method on convex problems. Math. Program. 112, 159–181 (2008)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Nemirovski, A.: Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, Philadelphia (1994)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 112(1), 177–205 (2006)
Nocedal, J., Wright, S.J.: Numerical Optimization, Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, Berlin (2006)
O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)
Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank–Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, Philadelphia (2000)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Parlett, B.N., Landis, T.L.: Methods for scaling to doubly stochastic form. Linear Algebra Appl. 48, 53–79 (1982)
Peng, J., Roos, C., Terlaky, T.: Self-Regularity. A New Paradigm for Primal-Dual Interior-Point Algorithms. Princeton University Press, Princeton (2009)
Pilanci, M., Wainwright, M.J.: Newton sketch: a linear-time optimization algorithm with linear-quadratic convergence (2015). arXiv preprint arXiv:1505.02250
Polyak, R.A.: Regularized Newton method for unconstrained convex optimization. Math. Program. 120(1), 125–145 (2009)
Robinson, S.M.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods I: globally convergent algorithms (2016). arXiv preprint arXiv:1601.04737
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods II: local convergence rates (2016). arXiv preprint arXiv:1601.04738
Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)
Toh, K.-Ch., Todd, M.J., Tütüncü, R.H.: On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming. Tech. Report 4, NUS Singapore (2010)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: A proximal Newton framework for composite minimization: graph learning without Cholesky decompositions and matrix inversions. JMLR W&CP 28(2), 271–279 (2013)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
Tran-Dinh, Q., Li, Y.-H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Pham Dinh, T., Le-Thi, H., Nguyen, N. (eds.) Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, New York (2015)
Tran-Dinh, Q., Necoara, I., Diehl, M.: A dual decomposition algorithm for separable nonconvex optimization using the penalty function framework. In: Proceedings of the Conference on Decision and Control (CDC), Florence, Italy, December, pp. 2372–2377 (2013)
Tran-Dinh, Q., Necoara, I., Diehl, M.: Path-following gradient-based decomposition algorithms for separable convex optimization. J. Global Optim. 59(1), 59–80 (2014)
Tran-Dinh, Q., Necoara, I., Savorgnan, C., Diehl, M.: An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim. 23(1), 95–125 (2013)
Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Technical Report (submitted) (2016)
Vapnik, V.N., Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)
Verscheure, D., Demeulenaere, B., Swevers, J., De Schutter, J., Diehl, M.: Time-optimal path tracking for robots: a convex optimization approach. IEEE Trans. Autom. Control 54, 2318–2327 (2009)
Yamashita, M., Fujisawa, K., Kojima, M.: Implementation and evaluation of SDPA 6.0 (SemiDefinite Programming Algorithm 6.0). Optim. Method Softw. 18, 491–505 (2003)
Yang, T., Lin, Q.: RSG: beating SGD without smoothness and/or strong convexity. CoRR abs/1512.03107 (2016)
Zhang, Y., Lin, X.: DiSCO: Distributed optimization for self-concordant empirical loss. In: Proceedings of the 32th International Conference on Machine Learning, pp. 362–370 (2015)
Acknowledgements
This work is partially supported by NSF grant no. DMS-1619884, USA.
Appendix: The proof of technical results
This appendix provides the full proofs of the technical results stated in this paper, including the proofs omitted from the main text, together with a full convergence analysis of the Newton-type methods presented there.
1.1 The proof of Proposition 6: Fenchel’s conjugate
Let us consider the set \(\mathcal {X}:= \{x\in \mathbb {R}^p \mid f(u) - \langle x, u\rangle ~\text {is bounded from below on}~\mathrm {dom}(f)\}\). We first show that \(\mathrm {dom}(f^{*}) = \mathcal {X}\).
By the definition of \(\mathrm {dom}(f^{*})\), we have \(\mathrm {dom}(f^{*}) = \left\{ x\in \mathbb {R}^p \mid f^{*}(x) < +\,\infty \right\} \). Taking any \(x\in \mathrm {dom}(f^{*})\), one has \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} <+\,\infty \). Hence, \(f(u) - \langle x,u\rangle \ge -f^{*}(x) > -\infty \) for all \(u\in \mathrm {dom}(f)\), which implies \(x\in \mathcal {X}\).
Conversely, assume that \(x\in \mathcal {X}\). By the definition of \(\mathcal {X}\), \(f(u)-\langle x,u\rangle \) is bounded from below for all \(u\in \mathrm {dom}(f)\). That is, there exists \(M \in [0, +\,\infty )\), such that \(f(u) - \langle x, u\rangle \ge -M\) for all \(u\in \mathrm {dom}(f)\). By the definition of the conjugate, \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} \le M < +\,\infty \). Hence, \(x\in \mathrm {dom}(f^{*})\).
For any \(x\in \mathrm {dom}(f^{*})\), the optimality condition of \(\max _{u}\left\{ \langle x, u\rangle - f(u)\right\} \) is \(x = \nabla {f}(u)\). Let us denote \(x(u) := \nabla {f}(u)\). Then, we have \(f^{*}(x(u)) = \langle x(u), u\rangle - f(u)\). Taking the derivative of \(f^{*}\) with respect to x on both sides, and using \(x(u)=\nabla f(u)\), we have \(\nabla {f^{*}}(x(u)) = u\).
We further take the second-order derivative of the above equation with respect to u to get \(\nabla ^2{f^{*}}(x(u))\,x_u'(u) = \mathbb {I}\).
Using the two relations above and the fact that \(x_u'(u) = \nabla ^2{f}(u)\), we can derive
where \(u\in \mathrm {dom}(f)\), and \(v, w\in \mathbb {R}^p\). Using (50) and (51), we can compute the third-order derivative of \(f^{*}\) with respect to x(u) as
Denote \(\xi := x_u'(u)w\) and \(\eta := x_u'(u)v\). Since \(x_u'(u) = \nabla ^2{f}(u)\), we have \(\xi = \nabla ^2{f}(u)w\), \(\eta = \nabla ^2{f}(u)v\), and \(w = \nabla ^2{f}(u)^{-1}\xi \). Using these relations and \(\nabla ^2f^{*}(x(u))x_u'(u) = \mathbb {I}\), we can derive
For any \(H\in \mathcal {S}^p_{++}\), we have \(\langle H\xi , \xi \rangle \le \left\| H\xi \right\| _2\left\| \xi \right\| _2\). For any \(\nu \ge 3\), this inequality leads to
Using this inequality with \(H = \nabla ^2f^{*}(x(u))\) into the last expression, we obtain
By Definition 2, we need \(\nu - 3 = 3 - \nu _{*}\) and \(4-\nu = \nu _{*} - 2\) which hold if \(\nu _{*} = 6 - \nu \). Under the choice of \(\nu _{*}\), the above inequality shows that \(f^{*}\) is \((M_{f^{*}}, \nu _{*})\)-generalized self-concordant with \(M_{f^{*}} = M_f\) and \(\nu _{*} = 6 - \nu \). However, to guarantee \(\nu - 3 \ge 0\) and \(6 - \nu > 0\), we require \(3 \le \nu < 6\).
Finally, we prove the case of univariate functions, i.e., \(p = 1\). Indeed, we have
Here, \(f'\) is the derivative of f with respect to u. Taking the derivative of the last equation on both sides with respect to u, we obtain
Solving this equation for \((f^{*})'''(x(u))\) and then using (53) and \(x''(u) = f'''(u)\), we get
This inequality shows that \(f^{*}\) is generalized self-concordant with \(\nu _{*} = 6 - \nu \) for any \(\nu \in (0, 6)\). \(\square \)
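The conjugate identities used in this proof can be illustrated numerically. The sketch below (an illustrative check, not part of the proof) uses the univariate example \(f(u) = e^u\), whose conjugate is \(f^{*}(x) = x\ln x - x\) for \(x > 0\), and verifies that \(x(u) = f'(u)\), \((f^{*})'(x(u)) = u\), and \((f^{*})''(x(u)) = 1/f''(u)\):

```python
import math

# Illustrative univariate example: f(u) = exp(u) has conjugate
# f*(x) = x*ln(x) - x on x > 0. We check the identities used in the
# proof: x(u) = f'(u), (f*)'(x(u)) = u, and (f*)''(x(u)) = 1/f''(u).
f = math.exp                           # f(u) = e^u, so f'(u) = f''(u) = e^u
f_conj = lambda x: x * math.log(x) - x

def num_deriv(g, x, h=1e-6):
    # central finite difference
    return (g(x + h) - g(x - h)) / (2 * h)

for u in [-1.0, 0.0, 0.7, 2.0]:
    x_u = math.exp(u)                              # x(u) = f'(u)
    # conjugate value attains sup_u {<x,u> - f(u)} at u
    assert abs(f_conj(x_u) - (x_u * u - f(u))) < 1e-9
    # (f*)'(x(u)) = u
    assert abs(num_deriv(f_conj, x_u) - u) < 1e-5
    # (f*)''(x(u)) = 1 / f''(u), via second-order central difference
    d2 = (f_conj(x_u + 1e-4) - 2 * f_conj(x_u) + f_conj(x_u - 1e-4)) / 1e-8
    assert abs(d2 - 1.0 / math.exp(u)) < 1e-4
print("conjugate identities verified")
```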
1.2 The proof of Corollary 2: bound on the mean of Hessian operator
Let \(y_{\tau } := x+ \tau (y- x)\). Then \(d_{\nu }(x, y_{\tau }) = \tau d_{\nu }(x, y)\). By (15), we have \(\nabla ^2{f}(x+ \tau (y- x)) \preceq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}\nabla ^2{f}(x)\) and \(\nabla ^2{f}(x+ \tau (y- x)) \succeq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}\nabla ^2{f}(x)\). Hence, we have
where \(\underline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}d\tau \) and \(\overline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}d\tau \) are the two integrals in the above inequality. Computing these integrals explicitly, we can show that
- If \(\nu = 4\), then \(\underline{I}_{\nu }(x,y) = \frac{1 - (1 - d_4(x,y))^2}{2d_4(x,y)}\) and \(\overline{I}_{\nu }(x, y) = \frac{-\ln (1 - d_4(x,y))}{d_4(x,y)}\).
- If \(\nu \ne 4\), then we can easily compute \(\underline{I}_{\nu }(x, y) = \frac{(\nu -2)}{\nu d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu }{\nu -2}}\right) \), and \(\overline{I}_{\nu }(x, y) = \frac{(\nu -2)}{(\nu -4)d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu -4}{\nu -2}}\right) \).
Hence, we obtain (18).
Finally, we prove for the case \(\nu = 2\). Indeed, by (16), we have \(e^{-d_2(x,y_{\tau })}\nabla ^2f(x) \preceq \nabla ^2f(y_{\tau }) \preceq e^{d_2(x,y_{\tau })}\nabla ^2f(x)\). Since \(d_2(x, y_{\tau }) = \tau d_2(x, y)\), the last estimate leads to
which is exactly (18). \(\square \)
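As a sanity check on the closed-form values of \(\underline{I}_{\nu }(x,y)\) and \(\overline{I}_{\nu }(x,y)\) above, they can be compared against direct quadrature of the defining integrals; the values of \(\nu > 2\) and \(d = d_{\nu }(x,y)\) below are illustrative:

```python
import math

# Illustrative numerical check (not part of the proof): the closed-form
# expressions for I_lower and I_upper are compared with a midpoint-rule
# quadrature of their defining integrals over tau in [0, 1].
def quad(g, n=100000):
    # midpoint rule on [0, 1]
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))

def I_closed(nu, d):
    if nu == 4:
        lower = (1 - (1 - d)**2) / (2 * d)
        upper = -math.log(1 - d) / d
    else:
        lower = (nu - 2) / (nu * d) * (1 - (1 - d)**(nu / (nu - 2)))
        upper = (nu - 2) / ((nu - 4) * d) * (1 - (1 - d)**((nu - 4) / (nu - 2)))
    return lower, upper

for nu in [2.5, 3.0, 4.0, 5.0]:
    for d in [0.1, 0.5, 0.9]:
        lo_num = quad(lambda t: (1 - t * d)**(2 / (nu - 2)))
        up_num = quad(lambda t: (1 - t * d)**(-2 / (nu - 2)))
        lo, up = I_closed(nu, d)
        assert abs(lo - lo_num) <= 1e-5 * max(1.0, abs(lo))
        assert abs(up - up_num) <= 1e-5 * max(1.0, abs(up))
print("closed forms match quadrature")
```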
1.3 Technical lemmas
The following lemmas will be used in our analysis. Lemma 1 is elementary, but we provide its proof for completeness.
Lemma 1
- (a) For a fixed \(r \ge 1\) and \(\bar{t} \in (0, 1)\), consider the function \(\psi _r(t) := \frac{1 - (1-t)^r - rt(1-t)^r}{rt^2(1-t)^r}\) on \(t\in (0, 1)\). Then, \(\psi _r\) is positive and increasing on \((0, \bar{t}]\) and
$$\begin{aligned} \lim _{t\rightarrow 0^{+}}\psi _r(t) = \tfrac{r+1}{2},~~\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty ,~~\text {and} ~~~\sup _{0 \le t \le \bar{t}}\left| \psi _r(t)\right| \le \bar{C}_r(\bar{t}) < +\,\infty , \end{aligned}$$where \(\bar{C}_r(\bar{t}) := \frac{1 - (1-\bar{t})^r - r\bar{t}(1-\bar{t})^r}{r\bar{t}^2(1-\bar{t})^r} \in (0, +\,\infty )\).
- (b) For \(t > 0\), we also have \(\frac{e^{t} - 1 - t}{t} \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\).
Proof
The statement \(\mathrm {(b)}\) is elementary, so we only prove \(\mathrm {(a)}\). Since \(r \ge 1\), we have \(\lim _{t\rightarrow 0^{+}}(1 - (1-t)^r - rt(1-t)^r) = \lim _{t\rightarrow 0^{+}}rt^2(1-t)^r = 0\) and \(rt^2(1-t)^r > 0\) for \(t\in (0, 1)\). Applying L'Hôpital's rule, we have
The limit \(\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty \) is obvious.
Next, it is easy to compute \(\psi '_r(t) = \frac{(1-t)^{r+1}(rt+2)+(r+2)t-2}{rt^3(1-t)^{r+1}}\). Let \(m_r(t) := (1-t)^{r+1}(rt+2)+(r+2)t-2\) be the numerator of \(\psi '_r(t)\).
We have \(m_r'(t) = r+2 - (1-t)^r(r^2t+2rt+r+2)\), and \(m_r''(t) = r(r+1)(r+2)t(1-t)^{r-1}\). Clearly, since \(r \ge 1\), \(m_r''(t) \ge 0\) for \(t \in [0, 1]\). This implies that \(m_r'\) is nondecreasing on [0, 1]. Hence, \(m_r'(t) \ge m_r'(0) = 0\) for all \(t \in [0, 1]\). Consequently, \(m_r\) is nondecreasing on [0, 1]. Therefore, \(m_r(t) \ge m_r(0) = 0\) for all \(t\in [0, 1]\). Using the formula of \(\psi '_r\), we can see that \(\psi '_r(t) \ge 0\) for all \(t \in (0, 1)\). This implies that \(\psi _r\) is nondecreasing on (0, 1). Moreover, \(\lim _{t\rightarrow 0^+}\psi _r(t) = \frac{r+1}{2} > 0\). Hence, \(\psi _r(t) > 0\) for all \(t\in (0, 1)\). This implies that \(\psi _r\) is bounded on \((0, \bar{t}] \subset (0, 1)\) by \(\psi _r(\bar{t})\). \(\square \)
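Lemma 1 lends itself to a quick numerical sanity check (an illustration, not a proof): the limit \(\lim _{t\rightarrow 0^{+}}\psi _r(t) = \frac{r+1}{2}\), the positivity and monotonicity of \(\psi _r\) on \((0, 1)\), and the elementary bound in part (b):

```python
import math

# Numerical sanity check of Lemma 1 for a few sample values of r >= 1.
def psi(r, t):
    return (1 - (1 - t)**r - r * t * (1 - t)**r) / (r * t**2 * (1 - t)**r)

for r in [1.0, 2.0, 3.5]:
    # limit at 0+ equals (r + 1)/2
    assert abs(psi(r, 1e-5) - (r + 1) / 2) < 1e-3
    # positive and increasing on a grid in (0, 1)
    grid = [0.01 * i for i in range(1, 100)]
    vals = [psi(r, t) for t in grid]
    assert all(v > 0 for v in vals)
    assert all(b > a for a, b in zip(vals, vals[1:]))

# part (b): (e^t - 1 - t)/t <= (3/2 + t/3) * t * e^t for t > 0
for t in [0.01 * i for i in range(1, 500)]:
    assert (math.exp(t) - 1 - t) / t <= (1.5 + t / 3) * t * math.exp(t)
print("Lemma 1 checks pass")
```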
Similar to Corollary 2, we can prove the following lemma on the bound of the Hessian difference.
Lemma 2
Given \(x, y\in \mathrm {dom}(f)\), the matrix \(H(x,y)\) defined by
satisfies
where \(R_{\nu }(t)\) is defined as follows for \(t \in [0, 1)\):
Moreover, for a fixed \(\bar{t} \in (0, 1)\), we have \(\sup _{0 \le t \le \bar{t}}\left| R_{\nu }(t)\right| \le \bar{M}_{\nu }(\bar{t})\), where
Proof
By Corollary 2, if we define \(G(x,y) := \int _0^1 \left[ \nabla ^2{f}(x+ \tau (y-x)) - \nabla ^2{f}(x)\right] d\tau \), then
Since \(H(x,y) = \nabla ^2f(x)^{-1/2}G(x,y)\nabla ^2f(x)^{-1/2}\), the last inequality implies
Let \(C_{\max }(t) := \max \big \{1 - \underline{\kappa }_{\nu }(t), \overline{\kappa }_{\nu }(t) - 1 \big \}\) for \(t \in [0, 1)\). We consider three cases.
(a) For \(\nu = 2\), since \(e^{-t} + e^{t} \ge 2\), we have \(\frac{1-e^{-t}}{t}+ \frac{e^{t}-1}{t} \ge 2\) which implies \(C_{\max }(t) = \overline{\kappa }_{\nu }(t) - 1 = \frac{e^{t}-1-t}{t}\). Hence, by Lemma 1, we have \(C_{\max }(t) \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\) which leads to \(R_{\nu }(t) := \left( \frac{3}{2} + \frac{t}{3}\right) e^t\).
(b) For \(\nu \in (2, 3]\), we have
Indeed, we show that \(\frac{(\nu - 2)}{(4 -\nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] + \frac{(\nu - 2)}{\nu t}\left[ 1 - (1 - t)^{\frac{\nu }{\nu - 2}}\right] \ge 2\). Let \(u := \frac{4-\nu }{\nu -2} > 0\) and \(v := \frac{\nu }{\nu -2} > 0\). The last inequality is equivalent to \(\frac{1}{u}\left[ \frac{1}{(1 - t)^u}-1\right] + \frac{1}{v}\left[ 1 - (1-t)^v\right] \ge 2t\) which can be reformulated as \(\frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t \ge 0\). Consider \(s(t) := \frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t\). It is clear that \(s'(t) = \frac{1}{(1-t)^{u+1}} + (1-t)^{v-1} - 2 = (1-t)^{-\frac{2}{\nu -2}} + (1-t)^{\frac{2}{\nu -2}} - 2 \ge 0\) for all \(t\in [0, 1)\). We obtain \(s(t) \ge s(0) = 0\). Hence, \(C_{\max }(t) = \frac{(\nu - 2)}{(4 - \nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] - 1\).
Let us define \(r := \frac{4-\nu }{\nu -2} = \frac{2}{\nu -2} - 1\). Then, it is clear that \(\nu = 2 + \frac{2}{1+r}\), and \(\nu \in (2, 3]\) is equivalent to \(r \ge 1\). Now, using Lemma 1 with \(r = \frac{2}{\nu -2} - 1 \ge 1\), we obtain \(R_{\nu }(t) := \frac{1 - (1-t)^{\frac{4-\nu }{\nu -2}} - \left( \frac{4-\nu }{\nu -2}\right) t(1-t)^{\frac{4-\nu }{\nu -2}}}{\left( \frac{4-\nu }{\nu -2}\right) t^2(1-t)^{\frac{4-\nu }{\nu -2}}}\). Putting (a) and (b) together, we obtain (55) with \(R_{\nu }\) defined by (56). The boundedness of \(R_{\nu }\) follows from Lemma 1. \(\square \)
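The two facts used in this proof for \(\nu \in (2, 3]\) can be confirmed numerically from the explicit forms derived above (an illustrative check, not part of the proof): (i) \(\overline{\kappa }_{\nu }(t) - 1 \ge 1 - \underline{\kappa }_{\nu }(t)\), so \(C_{\max }(t) = \overline{\kappa }_{\nu }(t) - 1\); and (ii) \(\overline{\kappa }_{\nu }(t) - 1 = t\,R_{\nu }(t)\) with \(R_{\nu } = \psi _r\) from Lemma 1 and \(r = \frac{4-\nu }{\nu -2}\):

```python
# Numerical check of facts (i) and (ii) above for sample nu in (2, 3].
def kappa_lower(nu, t):
    return (nu - 2) / (nu * t) * (1 - (1 - t)**(nu / (nu - 2)))

def kappa_upper(nu, t):
    return (nu - 2) / ((4 - nu) * t) * ((1 - t)**(-(4 - nu) / (nu - 2)) - 1)

def R(nu, t):
    r = (4 - nu) / (nu - 2)
    return (1 - (1 - t)**r - r * t * (1 - t)**r) / (r * t**2 * (1 - t)**r)

for nu in [2.2, 2.5, 3.0]:
    for t in [0.01 * i for i in range(1, 100)]:
        lhs = kappa_upper(nu, t) - 1
        assert lhs >= 1 - kappa_lower(nu, t) - 1e-12            # fact (i)
        assert abs(lhs - t * R(nu, t)) <= 1e-9 * max(1.0, lhs)  # fact (ii)
print("Lemma 2 bounds verified for sample nu in (2, 3]")
```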
1.4 The proof of Theorem 4: solution existence and uniqueness
Consider a sublevel set \(\mathcal {L}_F(x):=\left\{ y\in \mathrm {dom}(F) \mid F(y)\le F(x)\right\} \) of F in (32). For any \(y\in \mathcal {L}_F(x)\) and \(v\in \partial g(x)\), by (22) and the convexity of g, we have
By the Cauchy-Schwarz inequality, we have
Now, using the assumption \(\nabla ^2{f}(x)\succ 0\) for some \(x\in \mathrm {dom}(F)\), we have \(\sigma _{\min }(x) := \lambda _{\min }(\nabla ^2{f}(x)) > 0\), the smallest eigenvalue of \(\nabla ^2{f}(x)\).
- (a) If \(\nu = 2\), then \(d_2(x,y)=M_f\left\| y-x\right\| _2\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| y-x\right\| _{x}\). This estimate together with (58) implies
$$\begin{aligned} \omega _2\left( -d_2(x, y)\right) d_2(x,y)\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| \nabla f(x)+v\right\| _{x}^{*} = \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x). \end{aligned}$$(59)We consider the function \(s_2(t) := \omega _2(-t)t = 1 - \frac{1-e^{-t}}{t}\). Clearly, \(s_2'(t) = \frac{e^t - t - 1}{t^2e^t} > 0\) for all \(t \in \mathbb {R}_{+}\). Hence, \(s_2(t)\) is increasing on \(\mathbb {R}_{+}\). However, \(s_2(t) < 1\) and \(\lim \limits _{t\rightarrow +\,\infty }s_2(t) = 1\). Therefore, if \(\frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) < 1\), then the equation \(s_2(t) - \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) = 0\) has a unique solution \(t^{*} \in (0, +\,\infty )\). In this case, for \(0 \le d_2(x, y) \le t^{*}\), (59) holds. This condition leads to \(M_f\left\| y-x\right\| _2 \le t^{*} <+\,\infty \) which implies that the sublevel set \(\mathcal {L}_F(x)\) is bounded. Consequently, solution \(x^{\star }\) of (32) exists.
- (b) If \(2< \nu < 3\), then
$$\begin{aligned} d_{\nu }(x,y)\le \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| y-x\right\| _{x}. \end{aligned}$$This inequality together with (58) imply
$$\begin{aligned} \omega _{\nu }\left( -d_{\nu }(x, y)\right) d_{\nu }(x,y)\le & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| \nabla f(x)+v\right\| _{x}^{*}\\= & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x). \end{aligned}$$We consider \(s_{\nu }(t) := \omega _{\nu }(-t)t\). After a few elementary calculations, we can easily check that \(s_{\nu }\) is increasing on \(\mathbb {R}_{+}\) and \(s_{\nu }(t) < \frac{\nu -2}{4-\nu }\) for all \(t > 0\), and \(\lim \limits _{t\rightarrow +\,\infty }s_{\nu }(t) = \frac{\nu -2}{4-\nu }\). Hence, if \(\left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x) < \frac{\nu -2}{4-\nu }\), then, similar to Case (a), we can show that solution \(x^{\star }\) of (32) exists. This condition implies that \(\lambda (x) < \frac{2\sigma _{\min }(x)^{\frac{3-\nu }{2}}}{(4-\nu )M_f}\).
- (c) If \(\nu = 3\), then \(d_3(x,y) = \frac{M_f}{2}\left\| y-x\right\| _{x}\). Combining this estimate and (58), we get
$$\begin{aligned} \omega _3\left( -d_3(x, y)\right) d_3(x,y)\le \frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*}. \end{aligned}$$With the same proof as in [40, Theorem 4.1.11], if \(\frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*} < 1\) which is equivalent to \(\lambda (x) < \frac{2}{M_f}\), then solution \(x^{\star }\) of (32) exists.
Note that the condition on \(\lambda (x)\) in the three cases (a), (b), and (c) can be unified. The uniqueness of the solution \(x^{\star }\) in these three cases follows from the strict convexity of F. \(\square \)
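The properties of \(s_2(t) = 1 - \frac{1-e^{-t}}{t}\) used in case (a) are easy to confirm numerically (an illustration only): \(s_2\) is increasing on \((0, +\infty )\), bounded above by 1, and tends to 1, so \(s_2(t) = c\) has a unique positive root whenever \(c \in (0, 1)\):

```python
import math

# s_2(t) = 1 - (1 - e^{-t})/t: increasing, < 1, and -> 1 as t -> +inf.
def s2(t):
    return 1.0 - (1.0 - math.exp(-t)) / t

grid = [0.1 * i for i in range(1, 500)]
vals = [s2(t) for t in grid]
assert all(v < 1.0 for v in vals)                  # bounded above by 1
assert all(b > a for a, b in zip(vals, vals[1:]))  # increasing
assert s2(1e4) > 0.999                             # approaches 1

# unique root of s_2(t) = c by bisection (valid by monotonicity)
def root(c, lo=1e-8, hi=1e6):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s2(mid) < c else (lo, mid)
    return 0.5 * (lo + hi)

t_star = root(0.5)
assert abs(s2(t_star) - 0.5) < 1e-9
print("root of s2(t) = 0.5 at t* =", round(t_star, 4))
```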
1.5 The proof of Theorem 2: convergence of the damped-step Newton method
The proof of this theorem is divided into two parts: computing the step-size, and proving the local quadratic convergence.
Computing the step-size \(\tau _k\): From Proposition 10, for any \(x^k,x^{k+1}\in \mathrm {dom}(f)\), if \(d_{\nu }(x^k,x^{k+1}) < 1\), then we have
Now, using (25), we have \(\langle \nabla {f}(x^k), x^{k+1}-x^k\rangle = -\tau _k\left( \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}\right) ^2 = -\tau _k\lambda _k^2\). On the other hand, we have
Using the definition of \(d_{\nu }(\cdot )\) in (12), the two last equalities, and (28), we can easily show that \(d_{\nu }(x^k, x^{k+1}) = \tau _kd_k\). Substituting these relations into the first estimate, we obtain
We consider the following cases:
(a) If \(\nu = 2\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau - \left( \frac{\lambda _k}{d_k}\right) ^2\left( e^{\tau d_k} - \tau d_k - 1\right) \) with \(d_k = \beta _k\). This function attains the maximum at \(\tau _k := \frac{\ln (1 + d_k)}{d_k} = \frac{\ln (1 + \beta _k)}{\beta _k} \in (0, 1)\) with
It is easy to check from the right-most term of the last expression that \(\varDelta _k := \eta _k(\tau _k) > 0\) for \(\tau _k > 0\).
(b) If \(\nu = 3\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau + \left( \frac{\lambda _k}{d_k}\right) ^2\left[ \tau d_k + \ln (1 - \tau d_k)\right] \) with \(d_k = 0.5M_f\lambda _k\). We can show that \(\eta _k(\tau )\) achieves the maximum at \(\tau _k = \frac{1}{1 + d_k} = \frac{1}{1 + 0.5M_f\lambda _k}\in (0,1)\) with
We can also easily check that the last term \(\varDelta _k := \eta _k(\tau _k)\) of this expression is positive for \(\lambda _k > 0\).
(c) If \(2< \nu < 3\), then we have \(d_k=M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \lambda _k^{\nu -2}\beta _k^{3-\nu }\). By (23), we have
Our aim is to find \(\tau ^{*} \in (0, 1]\) by solving \(\max _{\tau \in [0, 1]}\eta _k(\tau )\). This problem always has a global solution. First, we compute the first- and the second-order derivatives of \(\eta _k\) as follows:
Let us set \(\eta _k'(\tau _k) = 0\). Then, we get
with
In addition, we can check that \(\eta _k''(\tau _k) < 0\). Hence, the value of \(\tau _k\) above achieves the maximum of \(\eta _k(\cdot )\). Then, we have \(\varDelta _k := \eta _k(\tau _k) > \eta _k(0)=0\).
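The explicit maximizers in cases (a) and (b) can be confirmed numerically for illustrative values of \((\lambda _k, d_k)\) (not tied to any particular iterate): \(\tau _k = \ln (1+d_k)/d_k\) for \(\nu = 2\) and \(\tau _k = 1/(1+d_k)\) for \(\nu = 3\), with a positive decrease \(\varDelta _k = \eta _k(\tau _k)\) in both cases:

```python
import math

# eta_k for nu = 2 and nu = 3, as given in cases (a) and (b) above.
def eta_nu2(tau, lam, d):
    return lam**2 * tau - (lam / d)**2 * (math.exp(tau * d) - tau * d - 1)

def eta_nu3(tau, lam, d):
    return lam**2 * tau + (lam / d)**2 * (tau * d + math.log(1 - tau * d))

for lam, d in [(0.5, 0.3), (1.0, 0.8), (2.0, 1.5)]:
    tau2 = math.log(1 + d) / d       # claimed maximizer for nu = 2
    tau3 = 1.0 / (1 + d)             # claimed maximizer for nu = 3
    assert 0 < tau2 < 1 and 0 < tau3 < 1
    for eps in (1e-4, -1e-4):
        # perturbing the step size can only decrease eta_k
        assert eta_nu2(tau2, lam, d) >= eta_nu2(tau2 + eps, lam, d)
        assert eta_nu3(tau3, lam, d) >= eta_nu3(tau3 + eps, lam, d)
    # the guaranteed decrease Delta_k is positive
    assert eta_nu2(tau2, lam, d) > 0 and eta_nu3(tau3, lam, d) > 0
print("explicit step sizes maximize eta_k with positive decrease")
```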
The proof of local quadratic convergence: Let \(x^{\star }_f\) be the optimal solution of (24). We have
Hence, we can write
Let us define \(T_k := \Big \Vert \nabla ^2{f}(x^k)^{-1}\left[ \nabla {f}(x^{\star }_f) - \nabla {f}(x^k) - \nabla ^2{f}(x^k)(x^{\star }_f - x^k)\right] \Big \Vert _{x^k}\) and consider three cases as follows:
\(\mathrm {(a)}\) For \(\nu = 2\), using Corollary 2, we have \(\left( \frac{1-e^{-\bar{\beta }_k}}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \left( \frac{e^{\bar{\beta }_k}-1}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k)\), where \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\). Using the above inequality, we can show that
Let \(\underline{\sigma }_k := \lambda _{\min }(\nabla ^2{f}(x^k))\). We first derive
where \(K(x^k,x^{\star }_f) :=\int _0^1 \nabla ^2{f}(x^k)^{-1/2}\nabla ^2{f}(x^k + t(x^{\star }_f - x^k))\nabla ^2{f}(x^k)^{-1/2}dt\). Using Corollary 2 and noting that \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\), we can estimate \(\Vert K(x^k,x^{\star }_f)\Vert \le \frac{e^{\bar{\beta }_k} - 1}{\bar{\beta }_k}\). Using the last two estimates, and the definition of \(\beta _k\), we can derive
provided that \(\bar{\beta }_k \le 1\). Since the step-size \(\tau _k = \frac{1}{\beta _k}\ln (1+\beta _k)\), we have \(1 - \tau _k \le \frac{\beta _k}{2} \le \frac{M_fe\Vert x^k - x^{\star }_f\Vert _{x^k}}{2\sqrt{\underline{\sigma }_k}}\). On the other hand, \(\frac{e^{\bar{\beta }_k}-1 - \bar{\beta }_k}{\bar{\beta }_k^2} \le \frac{e}{2}\) for all \(0\le \bar{\beta }_k \le 1\). Substituting \(T_k\) into (60) and using these relations, we have
provided that \(\bar{\beta }_k \le 1\). On the other hand, by Proposition 8, we have \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le e^{\frac{\bar{\beta }_{k+1} + \bar{\beta }_k}{2}}\Vert x^{k+1} - x^{\star }_f\Vert _{x^k}\) and \(\underline{\sigma }_{k+1}^{-1} \le e^{\bar{\beta }_k + \bar{\beta }_{k+1}}\underline{\sigma }_k^{-1}\). In addition, \(\bar{\beta }_k \le \frac{M_f}{\sqrt{\underline{\sigma }_k}}\Vert x^{k} - x^{\star }_f\Vert _{x^k}\). Combining the above inequalities, we finally get
Since \(\beta _k\le 1\) and \(\beta _{k+1} \le 1\), this estimate shows that \(\left\{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\right\} \) quadratically converges to zero. Since \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\), we can also conclude that \(\left\{ \Vert x^k - x^{\star }_f\Vert _2\right\} \) quadratically converges to zero.
\(\mathrm {(b)}\) For \(\nu = 3\), we can follow [40]. However, for completeness, we give a short proof here. Using Corollary 2, we have \(\left( 1 - r_k + \frac{r_k^2}{3}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \frac{1}{1-r_k}\nabla ^2{f}(x^k)\), where \(r_k := 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Using the above inequality, we can show that
Substituting \(T_k\) into (60) and using \(\tau _k = \frac{1}{1 + 0.5M_f\lambda _k}\), we have
Next, we need to upper bound \(\lambda _k\). Since \(\nabla {f}(x^{\star }_f) = 0\), using Corollary 2, we can bound \(\lambda _k\) as
provided that \(M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Upper-bounding the right-hand side of the above inequality using this bound, we get
provided that \(M_f\Vert x^k-x_f^{\star }\Vert _{x^k} < 1\). On the other hand, we can also estimate \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le \frac{\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}}}{1 - 0.5M_f\left( \Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}} + \Vert x^k - x^{\star }_f\Vert _{x^k}\right) }\). Combining the last two inequalities, we get
The right-hand side function \(\psi (t) = \frac{2M_f}{1 - 2M_ft^2 - 0.5M_ft}\) satisfies \(\psi (t) \le 4M_f\) on \(t \in \left[ 0, \frac{1}{2M_f} \right] \). Hence, if \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{1}{2M_f}\), then \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le 4M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2\). This shows that if \(x^0\in \mathrm {dom}(f)\) is chosen such that \(\Vert x^0 - x^{\star }_f\Vert _{x^0} \le \frac{1}{4M_f}\), then \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^k}\right\} \) quadratically converges to zero.
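To see the effect of the contraction \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le 4M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2\), one can simulate the worst-case recursion; the normalization \(M_f = 1\) below is an illustrative assumption, not a value from the paper:

```python
# Worst-case simulation of the quadratic contraction in part (b), with the
# illustrative normalization M_f = 1 (any M_f > 0 behaves the same after scaling).
M_f = 1.0
r = 1.0 / (8.0 * M_f)          # start inside the basin r_0 <= 1/(4*M_f)
history = [r]
for _ in range(6):
    r = 4.0 * M_f * r * r      # r_{k+1} <= 4*M_f*r_k^2: correct digits roughly double
    history.append(r)
```

After six steps the residual drops from about `1e-1` to below `1e-15`, the characteristic doubling of correct digits in quadratic convergence.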
\(\mathrm {(c)}\) For \(\nu \in (2, 3)\), with the same argument as in the proof of Theorem 3, we can show that
where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^k}^{\nu -2}\). Using again the argument as in the proof of Theorem 3, we have
Here, \(C_{\nu }(\cdot ,\cdot )\) is a given function derived from \(R_{\nu }\). Under the condition that \(d^k_{\nu }\) and \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) are sufficiently small, we can show that \(C_{\nu }(d^k_{\nu },\Vert x^k - x^{\star }_f\Vert _{x^k}) \le \bar{C}_{\nu }\). Hence, the last inequality shows that \(\Big \{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} } \Big \}\) quadratically converges to zero. Since \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k}\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\), we have \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} }\). Hence, we can conclude that \(\left\{ \Vert x^k -x^{\star }_f\Vert _{H_k}\right\} \) also locally converges to zero at a quadratic rate. \(\square \)
1.6 The proof of Theorem 3: the convergence of the full-step Newton method
We divide this proof into two parts: the quadratic convergence of \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\), and the quadratic convergence of \(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\).
The quadratic convergence of\(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\): Since the full-step Newton scheme updates \(x^{k+1} := x^k - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), if we denote \(n_{\mathrm {nt}}^k := x^{k+1} -x^k = - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), then the last expression leads to \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\). In addition, \(\Vert n_{\mathrm {nt}}^k\Vert _{x^k} = \Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = \lambda _k\). Using the definition of \(d_{\nu }(\cdot ,\cdot )\) in (12), we denote \(d^k_{\nu } := d_{\nu }(x^k, x^{k+1})\).
First, by \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\) and the mean-value theorem, we can show that
Let us define \(G_k := \int _0^1\left[ \nabla ^2{f}(x^k + tn_{\mathrm {nt}}^k) - \nabla ^2{f}(x^k)\right] dt\) and \(H_k := \nabla ^2{f}(x^k)^{-1/2}G_k\nabla ^2{f}(x^k)^{-1/2}\). Then, the above estimate implies \( \nabla {f}(x^{k+1}) = G_kn_{\mathrm {nt}}^k\). Hence, we can show that
By Lemma 2, we can estimate
where \(R_{\nu }\) is defined by (56). Combining the last two inequalities and using Proposition 8, we consider the following cases:
(a) If \(\nu = 2\), then we have \(\lambda _{k+1}^2 \le e^{d_2^k}\left[ \left\| \nabla {f}(x^{k+1})\right\| _{x^k}^{*}\right] ^2\) which implies \(\lambda _{k+1} \le e^{\frac{d_2^k}{2}}R_2(d_2^k)d_2^k\lambda _k\). Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\frac{1}{\underline{\sigma }_{k+1}}\le \frac{e^{d_2^k}}{\underline{\sigma }_k}\). Based on the above inequality, we have
By a numerical calculation, we can easily check that if \(d_2^k < d_2^{\star }\approx 0.12964\), then
Consequently, if \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove
by induction. Under the condition \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{d_2^{\star }}{M_f}\), the above inequality shows that the ratio \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) converges to zero at a quadratic rate.
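This quadratic decay of the Newton decrement is easy to observe numerically. The sketch below runs the full-step Newton scheme on a toy ridge-regularized logistic loss (logistic losses are generalized self-concordant with \(\nu = 2\)); the data, the ridge weight `mu`, and all variable names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Full-step Newton x^{k+1} = x^k - Hess^{-1} grad on a toy regularized logistic
# loss; we track the Newton decrement lambda_k = sqrt(grad^T Hess^{-1} grad).
# All problem data below are illustrative, not from the paper.
rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.standard_normal((n, p))
b = np.sign(rng.standard_normal(n))
mu = 1.0                                   # ridge weight; keeps the Hessian well conditioned

def grad_hess(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))  # sigma(-b_i * a_i^T x)
    g = -(A.T @ (b * s)) / n + mu * x      # grad of (1/n) sum log(1+exp(-b_i a_i^T x)) + (mu/2)||x||^2
    w = s * (1.0 - s)
    H = (A * w[:, None]).T @ A / n + mu * np.eye(p)
    return g, H

x = np.zeros(p)
lambdas = []
for _ in range(8):
    g, H = grad_hess(x)
    step = np.linalg.solve(H, g)           # Newton direction (up to sign)
    lambdas.append(float(np.sqrt(max(g @ step, 0.0))))
    x -= step                              # full Newton step
```

For this well-conditioned toy problem the decrement falls to machine precision within a handful of iterations, consistent with the local quadratic rate proved above.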
Now, if \(\nu > 2\), then we consider different cases. Note that
which implies that
Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| d^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\) and \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\). Based on these relations and (61), we can argue as follows:
\(\mathrm {(b)}\) If \(2< \nu < 3\), then \(\lambda _k \ge \left\| d^k\right\| _2\sqrt{\underline{\sigma }_k}\), which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence,
If \(d_{\nu }^k < d_{\nu }^{\star }\), where \(d_{\nu }^{\star }\) is the unique solution to the equation
then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\). Note that it is straightforward to check that this equation always admits a positive solution. Hence, if we choose \(x^0\in \mathrm {dom}(f)\) such that \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 < \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:
In addition, the above inequality also shows that \(\left\{ \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\right\} \) quadratically converges to zero.
\(\mathrm {(c)}\) If \(\nu = 3\), then \(d_3^k= \frac{M_f}{2}\lambda _k\), and
Directly checking the right-hand side of the above estimate, one can show that if \(d_3^k < d_3^{\star }=0.5\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{1}{2M_f}\), then we can prove the following two inequalities together by induction:
Moreover, the first inequality above also shows that \(\left\{ \lambda _k\right\} \) converges to zero at a quadratic rate.
The quadratic convergence of\(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\): First, using Proposition 9 with \(x:= x^k\) and \(y= x^{\star }_f\), and noting that \(\nabla {f}(x^{\star }_f) = 0\), we have
where the last inequality follows from the Cauchy-Schwarz inequality. Hence, we obtain
We consider three cases:
(1) When \(\nu = 2\), we have \(\bar{\omega }_{\nu }(\tau ) = \frac{e^\tau -1}{\tau }\). Hence, \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1 - e^{-d_{\nu }(x^k, x^{\star }_f)}}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{d_{\nu }(x^k, x^{\star }_f)}{2} \ge \frac{1}{2}\) whenever \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Using this inequality in (62), we have \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = 2\lambda _k\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). On the other hand, by the definition of \(\underline{\sigma }_k\), we have \(\sqrt{\underline{\sigma }_k}\Vert x^k - x^{\star }_f\Vert _2 \le \Vert x^k - x^{\star }_f\Vert _{x^k}\). Combining the two last inequalities, we obtain \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{2\lambda _k}{\sqrt{\underline{\sigma }_k}}\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Since \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) locally converges to zero at a quadratic rate, the last relation shows that \(\big \{\Vert x^k - x^{\star }_f\Vert _2\big \}\) also locally converges to zero at a quadratic rate.
(2) For \(\nu = 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1}{1 + d_{\nu }(x^k, x^{\star }_f)}\) and \(d_{\nu }(x^k, x^{\star }_f) = \frac{M_f}{2}\Vert x^k - x^{\star }_f\Vert _{x^k}\). Hence, from (62), we obtain \(\frac{\Vert x^k - x^{\star }_f\Vert _{x^k} }{1 + 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} } \le \lambda _k\). This implies \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{\lambda _k}{1 - 0.5M_f\lambda _k}\) as long as \(0.5M_f\lambda _k < 1\). Clearly, since \(\lambda _k\) locally converges to zero at a quadratic rate, \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) also locally converges to zero at a quadratic rate.
(3) For \(2< \nu < 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \left( \frac{\nu -2}{\nu -4}\right) \frac{\left( 1 + d_{\nu }(x^k, x^{\star }_f) \right) ^{\frac{\nu -4}{\nu -2}} - 1}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{1}{\nu -2}d_{\nu }(x^k, x^{\star }_f) \ge \frac{1}{2}\) provided that \(d_{\nu }(x^k, x^{\star }_f) < \frac{\nu }{2}-1\). Similar to the case \(\nu = 2\), we have \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\lambda _k\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\). Hence, \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{2\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\). Since \(\big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\big \}\) locally converges to zero at a quadratic rate, \(\big \{\Vert x^k -x^{\star }_f\Vert _{H_k} \big \}\) also locally converges to zero at a quadratic rate. \(\square \)
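The elementary lower bounds on \(\bar{\omega }_{\nu }\) used in cases (1) and (3) can be verified numerically. A sanity check outside the proof (grid sampling only, with illustrative tolerances):

```python
import numpy as np

# Check the bounds used above: for nu = 2, (1 - e^{-d})/d >= 1 - d/2 on (0, 1];
# for 2 < nu < 3, ((nu-2)/(nu-4))*((1+d)^((nu-4)/(nu-2)) - 1)/d >= 1 - d/(nu-2)
# whenever 0 < d < nu/2 - 1. Both margins should be nonnegative.
d = np.linspace(1e-6, 1.0, 1000)
margin_nu2 = np.min(-np.expm1(-d) / d - (1.0 - d / 2.0))

margin_nu = np.inf
for nu in np.linspace(2.05, 2.95, 19):
    dd = np.linspace(1e-6, nu / 2.0 - 1.0 - 1e-6, 500)
    lhs = ((nu - 2.0) / (nu - 4.0)) * ((1.0 + dd) ** ((nu - 4.0) / (nu - 2.0)) - 1.0) / dd
    margin_nu = min(margin_nu, np.min(lhs - (1.0 - dd / (nu - 2.0))))
```

Both margins vanish only as \(d \rightarrow 0^{+}\), where the two sides agree to first order, so the tolerances below allow for floating-point cancellation near zero.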
1.7 The proof of Theorem 5: convergence of the damped-step PN method
Given \(H\in \mathcal {S}^p_{++}\) and a proper, closed, and convex function \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\), we define
If \(H= \nabla ^2{f}(x)\) is the Hessian mapping of a strictly convex function f, then we can also write \(\mathcal {P}_{\nabla ^2 f(x)}(u)\) shortly as \(\mathcal {P}_{x}(u)\) for our notational convenience. The following lemma, whose proof can be found in [62], will be used in the sequel.
Lemma 3
Let \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\) be a proper, closed, and convex function, and \(H\in \mathcal {S}^p_{++}\). Then, the mapping \(\mathcal {P}_{H}^g\) defined above is non-expansive with respect to the weighted norm defined by \(H\), i.e., for any \(u,v\in \mathbb {R}^p\), we have
Let us define
for any vectors \(x,u\in \mathrm {dom}(f)\) and \(v\in \mathbb {R}^p\). We now prove Theorem 5 in the main text.
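Since the display defining \(\mathcal {P}_{H}^g\) does not survive in this excerpt, the following sketch assumes the standard scaled proximal mapping \(\mathcal {P}_H^g(u) := \arg \min _x \{ g(x) + \tfrac{1}{2}\Vert x - u\Vert _H^2\}\) and checks the Lemma 3-style non-expansiveness in the \(H\)-norm numerically for \(g = \lambda \Vert \cdot \Vert _1\) and diagonal \(H\), where the mapping reduces to coordinate-wise soft-thresholding; all names and data are illustrative:

```python
import numpy as np

# Scaled prox of g = lam*||.||_1 under an assumed standard definition
# P_H^g(u) = argmin_x { g(x) + 0.5*||x - u||_H^2 }; for diagonal H this is
# coordinate-wise soft-thresholding with per-coordinate threshold lam / H_ii.
def prox_scaled_l1(u, h, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam / h, 0.0)

def norm_H(x, h):
    return float(np.sqrt(np.sum(h * x * x)))

rng = np.random.default_rng(1)
h = rng.uniform(0.5, 2.0, size=10)      # diagonal of H (positive definite)
lam = 0.3
violations = 0
for _ in range(200):
    u, v = rng.standard_normal(10), rng.standard_normal(10)
    pu, pv = prox_scaled_l1(u, h, lam), prox_scaled_l1(v, h, lam)
    if norm_H(pu - pv, h) > norm_H(u - v, h) + 1e-12:
        violations += 1
```

Non-expansiveness holds here because the scalar soft-thresholding map is 1-Lipschitz in each coordinate.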
The proof of Theorem 5
Computing the step-size\(\tau _k\): Since \(z^k\) satisfies the optimality condition (36), we have
Using Proposition 10 we obtain
Since \(x^{k+1}=(1-\tau _k)x^k+\tau _kz^k\), using this relation and the convexity of g, we have
Summing up the last two inequalities, we obtain the following estimate
With the same argument as in the proof of Theorem 2, we obtain the conclusion of Theorem 5.
The proof of local quadratic convergence: We consider the distance between \(x^{k+1}\) and \(x^{\star }\) measured by \(\Vert x^{k+1}-x^{\star }\Vert _{x^{\star }}\). By the definition of \(x^{k+1}\), we have
Using the new notations in (64), it follows from the optimality condition (33) and (36) that \(z^k = \mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^k)+e_{x^{\star }}(x^k,z^k))\) and \(x^{\star }=\mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^{\star }))\). By Lemma 3 and the triangle inequality, we can show that
By following the same argument as in [62], if we apply Lemma 2, then we can derive
where \(R_{\nu }(\cdot )\) is defined by (56).
Next, using the same argument as in the proof of (72) in Theorem 6 below, we can bound the second term \(\Vert e_{x^{\star }}(x^k,z^k) \Vert ^{*}_{x^{\star }}\) of (66) as
Combining this inequality, (66), (67), and the triangle inequality, we obtain
where \(\hat{R}_{\nu }\) and \(\tilde{R}_{\nu }\) are defined as
respectively. After a few simple calculations, one can show that there exists a constant \(c_{\nu } \in (0, +\,\infty )\) such that if \(t\in [0,\bar{d}_{\nu }]\), then both \(\hat{R}_{\nu }(t)\) and \(\tilde{R}_{\nu }(t)\in [0,c_{\nu }]\) (taking the limit as \(t \rightarrow 0^{+}\)), where \(\bar{d}_2:=\frac{3}{5}\) and \(\bar{d}_{\nu }:= 1-\left( \frac{2}{3}\right) ^{\frac{\nu -2}{2}}\) for \(\nu > 2\), respectively. Using this bound, (65), (68), and the fact that \(\tau _k \le 1\), we can bound
Let \(\underline{\sigma }^{\star } := \sigma _{\min }(\nabla ^2 f(x^{\star }))\) be the smallest eigenvalue of \(\nabla ^2 f(x^{\star })\). We consider the following cases:
(a) If \(\nu =2\), then, for \(0 \le d_{\nu }(x^{\star }, x^k) \le \bar{d}_{\nu }\), we can bound \(1-\tau _k\) as
On the other hand, we have \(d_{\nu }(x^{\star },x^k)=M_f\Vert x^k - x^{\star } \Vert _2 \le \frac{M_f}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k-x^{\star }\Vert _{x^{\star }}\). Substituting these estimates into (69), we get
Let \(c^{\star }_{\nu } := \frac{3c_{\nu }M_f}{2\sqrt{\underline{\sigma }^{\star }}}\). The last estimate shows that if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ \frac{ \bar{d}_{\nu }\sqrt{\underline{\sigma }^{\star }}}{M_f}, \frac{1}{c^{\star }_{\nu }}\right\} \), then \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.
(b) If \(2 < \nu \le 3\), then we first show that
Hence, if \(\Vert x^k-x^{\star }\Vert _{x^{\star }}\le m_{\nu }\bar{d}_{\nu }\), where \(m_{\nu } := \tfrac{2}{\nu -2}\frac{\left( \underline{\sigma }^{\star }\right) ^{\frac{3-\nu }{2}}}{M_f}\), then \(d_{\nu }(x^{\star },x^k)\le \bar{d}_{\nu }\). Next, using the definition of \(d_k\) in (28), we can bound it as
Using this estimate, we can bound \(1-\tau _k\) as follows:
where \(n_{\nu } := \frac{(4 -\nu )M_f}{2(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}c_{\nu } > 0\). Substituting this estimate into (69) and noting that \(d_{\nu }(x^{\star }, x^k) \le \frac{1}{m_{\nu }}\Vert x^k - x^{\star }\Vert _{x^{\star }}\), we get
Hence, if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ m_{\nu }\bar{d}_{\nu }, \frac{1}{c^{\star }_{\nu }}\right\} \), then the last estimate shows that the sequence \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.
In summary, there exists a neighborhood \(\mathcal {N}(x^{\star })\) of \(x^{\star }\), such that if \(x^0\in \mathcal {N}(x^{\star })\cap \mathrm {dom}(F)\), then the whole sequence \(\left\{ \Vert x^k-x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero. \(\square \)
1.8 The proof of Theorem 6: locally quadratic convergence of the PN method
Since \(z^k\) is the optimal solution to (35), which satisfies (36), we have \(\nabla ^2 f(x^k)x^k-\nabla f(x^k)\in (\nabla ^2 f(x^k) + \partial g)(z^k)\). Using this optimality condition, we get
Let us define \(\tilde{\lambda }_{k+1}:=\Vert n_{\mathrm {pnt}}^{k+1}\Vert _{x^k}\). Then, by Lemma 3 and the triangle inequality, we have
Let us first bound the term \(\left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}\) as follows:
where \(R_{\nu }(t)\) is defined by (56). Indeed, from the mean-value theorem, we have
where \(H\) is defined by (54). Combining the above inequality and (56) in Lemma 2, we get (71).
Next we bound the term \(\left\| e_{x^k}(x^{k+1},z^{k+1})\right\| _{x^k}^{*}\) as follows:
Note that
where
By Proposition 8, we have
This inequality can be simplified as
Hence, the inequality (72) holds.
Now, we combine (70), (71), and (72). If \(\nu = 2\) and \(d_2^k < \ln 2\), then we get
By Proposition 8, we have \(\lambda _{k+1}^2\le e^{d_{\nu }^k}\tilde{\lambda }_{k+1}^2\). Combining this estimate and the last inequality, we get
Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\underline{\sigma }_{k+1}^{-1}\le e^{d_2^k}\underline{\sigma }_k^{-1}\). It follows from (74) that
By a numerical calculation, we can check that if \(d_2^k \le d_2^{\star }\approx 0.35482\), then
Hence, if we choose \(x^0\in \mathrm {dom}(F)\) such that \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} \le \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_2^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the local quadratic convergence of the sequence \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \).
Now, if \(\nu > 2\) and \(d_{\nu }^k < 1- \left( {\frac{1}{2}}\right) ^{\frac{\nu -2}{2}}\), then
By Proposition 8, we have \(\lambda _{k+1}^2 \le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\tilde{\lambda }_{k+1}^2\). Hence, combining these inequalities, we get
Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| p^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\), \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\) and \(\sigma _{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\sigma _k^{-1}\). Using these relations and (75), we consider two cases:
(a) If \(\nu = 3\), then \(d_3^k = \frac{M_f}{2}\lambda _k\), and
By a simple numerical calculation, we can show that if \(d_3^k \le d_3^{\star }\approx 0.20943\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{2}{M_f}d_3^{\star }\), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_3^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the quadratic convergence of the sequence \(\left\{ \lambda _k\right\} \).
(b) If \(2< \nu < 3\), then \(\lambda _k \ge \Vert p^k\Vert _2\sqrt{\underline{\sigma }_k}\) which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence, we have
If \(d_{\nu }^k < d_{\nu }^{\star }\), then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\), where \(d_{\nu }^{\star }\) is the unique solution to the equation
Note that it is straightforward to check that this equation always admits a positive solution. Therefore, if \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 \le \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_{\nu }^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the quadratic convergence of the sequence \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\).
Finally, to prove the local quadratic convergence of \(\left\{ x^k\right\} \) to \(x^{\star }\), we use the same argument as in the proofs of Theorem 3 and Theorem 5; we omit the details here. \(\square \)
1.9 The proof of Theorem 7: convergence of the quasi-Newton method
The full-step quasi-Newton method for solving (24) can be written as \(x^{k+1} = x^k - B_k\nabla {f}(x^k)\). With \(H_k := B_k^{-1}\), this is equivalent to \(H_k(x^{k+1} - x^k) + \nabla {f}(x^k) = 0\). Using this relation and \(\nabla {f}(x^{\star }_f) = 0\), we can write
We first consider \(T_k := \Vert \nabla ^2{f}(x^{\star }_f)^{-1}\left[ \nabla {f}(x^k) - \nabla {f}(x^{\star }_f) - \nabla ^2{f}(x^{\star }_f)(x^k - x^{\star }_f) \right] \Vert _{x^{\star }_f}\). Similar to the proof of Theorem 3, we can show that
where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}^{\nu -2}\). Moreover, we note that
Combining this estimate, (76), and (77), we can derive
First, we prove statement (a). Indeed, from the Dennis–Moré condition (41), we have
where \(\lim _{k\rightarrow \infty }\gamma _k = 0\). Substituting this estimate into (78), and noting that \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{1}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\), where \(\underline{\sigma }^{\star } := \lambda _{\min }(\nabla ^2{f}(x^{\star }_f)) > 0\), we can show that
provided that \(\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\) and \(R_{\nu }^{\star } := \max \left\{ R_{\nu }(d_{\nu }^k) \mid \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\right\} < +\,\infty \). Here, \(\bar{r} > 0\) is a given value such that \(R_{\nu }^{\star }\) is finite. The estimate (79) shows that if \(\bar{r}\) is sufficiently small, \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\right\} \) superlinearly converges to zero. Finally, the statement (b) is proved similarly by combining statement (a) and [62, Theorem 11]. \(\square \)
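The superlinear mechanism behind (79) can be seen on a scalar model: if the errors obey \(e_{k+1} \le (\gamma _k + c\,e_k)e_k\) with \(\gamma _k \rightarrow 0\), then the ratios \(e_{k+1}/e_k\) tend to zero. A toy simulation, with all constants illustrative rather than taken from the paper:

```python
# Scalar model of the recursion behind (79): e_{k+1} <= (gamma_k + c*e_k)*e_k
# with gamma_k -> 0 forces e_{k+1}/e_k -> 0, i.e. superlinear convergence.
# The constants below are illustrative, not from the paper.
c = 1.0
e = 0.1
errs = [e]
for k in range(10):
    gamma = 1.0 / (k + 2) ** 2          # any nonnegative sequence tending to zero works
    e = (gamma + c * e) * e
    errs.append(e)
ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
```

Here each ratio equals exactly \(\gamma _k + c\,e_k\), so the ratios decrease monotonically toward zero, which is the defining property of superlinear convergence.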
Sun, T., Tran-Dinh, Q. Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2019). https://doi.org/10.1007/s10107-018-1282-4
Keywords
- Generalized self-concordance
- Newton-type method
- Proximal Newton method
- Quadratic convergence
- Local and global convergence
- Convex optimization