Abstract
We study the smooth structure of convex functions by generalizing the powerful concept of self-concordance, introduced by Nesterov and Nemirovskii in the early 1990s, to a broader class of convex functions, which we call generalized self-concordant functions. This notion allows us to develop a unified framework for designing Newton-type methods to solve convex optimization problems. The proposed theory provides a mathematical tool for analyzing both the local and global convergence of Newton-type methods without imposing unverifiable assumptions, as long as the underlying functionals fall into our class of generalized self-concordant functions. First, we introduce the class of generalized self-concordant functions, which covers the class of standard self-concordant functions as a special case. Next, we establish several properties and key estimates of this function class that can be used to design numerical methods. We then apply this theory to develop several Newton-type methods for solving a class of smooth convex optimization problems involving generalized self-concordant functions. We provide an explicit step-size for a damped-step Newton-type scheme that guarantees global convergence without any globalization strategy. We also prove local quadratic convergence of this method and of its full-step variant without requiring Lipschitz continuity of the Hessian of the objective. We then extend our results to develop proximal Newton-type methods for a class of composite convex minimization problems involving generalized self-concordant functions, again achieving both global and local convergence without additional assumptions. Finally, we verify our theoretical results on several numerical examples and compare them with existing methods.
References
Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15(1), 595–627 (2014)
Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operators Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)
Becker, S., Fadili, M.J.: A quasi-Newton proximal splitting method. In: Proceedings of Neural Information Processing Systems Foundation (NIPS) (2012)
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). arXiv preprint arXiv:1609.08502
Bonnans, J.F.: Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim. 29, 161–186 (1994)
Borodin, A., El-Yaniv, R., Gogan, V.: Can we learn to beat the best stock. J. Artif. Intell. Res. (JAIR) 21, 579–594 (2004)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chen, T.-Y., Demmel, J.W.: Balancing sparse matrices for computing eigenvalues. Linear Algebra Appl. 309(1–3), 261–287 (2000)
Cohen, M., Madry, A., Tsipras, D., Vladu, A.: Matrix scaling and balancing via box constrained Newton’s method and interior-point methods. The 58th Annual IEEE Symposium on Foundations of Computer Science, pp. 902–913 (2017)
Dennis, J.E., Moré, J.J.: A characterisation of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28, 549–560 (1974)
Deuflhard, P.: Newton Methods for Nonlinear Problems—Affine Invariance and Adaptative Algorithms, volume 35 of Springer Series in Computational Mathematics, 2nd edn. Springer, Berlin (2006)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)
Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, pp. 3052–3060 (2015)
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)
Friedlander, M., Goh, G.: Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal. 46, 1–22 (2017)
Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without line search for self-concordant functions (2016). arXiv preprint arXiv:1612.06965
Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014, pp. 5058–5063
Goel, V., Grossmann, I.E.: A class of stochastic programs with decision dependent uncertainty. Math. Program. 108, 355–394 (2006)
Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization From Theory to Implementation, Nonconvex Optimization and its Applications, pp. 155–210. Springer, Berlin (2006)
Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions (2009)
Hazan, E., Arora, S.: Efficient Algorithms for Online Convex Optimization and their Applications. Princeton University, Princeton (2006)
He, N., Harchaoui, Z., Wang, Y., Song, L.: Fast and simple optimization for Poisson likelihood models (2016). arXiv preprint arXiv:1608.01264
Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, New York (2005)
Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)
Krishnapuram, B., Figueiredo, M., Carin, L., Hartemink, H.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 27, 957–968 (2005)
Kyrillidis, A., Karimi, R., Tran-Dinh, Q., Cevher, V.: Scalable sparse covariance estimation via self-concordance. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 1946–1952 (2014)
Lebanon, G., Lafferty, J.: Boosting and maximum likelihood for exponential models. Adv. Neural Inf. Process. Syst. (NIPS) 14, 447 (2002)
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for convex optimization. SIAM J. Optim. 24(3), 1420–1443 (2014)
Lu, Z.: Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2016)
Marron, J.S., Todd, M.J., Ahn, J.: Distance-weighted discrimination. J. Am. Stat. Assoc. 102(480), 1267–1271 (2007)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. CRC Press, Boca Raton (1989)
Monteiro, R.D.C., Sicre, M.R., Svaiter, B.F.: A hybrid proximal extragradient self-concordant primal barrier method for monotone variational inequalities. SIAM J. Optim. 25(4), 1965–1996 (2015)
Nelder, J.A., Baker, R.J.: Generalized Linear Models. Encyclopedia of Statistical Sciences. Wiley, New York (1972)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Cubic regularization of Newton’s method for convex problems with constraints. CORE Discussion Paper 2006/39, Catholic University of Louvain (UCL) - Center for Operations Research and Econometrics (CORE) (2006)
Nesterov, Y.: Accelerating the cubic regularization of Newton's method on convex problems. Math. Program. 112, 159–181 (2008)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Nemirovski, A.: Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, Philadelphia (1994)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 112(1), 177–205 (2006)
Nocedal, J., Wright, S.J.: Numerical Optimization, Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, Berlin (2006)
O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)
Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank–Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, Philadelphia (2000)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Parlett, B.N., Landis, T.L.: Methods for scaling to doubly stochastic form. Linear Algebra Appl. 48, 53–79 (1982)
Peng, J., Roos, C., Terlaky, T.: Self-Regularity. A New Paradigm for Primal-Dual Interior-Point Algorithms. Princeton University Press, Princeton (2009)
Pilanci, M., Wainwright, M.J.: Newton sketch: a linear-time optimization algorithm with linear-quadratic convergence (2015). arXiv preprint arXiv:1505.02250
Polyak, R.A.: Regularized Newton method for unconstrained convex optimization. Math. Program. 120(1), 125–145 (2009)
Robinson, S.M.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods I: globally convergent algorithms (2016). arXiv preprint arXiv:1601.04737
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods II: local convergence rates (2016). arXiv preprint arXiv:1601.04738
Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)
Toh, K.-Ch., Todd, M.J., Tütüncü, R.H.: On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming. Tech. Report 4, NUS Singapore (2010)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: A proximal Newton framework for composite minimization: graph learning without Cholesky decompositions and matrix inversions. JMLR W&CP 28(2), 271–279 (2013)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
Tran-Dinh, Q., Li, Y.-H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Pham Dinh, T., Le-Thi, H., Nguyen, N. (eds.) Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, New York (2015)
Tran-Dinh, Q., Necoara, I., Diehl, M.: A dual decomposition algorithm for separable nonconvex optimization using the penalty function framework. In: Proceedings of the Conference on Decision and Control (CDC), Florence, Italy, December, pp. 2372–2377 (2013)
Tran-Dinh, Q., Necoara, I., Diehl, M.: Path-following gradient-based decomposition algorithms for separable convex optimization. J. Global Optim. 59(1), 59–80 (2014)
Tran-Dinh, Q., Necoara, I., Savorgnan, C., Diehl, M.: An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim. 23(1), 95–125 (2013)
Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Technical Report (submitted) (2016)
Vapnik, V.N., Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)
Verscheure, D., Demeulenaere, B., Swevers, J., De Schutter, J., Diehl, M.: Time-optimal path tracking for robots: a convex optimization approach. IEEE Trans. Autom. Control 54, 2318–2327 (2009)
Yamashita, M., Fujisawa, K., Kojima, M.: Implementation and evaluation of SDPA 6.0 (SemiDefinite Programming Algorithm 6.0). Optim. Method Softw. 18, 491–505 (2003)
Yang, T., Lin, Q.: RSG: beating SGD without smoothness and/or strong convexity. CoRR abs/1512.03107 (2016)
Zhang, Y., Lin, X.: DiSCO: Distributed optimization for self-concordant empirical loss. In: Proceedings of the 32th International Conference on Machine Learning, pp. 362–370 (2015)
Acknowledgements
This work is partially supported by NSF grant no. DMS-1619884, USA.
Appendix: The proof of technical results
This appendix provides the full proofs of the technical results stated in this paper, including the proofs omitted from the main text, together with a full convergence analysis of the Newton-type methods presented there.
1.1 The proof of Proposition 6: Fenchel’s conjugate
Let us consider the set \(\mathcal {X}:= \{x\in \mathbb {R}^p \mid f(u) - \langle x, u\rangle ~\text {is bounded from below on}~\mathrm {dom}(f)\}\). We first show that \(\mathrm {dom}(f^{*}) = \mathcal {X}\).
By the definition of \(\mathrm {dom}(f^{*})\), we have \(\mathrm {dom}(f^{*}) = \left\{ x\in \mathbb {R}^p \mid f^{*}(x) < +\,\infty \right\} \). Taking any \(x\in \mathrm {dom}(f^{*})\), one has \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} <+\,\infty \). Hence, \(f(u) - \langle x,u\rangle \ge -f^{*}(x) > -\infty \) for all \(u\in \mathrm {dom}(f)\), which implies \(x\in \mathcal {X}\).
Conversely, assume that \(x\in \mathcal {X}\). By the definition of \(\mathcal {X}\), \(f(u)-\langle x,u\rangle \) is bounded from below for all \(u\in \mathrm {dom}(f)\). That is, there exists \(M \in [0, +\,\infty )\), such that \(f(u) - \langle x, u\rangle \ge -M\) for all \(u\in \mathrm {dom}(f)\). By the definition of the conjugate, \(f^{*}(x) = \max _{u\in \mathrm {dom}(f)}\left\{ \langle x, u\rangle - f(u)\right\} \le M < +\,\infty \). Hence, \(x\in \mathrm {dom}(f^{*})\).
For any \(x\in \mathrm {dom}(f^{*})\), the optimality condition of \(\max _{u}\left\{ \langle x, u\rangle - f(u)\right\} \) is \(x = \nabla {f}(u)\). Let us denote \(x(u) := \nabla {f}(u)\). Then, we have \(f^{*}(x(u)) = \langle x(u), u\rangle - f(u)\). Taking the derivative of \(f^{*}\) with respect to x on both sides, and using \(x(u)=\nabla f(u)\), we have \(\nabla {f^{*}}(x(u)) = u\).
We further take the second-order derivative of the above equation with respect to u to get \(\nabla ^2{f^{*}}(x(u))\,x_u'(u) = \mathbb {I}\).
Using the two relations above and the fact that \(x_u'(u) = \nabla ^2{f}(u)\), we can derive
where \(u\in \mathrm {dom}(f)\), and \(v, w\in \mathbb {R}^p\). Using (50) and (51), we can compute the third-order derivative of \(f^{*}\) with respect to x(u) as
Denote \(\xi := x_u'(u)w\) and \(\eta := x_u'(u)v\). Since \(x_u'(u) = \nabla ^2{f}(u)\), we have \(\xi = \nabla ^2{f}(u)w\), \(\eta = \nabla ^2{f}(u)v\), and \(w = \nabla ^2{f}(u)^{-1}\xi \). Using these relations and \(\nabla ^2f^{*}(x(u))x_u'(u) = \mathbb {I}\), we can derive
For any \(H\in \mathcal {S}^p_{++}\), we have \(\langle H\xi , \xi \rangle \le \left\| H\xi \right\| _2\left\| \xi \right\| _2\). For any \(\nu \ge 3\), this inequality leads to
Using this inequality with \(H = \nabla ^2f^{*}(x(u))\) into the last expression, we obtain
By Definition 2, we need \(\nu - 3 = 3 - \nu _{*}\) and \(4-\nu = \nu _{*} - 2\) which hold if \(\nu _{*} = 6 - \nu \). Under the choice of \(\nu _{*}\), the above inequality shows that \(f^{*}\) is \((M_{f^{*}}, \nu _{*})\)-generalized self-concordant with \(M_{f^{*}} = M_f\) and \(\nu _{*} = 6 - \nu \). However, to guarantee \(\nu - 3 \ge 0\) and \(6 - \nu > 0\), we require \(3 \le \nu < 6\).
Finally, we prove the case of univariate functions, i.e., \(p = 1\). Indeed, we have
Here, \(f'\) is the derivative of f with respect to u. Taking the derivative of the last equation on both sides with respect to u, we obtain
Solving this equation for \((f^{*})'''(x(u))\) and then using (53) and \(x''(u) = f'''(u)\), we get
This inequality shows that \(f^{*}\) is generalized self-concordant with \(\nu _{*} = 6 - \nu \) for any \(\nu \in (0, 6)\). \(\square \)
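The conjugate identities used in this proof can be illustrated numerically. The sketch below (an illustrative check, not part of the proof) uses the univariate example \(f(u) = e^u\), whose conjugate is \(f^{*}(x) = x\ln x - x\) for \(x > 0\), and verifies that \(x(u) = f'(u)\), \((f^{*})'(x(u)) = u\), and \((f^{*})''(x(u)) = 1/f''(u)\):

```python
import math

# Illustrative univariate example: f(u) = exp(u) has conjugate
# f*(x) = x*ln(x) - x on x > 0. We check the identities used in the
# proof: x(u) = f'(u), (f*)'(x(u)) = u, and (f*)''(x(u)) = 1/f''(u).
f = math.exp                           # f(u) = e^u, so f'(u) = f''(u) = e^u
f_conj = lambda x: x * math.log(x) - x

def num_deriv(g, x, h=1e-6):
    # central finite difference
    return (g(x + h) - g(x - h)) / (2 * h)

for u in [-1.0, 0.0, 0.7, 2.0]:
    x_u = math.exp(u)                              # x(u) = f'(u)
    # conjugate value attains sup_u {<x,u> - f(u)} at u
    assert abs(f_conj(x_u) - (x_u * u - f(u))) < 1e-9
    # (f*)'(x(u)) = u
    assert abs(num_deriv(f_conj, x_u) - u) < 1e-5
    # (f*)''(x(u)) = 1 / f''(u), via second-order central difference
    d2 = (f_conj(x_u + 1e-4) - 2 * f_conj(x_u) + f_conj(x_u - 1e-4)) / 1e-8
    assert abs(d2 - 1.0 / math.exp(u)) < 1e-4
print("conjugate identities verified")
```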
1.2 The proof of Corollary 2: bound on the mean of Hessian operator
Let \(y_{\tau } := x+ \tau (y- x)\). Then \(d_{\nu }(x, y_{\tau }) = \tau d_{\nu }(x, y)\). By (15), we have \(\nabla ^2{f}(x+ \tau (y- x)) \preceq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}\nabla ^2{f}(x)\) and \(\nabla ^2{f}(x+ \tau (y- x)) \succeq \left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}\nabla ^2{f}(x)\). Hence, we have
where \(\underline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{2}{\nu -2}}d\tau \) and \(\overline{I}_{\nu }(x, y) := \int _0^1\left( 1 - \tau d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}d\tau \) are the two integrals in the above inequality. Computing these integrals explicitly, we can show that
- If \(\nu = 4\), then \(\underline{I}_{\nu }(x,y) = \frac{1 - (1 - d_4(x,y))^2}{2d_4(x,y)}\) and \(\overline{I}_{\nu }(x, y) = \frac{-\ln (1 - d_4(x,y))}{d_4(x,y)}\).
- If \(\nu \ne 4\), then we can easily compute \(\underline{I}_{\nu }(x, y) = \frac{(\nu -2)}{\nu d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu }{\nu -2}}\right) \), and \(\overline{I}_{\nu }(x, y) = \frac{(\nu -2)}{(\nu -4)d_{\nu }(x,y)}\left( 1 - \left( 1 - d_{\nu }(x,y)\right) ^{\frac{\nu -4}{\nu -2}}\right) \).
Hence, we obtain (18).
Finally, we prove for the case \(\nu = 2\). Indeed, by (16), we have \(e^{-d_2(x,y_{\tau })}\nabla ^2f(x) \preceq \nabla ^2f(y_{\tau }) \preceq e^{d_2(x,y_{\tau })}\nabla ^2f(x)\). Since \(d_2(x, y_{\tau }) = \tau d_2(x, y)\), the last estimate leads to
which is exactly (18). \(\square \)
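As a sanity check on the closed-form values of \(\underline{I}_{\nu }(x,y)\) and \(\overline{I}_{\nu }(x,y)\) above, they can be compared against direct quadrature of the defining integrals; the values of \(\nu > 2\) and \(d = d_{\nu }(x,y)\) below are illustrative:

```python
import math

# Illustrative numerical check (not part of the proof): the closed-form
# expressions for I_lower and I_upper are compared with a midpoint-rule
# quadrature of their defining integrals over tau in [0, 1].
def quad(g, n=100000):
    # midpoint rule on [0, 1]
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))

def I_closed(nu, d):
    if nu == 4:
        lower = (1 - (1 - d)**2) / (2 * d)
        upper = -math.log(1 - d) / d
    else:
        lower = (nu - 2) / (nu * d) * (1 - (1 - d)**(nu / (nu - 2)))
        upper = (nu - 2) / ((nu - 4) * d) * (1 - (1 - d)**((nu - 4) / (nu - 2)))
    return lower, upper

for nu in [2.5, 3.0, 4.0, 5.0]:
    for d in [0.1, 0.5, 0.9]:
        lo_num = quad(lambda t: (1 - t * d)**(2 / (nu - 2)))
        up_num = quad(lambda t: (1 - t * d)**(-2 / (nu - 2)))
        lo, up = I_closed(nu, d)
        assert abs(lo - lo_num) <= 1e-5 * max(1.0, abs(lo))
        assert abs(up - up_num) <= 1e-5 * max(1.0, abs(up))
print("closed forms match quadrature")
```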
1.3 Technical lemmas
The following lemmas will be used in our analysis. Lemma 1 is elementary, but we provide its proof for completeness.
Lemma 1
- (a) For a fixed \(r \ge 1\) and \(\bar{t} \in (0, 1)\), consider the function \(\psi _r(t) := \frac{1 - (1-t)^r - rt(1-t)^r}{rt^2(1-t)^r}\) on \(t\in (0, 1)\). Then, \(\psi _r\) is positive and increasing on \((0, \bar{t}]\) and
$$\begin{aligned} \lim _{t\rightarrow 0^{+}}\psi _r(t) = \tfrac{r+1}{2},~~\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty ,~~\text {and} ~~~\sup _{0 \le t \le \bar{t}}\left| \psi _r(t)\right| \le \bar{C}_r(\bar{t}) < +\,\infty , \end{aligned}$$where \(\bar{C}_r(\bar{t}) := \frac{1 - (1-\bar{t})^r - r\bar{t}(1-\bar{t})^r}{r\bar{t}^2(1-\bar{t})^r} \in (0, +\,\infty )\).
- (b) For \(t > 0\), we also have \(\frac{e^{t} - 1 - t}{t} \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\).
Proof
The statement \(\mathrm {(b)}\) is elementary, so we only prove \(\mathrm {(a)}\). Since \(r \ge 1\), we have \(\lim _{t\rightarrow 0^{+}}(1 - (1-t)^r - rt(1-t)^r) = \lim _{t\rightarrow 0^{+}}rt^2(1-t)^r = 0\) and \(rt^2(1-t)^r > 0\) for \(t\in (0, 1)\). Applying L'Hôpital's rule, we have
The limit \(\lim _{t\rightarrow 1^{-}}\psi _r(t) = +\,\infty \) is obvious.
Next, it is easy to compute \(\psi '_r(t) = \frac{(1-t)^{r+1}(rt+2)+(r+2)t-2}{rt^3(1-t)^{r+1}}\). Let \(m_r(t) := (1-t)^{r+1}(rt+2)+(r+2)t-2\) be the numerator of \(\psi '_r(t)\).
We have \(m_r'(t) = r+2 - (1-t)^r(r^2t+2rt+r+2)\), and \(m_r''(t) = r(r+1)(r+2)t(1-t)^{r-1}\). Clearly, since \(r \ge 1\), \(m_r''(t) \ge 0\) for \(t \in [0, 1]\). This implies that \(m_r'\) is nondecreasing on [0, 1]. Hence, \(m_r'(t) \ge m_r'(0) = 0\) for all \(t \in [0, 1]\). Consequently, \(m_r\) is nondecreasing on [0, 1]. Therefore, \(m_r(t) \ge m_r(0) = 0\) for all \(t\in [0, 1]\). Using the formula of \(\psi '_r\), we can see that \(\psi '_r(t) \ge 0\) for all \(t \in (0, 1)\). This implies that \(\psi _r\) is nondecreasing on (0, 1). Moreover, \(\lim _{t\rightarrow 0^+}\psi _r(t) = \frac{r+1}{2} > 0\). Hence, \(\psi _r(t) > 0\) for all \(t\in (0, 1)\). This implies that \(\psi _r\) is bounded on \((0, \bar{t}] \subset (0, 1)\) by \(\psi _r(\bar{t})\). \(\square \)
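Lemma 1 lends itself to a quick numerical sanity check (an illustration, not a proof): the limit \(\lim _{t\rightarrow 0^{+}}\psi _r(t) = \frac{r+1}{2}\), the positivity and monotonicity of \(\psi _r\) on \((0, 1)\), and the elementary bound in part (b):

```python
import math

# Numerical sanity check of Lemma 1 for a few sample values of r >= 1.
def psi(r, t):
    return (1 - (1 - t)**r - r * t * (1 - t)**r) / (r * t**2 * (1 - t)**r)

for r in [1.0, 2.0, 3.5]:
    # limit at 0+ equals (r + 1)/2
    assert abs(psi(r, 1e-5) - (r + 1) / 2) < 1e-3
    # positive and increasing on a grid in (0, 1)
    grid = [0.01 * i for i in range(1, 100)]
    vals = [psi(r, t) for t in grid]
    assert all(v > 0 for v in vals)
    assert all(b > a for a, b in zip(vals, vals[1:]))

# part (b): (e^t - 1 - t)/t <= (3/2 + t/3) * t * e^t for t > 0
for t in [0.01 * i for i in range(1, 500)]:
    assert (math.exp(t) - 1 - t) / t <= (1.5 + t / 3) * t * math.exp(t)
print("Lemma 1 checks pass")
```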
Similar to Corollary 2, we can prove the following lemma on the bound of the Hessian difference.
Lemma 2
Given \(x, y\in \mathrm {dom}(f)\), the matrix \(H(x,y)\) defined by
satisfies
where \(R_{\nu }(t)\) is defined as follows for \(t \in [0, 1)\):
Moreover, for a fixed \(\bar{t} \in (0, 1)\), we have \(\sup _{0 \le t \le \bar{t}}\left| R_{\nu }(t)\right| \le \bar{M}_{\nu }(\bar{t})\), where
Proof
By Corollary 2, if we define \(G(x,y) := \int _0^1 \left[ \nabla ^2{f}(x+ \tau (y-x)) - \nabla ^2{f}(x)\right] d\tau \), then
Since \(H(x,y) = \nabla ^2f(x)^{-1/2}G(x,y)\nabla ^2f(x)^{-1/2}\), the last inequality implies
Let \(C_{\max }(t) := \max \big \{1 - \underline{\kappa }_{\nu }(t), \overline{\kappa }_{\nu }(t) - 1 \big \}\) for \(t \in [0, 1)\). We consider three cases.
(a) For \(\nu = 2\), since \(e^{-t} + e^{t} \ge 2\), we have \(\frac{1-e^{-t}}{t}+ \frac{e^{t}-1}{t} \ge 2\) which implies \(C_{\max }(t) = \overline{\kappa }_{\nu }(t) - 1 = \frac{e^{t}-1-t}{t}\). Hence, by Lemma 1, we have \(C_{\max }(t) \le \left( \frac{3}{2} + \frac{t}{3}\right) te^t\) which leads to \(R_{\nu }(t) := \left( \frac{3}{2} + \frac{t}{3}\right) e^t\).
(b) For \(\nu \in (2, 3]\), we have
Indeed, we show that \(\frac{(\nu - 2)}{(4 -\nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] + \frac{(\nu - 2)}{\nu t}\left[ 1 - (1 - t)^{\frac{\nu }{\nu - 2}}\right] \ge 2\). Let \(u := \frac{4-\nu }{\nu -2} > 0\) and \(v := \frac{\nu }{\nu -2} > 0\). The last inequality is equivalent to \(\frac{1}{u}\left[ \frac{1}{(1 - t)^u}-1\right] + \frac{1}{v}\left[ 1 - (1-t)^v\right] \ge 2t\) which can be reformulated as \(\frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t \ge 0\). Consider \(s(t) := \frac{1}{v} - \frac{1}{u} + \frac{1}{u(1-t)^u} - \frac{(1-t)^v}{v} - 2t\). It is clear that \(s'(t) = \frac{1}{(1-t)^{u+1}} + (1-t)^{v-1} - 2 = (1-t)^{-\frac{2}{\nu -2}} + (1-t)^{\frac{2}{\nu -2}} - 2 \ge 0\) for all \(t\in [0, 1)\). We obtain \(s(t) \ge s(0) = 0\). Hence, \(C_{\max }(t) = \frac{(\nu - 2)}{(4 - \nu ) t}\Big [\frac{1}{(1 - t)^{\frac{4-\nu }{\nu -2}}} - 1\Big ] - 1\).
Let us define \(r := \frac{4-\nu }{\nu -2} = \frac{2}{\nu -2} - 1\). Then, it is clear that \(\nu = 2 + \frac{2}{1+r}\), and \(\nu \in (2, 3]\) is equivalent to \(r \ge 1\). Now, using Lemma 1 with \(r = \frac{2}{\nu -2} - 1 \ge 1\), we obtain \(R_{\nu }(t) := \frac{1 - (1-t)^{\frac{4-\nu }{\nu -2}} - \left( \frac{4-\nu }{\nu -2}\right) t(1-t)^{\frac{4-\nu }{\nu -2}}}{\left( \frac{4-\nu }{\nu -2}\right) t^2(1-t)^{\frac{4-\nu }{\nu -2}}}\). Putting (a) and (b) together, we obtain (55) with \(R_{\nu }\) defined by (56). The boundedness of \(R_{\nu }\) follows from Lemma 1. \(\square \)
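The two facts used in this proof for \(\nu \in (2, 3]\) can be confirmed numerically from the explicit forms derived above (an illustrative check, not part of the proof): (i) \(\overline{\kappa }_{\nu }(t) - 1 \ge 1 - \underline{\kappa }_{\nu }(t)\), so \(C_{\max }(t) = \overline{\kappa }_{\nu }(t) - 1\); and (ii) \(\overline{\kappa }_{\nu }(t) - 1 = t\,R_{\nu }(t)\) with \(R_{\nu } = \psi _r\) from Lemma 1 and \(r = \frac{4-\nu }{\nu -2}\):

```python
# Numerical check of facts (i) and (ii) above for sample nu in (2, 3].
def kappa_lower(nu, t):
    return (nu - 2) / (nu * t) * (1 - (1 - t)**(nu / (nu - 2)))

def kappa_upper(nu, t):
    return (nu - 2) / ((4 - nu) * t) * ((1 - t)**(-(4 - nu) / (nu - 2)) - 1)

def R(nu, t):
    r = (4 - nu) / (nu - 2)
    return (1 - (1 - t)**r - r * t * (1 - t)**r) / (r * t**2 * (1 - t)**r)

for nu in [2.2, 2.5, 3.0]:
    for t in [0.01 * i for i in range(1, 100)]:
        lhs = kappa_upper(nu, t) - 1
        assert lhs >= 1 - kappa_lower(nu, t) - 1e-12            # fact (i)
        assert abs(lhs - t * R(nu, t)) <= 1e-9 * max(1.0, lhs)  # fact (ii)
print("Lemma 2 bounds verified for sample nu in (2, 3]")
```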
1.4 The proof of Theorem 4: solution existence and uniqueness
Consider a sublevel set \(\mathcal {L}_F(x):=\left\{ y\in \mathrm {dom}(F) \mid F(y)\le F(x)\right\} \) of F in (32). For any \(y\in \mathcal {L}_F(x)\) and \(v\in \partial g(x)\), by (22) and the convexity of g, we have
By the Cauchy-Schwarz inequality, we have
Now, using the assumption \(\nabla ^2{f}(x)\succ 0\) for some \(x\in \mathrm {dom}(F)\), we have \(\sigma _{\min }(x) := \lambda _{\min }(\nabla ^2{f}(x)) > 0\), the smallest eigenvalue of \(\nabla ^2{f}(x)\).
- (a) If \(\nu = 2\), then \(d_2(x,y)=M_f\left\| y-x\right\| _2\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| y-x\right\| _{x}\). This estimate together with (58) implies
$$\begin{aligned} \omega _2\left( -d_2(x, y)\right) d_2(x,y)\le \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\left\| \nabla f(x)+v\right\| _{x}^{*} = \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x). \end{aligned}$$(59)We consider the function \(s_2(t) := \omega _2(-t)t = 1 - \frac{1-e^{-t}}{t}\). Clearly, \(s_2'(t) = \frac{e^t - t - 1}{t^2e^t} > 0\) for all \(t \in \mathbb {R}_{+}\). Hence, \(s_2(t)\) is increasing on \(\mathbb {R}_{+}\). However, \(s_2(t) < 1\) and \(\lim \limits _{t\rightarrow +\,\infty }s_2(t) = 1\). Therefore, if \(\frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) < 1\), then the equation \(s_2(t) - \frac{M_f}{\sqrt{\sigma _{\min }(x)}}\lambda (x) = 0\) has a unique solution \(t^{*} \in (0, +\,\infty )\). In this case, for \(0 \le d_2(x, y) \le t^{*}\), (59) holds. This condition leads to \(M_f\left\| y-x\right\| _2 \le t^{*} <+\,\infty \) which implies that the sublevel set \(\mathcal {L}_F(x)\) is bounded. Consequently, solution \(x^{\star }\) of (32) exists.
- (b) If \(2< \nu < 3\), then
$$\begin{aligned} d_{\nu }(x,y)\le \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| y-x\right\| _{x}. \end{aligned}$$This inequality together with (58) imply
$$\begin{aligned} \omega _{\nu }\left( -d_{\nu }(x, y)\right) d_{\nu }(x,y)\le & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\left\| \nabla f(x)+v\right\| _{x}^{*}\\= & {} \left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x). \end{aligned}$$We consider \(s_{\nu }(t) := \omega _{\nu }(-t)t\). After a few elementary calculations, we can easily check that \(s_{\nu }\) is increasing on \(\mathbb {R}_{+}\) and \(s_{\nu }(t) < \frac{\nu -2}{4-\nu }\) for all \(t > 0\), and \(\lim \limits _{t\rightarrow +\,\infty }s_{\nu }(t) = \frac{\nu -2}{4-\nu }\). Hence, if \(\left( \frac{\nu }{2}-1\right) \frac{M_f}{\sigma _{\min }(x)^{\frac{3-\nu }{2}}}\lambda (x) < \frac{\nu -2}{4-\nu }\), then, similar to Case (a), we can show that solution \(x^{\star }\) of (32) exists. This condition implies that \(\lambda (x) < \frac{2\sigma _{\min }(x)^{\frac{3-\nu }{2}}}{(4-\nu )M_f}\).
- (c) If \(\nu = 3\), then \(d_3(x,y) = \frac{M_f}{2}\left\| y-x\right\| _{x}\). Combining this estimate and (58), we get
$$\begin{aligned} \omega _3\left( -d_3(x, y)\right) d_3(x,y)\le \frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*}. \end{aligned}$$With the same proof as in [40, Theorem 4.1.11], if \(\frac{M_f}{2}\left\| \nabla f(x)+v\right\| _{x}^{*} < 1\) which is equivalent to \(\lambda (x) < \frac{2}{M_f}\), then solution \(x^{\star }\) of (32) exists.
Note that the condition on \(\lambda (x)\) in the three cases (a), (b), and (c) can be unified. The uniqueness of the solution \(x^{\star }\) in these three cases follows from the strict convexity of F. \(\square \)
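The properties of \(s_2(t) = 1 - \frac{1-e^{-t}}{t}\) used in case (a) are easy to confirm numerically (an illustration only): \(s_2\) is increasing on \((0, +\infty )\), bounded above by 1, and tends to 1, so \(s_2(t) = c\) has a unique positive root whenever \(c \in (0, 1)\):

```python
import math

# s_2(t) = 1 - (1 - e^{-t})/t: increasing, < 1, and -> 1 as t -> +inf.
def s2(t):
    return 1.0 - (1.0 - math.exp(-t)) / t

grid = [0.1 * i for i in range(1, 500)]
vals = [s2(t) for t in grid]
assert all(v < 1.0 for v in vals)                  # bounded above by 1
assert all(b > a for a, b in zip(vals, vals[1:]))  # increasing
assert s2(1e4) > 0.999                             # approaches 1

# unique root of s_2(t) = c by bisection (valid by monotonicity)
def root(c, lo=1e-8, hi=1e6):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s2(mid) < c else (lo, mid)
    return 0.5 * (lo + hi)

t_star = root(0.5)
assert abs(s2(t_star) - 0.5) < 1e-9
print("root of s2(t) = 0.5 at t* =", round(t_star, 4))
```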
1.5 The proof of Theorem 2: convergence of the damped-step Newton method
The proof of this theorem is divided into two parts: computing the step-size, and proving the local quadratic convergence.
Computing the step-size \(\tau _k\): From Proposition 10, for any \(x^k,x^{k+1}\in \mathrm {dom}(f)\), if \(d_{\nu }(x^k,x^{k+1}) < 1\), then we have
Now, using (25), we have \(\langle \nabla {f}(x^k), x^{k+1}-x^k\rangle = -\tau _k\left( \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}\right) ^2 = -\tau _k\lambda _k^2\). On the other hand, we have
Using the definition of \(d_{\nu }(\cdot )\) in (12), the two last equalities, and (28), we can easily show that \(d_{\nu }(x^k, x^{k+1}) = \tau _kd_k\). Substituting these relations into the first estimate, we obtain
We consider the following cases:
(a) If \(\nu = 2\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau - \left( \frac{\lambda _k}{d_k}\right) ^2\left( e^{\tau d_k} - \tau d_k - 1\right) \) with \(d_k = \beta _k\). This function attains the maximum at \(\tau _k := \frac{\ln (1 + d_k)}{d_k} = \frac{\ln (1 + \beta _k)}{\beta _k} \in (0, 1)\) with
It is easy to check from the right-most term of the last expression that \(\varDelta _k := \eta _k(\tau _k) > 0\) for \(\tau _k > 0\).
(b) If \(\nu = 3\), then, by (23), we have \(\eta _k(\tau ) := \lambda _k^2\tau + \left( \frac{\lambda _k}{d_k}\right) ^2\left[ \tau d_k + \ln (1 - \tau d_k)\right] \) with \(d_k = 0.5M_f\lambda _k\). We can show that \(\eta _k(\tau )\) achieves the maximum at \(\tau _k = \frac{1}{1 + d_k} = \frac{1}{1 + 0.5M_f\lambda _k}\in (0,1)\) with
We can also easily check that the last term \(\varDelta _k := \eta _k(\tau _k)\) of this expression is positive for \(\lambda _k > 0\).
(c) If \(2< \nu < 3\), then we have \(d_k=M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \lambda _k^{\nu -2}\beta _k^{3-\nu }\). By (23), we have
Our aim is to find \(\tau ^{*} \in (0, 1]\) by solving \(\max _{\tau \in [0, 1]}\eta _k(\tau )\). This problem always has a global solution. First, we compute the first- and the second-order derivatives of \(\eta _k\) as follows:
Let us set \(\eta _k'(\tau _k) = 0\). Then, we get
with
In addition, we can check that \(\eta _k''(\tau _k) < 0\). Hence, the value of \(\tau _k\) above achieves the maximum of \(\eta _k(\cdot )\). Then, we have \(\varDelta _k := \eta _k(\tau _k) > \eta _k(0)=0\).
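The explicit maximizers in cases (a) and (b) can be confirmed numerically for illustrative values of \((\lambda _k, d_k)\) (not tied to any particular iterate): \(\tau _k = \ln (1+d_k)/d_k\) for \(\nu = 2\) and \(\tau _k = 1/(1+d_k)\) for \(\nu = 3\), with a positive decrease \(\varDelta _k = \eta _k(\tau _k)\) in both cases:

```python
import math

# eta_k for nu = 2 and nu = 3, as given in cases (a) and (b) above.
def eta_nu2(tau, lam, d):
    return lam**2 * tau - (lam / d)**2 * (math.exp(tau * d) - tau * d - 1)

def eta_nu3(tau, lam, d):
    return lam**2 * tau + (lam / d)**2 * (tau * d + math.log(1 - tau * d))

for lam, d in [(0.5, 0.3), (1.0, 0.8), (2.0, 1.5)]:
    tau2 = math.log(1 + d) / d       # claimed maximizer for nu = 2
    tau3 = 1.0 / (1 + d)             # claimed maximizer for nu = 3
    assert 0 < tau2 < 1 and 0 < tau3 < 1
    for eps in (1e-4, -1e-4):
        # perturbing the step size can only decrease eta_k
        assert eta_nu2(tau2, lam, d) >= eta_nu2(tau2 + eps, lam, d)
        assert eta_nu3(tau3, lam, d) >= eta_nu3(tau3 + eps, lam, d)
    # the guaranteed decrease Delta_k is positive
    assert eta_nu2(tau2, lam, d) > 0 and eta_nu3(tau3, lam, d) > 0
print("explicit step sizes maximize eta_k with positive decrease")
```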
The proof of local quadratic convergence: Let \(x^{\star }_f\) be the optimal solution of (24). We have
Hence, we can write
Let us define \(T_k := \Big \Vert \nabla ^2{f}(x^k)^{-1}\left[ \nabla {f}(x^{\star }_f) - \nabla {f}(x^k) - \nabla ^2{f}(x^k)(x^{\star }_f - x^k)\right] \Big \Vert _{x^k}\) and consider three cases as follows:
\(\mathrm {(a)}\) For \(\nu = 2\), using Corollary 2, we have \(\left( \frac{1-e^{-\bar{\beta }_k}}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \left( \frac{e^{\bar{\beta }_k}-1}{\bar{\beta }_k}\right) \nabla ^2{f}(x^k)\), where \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\). Using the above inequality, we can show that
Let \(\underline{\sigma }_k := \lambda _{\min }(\nabla ^2{f}(x^k))\). We first derive
where \(K(x^k,x^{\star }_f) :=\int _0^1 \nabla ^2{f}(x^k)^{-1/2}\nabla ^2{f}(x^k + t(x^{\star }_f - x^k))\nabla ^2{f}(x^k)^{-1/2}dt\). Using Corollary 2 and noting that \(\bar{\beta }_k := M_f\Vert x^k - x^{\star }_f\Vert _2\), we can estimate \(\Vert K(x^k,x^{\star }_f)\Vert \le \frac{e^{\bar{\beta }_k} - 1}{\bar{\beta }_k}\). Using the last two estimates, and the definition of \(\beta _k\), we can derive
provided that \(\bar{\beta }_k \le 1\). Since the step-size \(\tau _k = \frac{1}{\beta _k}\ln (1+\beta _k)\), we have \(1 - \tau _k \le \frac{\beta _k}{2} \le \frac{M_fe\Vert x^k - x^{\star }_f\Vert _{x^k}}{2\sqrt{\underline{\sigma }_k}}\). On the other hand, \(\frac{e^{\bar{\beta }_k}-1 - \bar{\beta }_k}{\bar{\beta }_k^2} \le \frac{e}{2}\) for all \(0\le \bar{\beta }_k \le 1\). Substituting \(T_k\) into (60) and using these relations, we have
provided that \(\bar{\beta }_k \le 1\). On the other hand, by Proposition 8, we have \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le e^{\frac{\bar{\beta }_{k+1} + \bar{\beta }_k}{2}}\Vert x^{k+1} - x^{\star }_f\Vert _{x^k}\) and \(\underline{\sigma }_{k+1}^{-1} \le e^{\bar{\beta }_k + \bar{\beta }_{k+1}}\underline{\sigma }_k^{-1}\). In addition, \(\bar{\beta }_k \le \frac{M_f}{\sqrt{\underline{\sigma }_k}}\Vert x^{k} - x^{\star }_f\Vert _{x^k}\). Combining the above inequalities, we finally get
Since \(\beta _k\le 1\) and \(\beta _{k+1} \le 1\), this estimate shows that \(\left\{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\right\} \) quadratically converges to zero. Since \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\sqrt{\underline{\sigma }_k}}\), we can also conclude that \(\left\{ \Vert x^k - x^{\star }_f\Vert _2\right\} \) quadratically converges to zero.
\(\mathrm {(b)}\) For \(\nu = 3\), we can follow [40]. However, for completeness, we give a short proof here. Using Corollary 2, we have \(\left( 1 - r_k + \frac{r_k^2}{3}\right) \nabla ^2{f}(x^k) \preceq \int _0^1\nabla ^2{f}(x^k + t(x^{\star }_f -x^k))dt \preceq \frac{1}{1-r_k}\nabla ^2{f}(x^k)\), where \(r_k := 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Using the above inequality, we can show that
Substituting \(T_k\) into (60) and using \(\tau _k = \frac{1}{1 + 0.5M_f\lambda _k}\), we have
Next, we need to upper bound \(\lambda _k\). Since \(\nabla {f}(x^{\star }_f) = 0\), using Corollary 2, we can bound \(\lambda _k\) as
provided that \(M_f\Vert x^k - x^{\star }_f\Vert _{x^k} < 1\). Upper-bounding the right-hand side of the above inequality using this bound, we get
provided that \(M_f\Vert x^k-x_f^{\star }\Vert _{x^k} < 1\). On the other hand, we can also estimate \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le \frac{\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}}}{1 - 0.5M_f\left( \Vert x^{k+1} - x^{\star }_f\Vert _{x^{k}} + \Vert x^k - x^{\star }_f\Vert _{x^k}\right) }\). Combining the last two inequalities, we get
The right-hand side function \(\psi (t) = \frac{2M_f}{1 - 2M_ft^2 - 0.5M_ft}\) satisfies \(\psi (t) \le 4M_f\) on \(t \in \left[ 0, \frac{1}{2M_f} \right] \). Hence, if \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{1}{2M_f}\), then \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le 4M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2\). This shows that if \(x^0\in \mathrm {dom}(f)\) is chosen such that \(\Vert x^0 - x^{\star }_f\Vert _{x^0} \le \frac{1}{4M_f}\), then \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^k}\right\} \) quadratically converges to zero.
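To see the effect of the contraction \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{k+1}} \le 4M_f\Vert x^k - x^{\star }_f\Vert _{x^k}^2\), one can simulate the worst-case recursion; the normalization \(M_f = 1\) below is an illustrative assumption, not a value from the paper:

```python
# Worst-case simulation of the quadratic contraction in part (b), with the
# illustrative normalization M_f = 1 (any M_f > 0 behaves the same after scaling).
M_f = 1.0
r = 1.0 / (8.0 * M_f)          # start inside the basin r_0 <= 1/(4*M_f)
history = [r]
for _ in range(6):
    r = 4.0 * M_f * r * r      # r_{k+1} <= 4*M_f*r_k^2: correct digits roughly double
    history.append(r)
```

After six steps the residual drops from about `1e-1` to below `1e-15`, the characteristic doubling of correct digits in quadratic convergence.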
\(\mathrm {(c)}\) For \(\nu \in (2, 3)\), with the same argument as in the proof of Theorem 3, we can show that
where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^k}^{\nu -2}\). Using again the argument as in the proof of Theorem 3, we have
Here, \(C_{\nu }(\cdot ,\cdot )\) is a given function derived from \(R_{\nu }\). Under the condition that \(d^k_{\nu }\) and \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) are sufficiently small, we can show that \(C_{\nu }(d^k_{\nu },\Vert x^k - x^{\star }_f\Vert _{x^k}) \le \bar{C}_{\nu }\). Hence, the last inequality shows that \(\Big \{ \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} } \Big \}\) quadratically converges to zero. Since \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k}\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\), we have \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{\Vert x^{k} - x^{\star }_f\Vert _{x^k}}{\underline{\sigma }_k^{\frac{3-\nu }{2}} }\). Hence, we can conclude that \(\left\{ \Vert x^k -x^{\star }_f\Vert _{H_k}\right\} \) also locally converges to zero at a quadratic rate. \(\square \)
1.6 The proof of Theorem 3: the convergence of the full-step Newton method
We divide this proof into two parts: the quadratic convergence of \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\), and the quadratic convergence of \(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\).
The quadratic convergence of\(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\): Since the full-step Newton scheme updates \(x^{k+1} := x^k - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), if we denote \(n_{\mathrm {nt}}^k := x^{k+1} -x^k = - \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\), then the last expression leads to \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\). In addition, \(\Vert n_{\mathrm {nt}}^k\Vert _{x^k} = \Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = \lambda _k\). Using the definition of \(d_{\nu }(\cdot ,\cdot )\) in (12), we denote \(d^k_{\nu } := d_{\nu }(x^k, x^{k+1})\).
First, by \(\nabla {f}(x^k) + \nabla ^2f(x^k)n_{\mathrm {nt}}^k = 0\) and the mean-value theorem, we can show that
Let us define \(G_k := \int _0^1\left[ \nabla ^2{f}(x^k + tn_{\mathrm {nt}}^k) - \nabla ^2{f}(x^k)\right] dt\) and \(H_k := \nabla ^2{f}(x^k)^{-1/2}G_k\nabla ^2{f}(x^k)^{-1/2}\). Then, the above estimate implies \( \nabla {f}(x^{k+1}) = G_kn_{\mathrm {nt}}^k\). Hence, we can show that
By Lemma 2, we can estimate
where \(R_{\nu }\) is defined by (56). Combining the last two inequalities and using Proposition 8, we consider the following cases:
(a) If \(\nu = 2\), then we have \(\lambda _{k+1}^2 \le e^{d_2^k}\left[ \left\| \nabla {f}(x^{k+1})\right\| _{x^k}^{*}\right] ^2\) which implies \(\lambda _{k+1} \le e^{\frac{d_2^k}{2}}R_2(d_2^k)d_2^k\lambda _k\). Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\frac{1}{\underline{\sigma }_{k+1}}\le \frac{e^{d_2^k}}{\underline{\sigma }_k}\). Based on the above inequality, we have
By a numerical calculation, we can easily check that if \(d_2^k < d_2^{\star }\approx 0.12964\), then
Consequently, if \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove
by induction. Under the condition \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} < \frac{d_2^{\star }}{M_f}\), the above inequality shows that the ratio \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) converges to zero at a quadratic rate.
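This quadratic decay of the Newton decrement is easy to observe numerically. The sketch below runs the full-step Newton scheme on a toy ridge-regularized logistic loss (logistic losses are generalized self-concordant with \(\nu = 2\)); the data, the ridge weight `mu`, and all variable names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Full-step Newton x^{k+1} = x^k - Hess^{-1} grad on a toy regularized logistic
# loss; we track the Newton decrement lambda_k = sqrt(grad^T Hess^{-1} grad).
# All problem data below are illustrative, not from the paper.
rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.standard_normal((n, p))
b = np.sign(rng.standard_normal(n))
mu = 1.0                                   # ridge weight; keeps the Hessian well conditioned

def grad_hess(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))  # sigma(-b_i * a_i^T x)
    g = -(A.T @ (b * s)) / n + mu * x      # grad of (1/n) sum log(1+exp(-b_i a_i^T x)) + (mu/2)||x||^2
    w = s * (1.0 - s)
    H = (A * w[:, None]).T @ A / n + mu * np.eye(p)
    return g, H

x = np.zeros(p)
lambdas = []
for _ in range(8):
    g, H = grad_hess(x)
    step = np.linalg.solve(H, g)           # Newton direction (up to sign)
    lambdas.append(float(np.sqrt(max(g @ step, 0.0))))
    x -= step                              # full Newton step
```

For this well-conditioned toy problem the decrement falls to machine precision within a handful of iterations, consistent with the local quadratic rate proved above.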
Now, if \(\nu > 2\), then we consider different cases. Note that
which implies that
Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| d^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\) and \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\). Based on these relations and (61), we can argue as follows:
\(\mathrm {(b)}\) If \(2< \nu < 3\), then \(\lambda _k \ge \left\| d^k\right\| _2\sqrt{\underline{\sigma }_k}\), which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence,
If \(d_{\nu }^k < d_{\nu }^{\star }\), where \(d_{\nu }^{\star }\) is the unique solution to the equation
then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\). Note that it is straightforward to check that this equation always admits a positive solution. Hence, if we choose \(x^0\in \mathrm {dom}(f)\) such that \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 < \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:
In addition, the above inequality also shows that \(\left\{ \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\right\} \) quadratically converges to zero.
\(\mathrm {(c)}\) If \(\nu = 3\), then \(d_3^k= \frac{M_f}{2}\lambda _k\), and
Directly checking the right-hand side of the above estimate, one can show that if \(d_3^k < d_3^{\star }=0.5\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{1}{2M_f}\), then we can prove the following two inequalities together by induction:
Moreover, the first inequality above also shows that \(\left\{ \lambda _k\right\} \) converges to zero at a quadratic rate.
The quadratic convergence of\(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\): First, using Proposition 9 with \(x:= x^k\) and \(y= x^{\star }_f\), and noting that \(\nabla {f}(x^{\star }_f) = 0\), we have
where the last inequality follows from the Cauchy-Schwarz inequality. Hence, we obtain
We consider three cases:
(1) When \(\nu = 2\), we have \(\bar{\omega }_{\nu }(\tau ) = \frac{e^\tau -1}{\tau }\). Hence, \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1 - e^{-d_{\nu }(x^k, x^{\star }_f)}}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{d_{\nu }(x^k, x^{\star }_f)}{2} \ge \frac{1}{2}\) whenever \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Using this inequality in (62), we have \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\Vert \nabla {f}(x^k)\Vert _{x^k}^{*} = 2\lambda _k\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). On the other hand, by the definition of \(\underline{\sigma }_k\), we have \(\sqrt{\underline{\sigma }_k}\Vert x^k - x^{\star }_f\Vert _2 \le \Vert x^k - x^{\star }_f\Vert _{x^k}\). Combining the two last inequalities, we obtain \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{2\lambda _k}{\sqrt{\underline{\sigma }_k}}\) provided that \(d_{\nu }(x^k, x^{\star }_f) \le 1\). Since \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \) locally converges to zero at a quadratic rate, the last relation shows that \(\big \{\Vert x^k - x^{\star }_f\Vert _2\big \}\) also locally converges to zero at a quadratic rate.
(2) For \(\nu = 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \frac{1}{1 + d_{\nu }(x^k, x^{\star }_f)}\) and \(d_{\nu }(x^k, x^{\star }_f) = \frac{M_f}{2}\Vert x^k - x^{\star }_f\Vert _{x^k}\). Hence, from (62), we obtain \(\frac{\Vert x^k - x^{\star }_f\Vert _{x^k} }{1 + 0.5M_f\Vert x^k - x^{\star }_f\Vert _{x^k} } \le \lambda _k\). This implies \(\Vert x^k - x^{\star }_f\Vert _{x^k} \le \frac{\lambda _k}{1 - 0.5M_f\lambda _k}\) as long as \(0.5M_f\lambda _k < 1\). Clearly, since \(\lambda _k\) locally converges to zero at a quadratic rate, \(\Vert x^k - x^{\star }_f\Vert _{x^k}\) also locally converges to zero at a quadratic rate.
(3) For \(2< \nu < 3\), we have \(\bar{\omega }_{\nu }(-d_{\nu }(x^k, x^{\star }_f)) = \left( \frac{\nu -2}{\nu -4}\right) \frac{\left( 1 + d_{\nu }(x^k, x^{\star }_f) \right) ^{\frac{\nu -4}{\nu -2}} - 1}{d_{\nu }(x^k, x^{\star }_f)} \ge 1 - \frac{1}{\nu -2}d_{\nu }(x^k, x^{\star }_f) \ge \frac{1}{2}\) provided that \(d_{\nu }(x^k, x^{\star }_f) < \frac{\nu }{2}-1\). Similar to the case \(\nu = 2\), we have \(\underline{\sigma }_k^{\frac{3-\nu }{2}}\Vert x^k -x^{\star }_f\Vert _{H_k} \le \Vert x^k - x^{\star }_f\Vert _{x^k} \le 2\lambda _k\), where \(H_k := \nabla ^2{f}(x^k)^{\frac{\nu -2}{2}}\). Hence, \(\Vert x^k -x^{\star }_f\Vert _{H_k} \le \frac{2\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\). Since \(\big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\big \}\) locally converges to zero at a quadratic rate, \(\big \{\Vert x^k -x^{\star }_f\Vert _{H_k} \big \}\) also locally converges to zero at a quadratic rate. \(\square \)
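The elementary lower bounds on \(\bar{\omega }_{\nu }\) used in cases (1) and (3) can be verified numerically. A sanity check outside the proof (grid sampling only, with illustrative tolerances):

```python
import numpy as np

# Check the bounds used above: for nu = 2, (1 - e^{-d})/d >= 1 - d/2 on (0, 1];
# for 2 < nu < 3, ((nu-2)/(nu-4))*((1+d)^((nu-4)/(nu-2)) - 1)/d >= 1 - d/(nu-2)
# whenever 0 < d < nu/2 - 1. Both margins should be nonnegative.
d = np.linspace(1e-6, 1.0, 1000)
margin_nu2 = np.min(-np.expm1(-d) / d - (1.0 - d / 2.0))

margin_nu = np.inf
for nu in np.linspace(2.05, 2.95, 19):
    dd = np.linspace(1e-6, nu / 2.0 - 1.0 - 1e-6, 500)
    lhs = ((nu - 2.0) / (nu - 4.0)) * ((1.0 + dd) ** ((nu - 4.0) / (nu - 2.0)) - 1.0) / dd
    margin_nu = min(margin_nu, np.min(lhs - (1.0 - dd / (nu - 2.0))))
```

Both margins vanish only as \(d \rightarrow 0^{+}\), where the two sides agree to first order, so the tolerances below allow for floating-point cancellation near zero.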
1.7 The proof of Theorem 5: convergence of the damped-step PN method
Given \(H\in \mathcal {S}^p_{++}\) and a proper, closed, and convex function \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\), we define
If \(H= \nabla ^2{f}(x)\) is the Hessian mapping of a strictly convex function f, then we can also write \(\mathcal {P}_{\nabla ^2 f(x)}(u)\) shortly as \(\mathcal {P}_{x}(u)\) for our notational convenience. The following lemma, whose proof can be found in [62], will be used in the sequel.
Lemma 3
Let \(g : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\) be a proper, closed, and convex function, and \(H\in \mathcal {S}^p_{++}\). Then, the mapping \(\mathcal {P}_{H}^g\) defined above is non-expansive with respect to the weighted norm defined by \(H\), i.e., for any \(u,v\in \mathbb {R}^p\), we have
Let us define
for any vectors \(x,u\in \mathrm {dom}(f)\) and \(v\in \mathbb {R}^p\). We now prove Theorem 5 in the main text.
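Since the display defining \(\mathcal {P}_{H}^g\) does not survive in this excerpt, the following sketch assumes the standard scaled proximal mapping \(\mathcal {P}_H^g(u) := \arg \min _x \{ g(x) + \tfrac{1}{2}\Vert x - u\Vert _H^2\}\) and checks the Lemma 3-style non-expansiveness in the \(H\)-norm numerically for \(g = \lambda \Vert \cdot \Vert _1\) and diagonal \(H\), where the mapping reduces to coordinate-wise soft-thresholding; all names and data are illustrative:

```python
import numpy as np

# Scaled prox of g = lam*||.||_1 under an assumed standard definition
# P_H^g(u) = argmin_x { g(x) + 0.5*||x - u||_H^2 }; for diagonal H this is
# coordinate-wise soft-thresholding with per-coordinate threshold lam / H_ii.
def prox_scaled_l1(u, h, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam / h, 0.0)

def norm_H(x, h):
    return float(np.sqrt(np.sum(h * x * x)))

rng = np.random.default_rng(1)
h = rng.uniform(0.5, 2.0, size=10)      # diagonal of H (positive definite)
lam = 0.3
violations = 0
for _ in range(200):
    u, v = rng.standard_normal(10), rng.standard_normal(10)
    pu, pv = prox_scaled_l1(u, h, lam), prox_scaled_l1(v, h, lam)
    if norm_H(pu - pv, h) > norm_H(u - v, h) + 1e-12:
        violations += 1
```

Non-expansiveness holds here because the scalar soft-thresholding map is 1-Lipschitz in each coordinate.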
The proof of Theorem 5
Computing the step-size\(\tau _k\): Since \(z^k\) satisfies the optimality condition (36), we have
Using Proposition 10 we obtain
Since \(x^{k+1}=(1-\tau _k)x^k+\tau _kz^k\), using this relation and the convexity of g, we have
Summing up the last two inequalities, we obtain the following estimate
With the same argument as in the proof of Theorem 2, we obtain the conclusion of Theorem 5.
The proof of local quadratic convergence: We consider the distance between \(x^{k+1}\) and \(x^{\star }\) measured by \(\Vert x^{k+1}-x^{\star }\Vert _{x^{\star }}\). By the definition of \(x^{k+1}\), we have
Using the new notations in (64), it follows from the optimality condition (33) and (36) that \(z^k = \mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^k)+e_{x^{\star }}(x^k,z^k))\) and \(x^{\star }=\mathcal {P}^g_{x^{\star }}(S_{x^{\star }}(x^{\star }))\). By Lemma 3 and the triangle inequality, we can show that
By following the same argument as in [62], if we apply Lemma 2, then we can derive
where \(R_{\nu }(\cdot )\) is defined by (56).
Next, using the same argument as in the proof of (72) in Theorem 6 below, we can bound the second term \(\Vert e_{x^{\star }}(x^k,z^k) \Vert ^{*}_{x^{\star }}\) of (66) as
Combining this inequality, (66), (67), and the triangle inequality, we obtain
where \(\hat{R}_{\nu }\) and \(\tilde{R}_{\nu }\) are defined as
respectively. After a few simple calculations, one can show that there exists a constant \(c_{\nu } \in (0, +\,\infty )\) such that if \(t\in [0,\bar{d}_{\nu }]\), then both \(\hat{R}_{\nu }(t)\) and \(\tilde{R}_{\nu }(t)\in [0,c_{\nu }]\) (taking the limit as \(t \rightarrow 0^{+}\)), where \(\bar{d}_2:=\frac{3}{5}\) and \(\bar{d}_{\nu }:= 1-\left( \frac{2}{3}\right) ^{\frac{\nu -2}{2}}\) for \(\nu > 2\), respectively. Using this bound, (65), (68), and the fact that \(\tau _k \le 1\), we can bound
Let \(\underline{\sigma }^{\star } := \sigma _{\min }(\nabla ^2 f(x^{\star }))\) be the smallest eigenvalue of \(\nabla ^2 f(x^{\star })\). We consider the following cases:
(a) If \(\nu =2\), then, for \(0 \le d_{\nu }(x^{\star }, x^k) \le \bar{d}_{\nu }\), we can bound \(1-\tau _k\) as
On the other hand, we have \(d_{\nu }(x^{\star },x^k)=M_f\Vert x^k - x^{\star } \Vert _2 \le \frac{M_f}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k-x^{\star }\Vert _{x^{\star }}\). Substituting these estimates into (69), we get
Let \(c^{\star }_{\nu } := \frac{3c_{\nu }M_f}{2\sqrt{\underline{\sigma }^{\star }}}\). The last estimate shows that if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ \frac{ \bar{d}_{\nu }\sqrt{\underline{\sigma }^{\star }}}{M_f}, \frac{1}{c^{\star }_{\nu }}\right\} \), then \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.
(b) If \(2 < \nu \le 3\), then we first show that
Hence, if \(\Vert x^k-x^{\star }\Vert _{x^{\star }}\le m_{\nu }\bar{d}_{\nu }\), where \(m_{\nu } := \tfrac{2}{\nu -2}\frac{\left( \underline{\sigma }^{\star }\right) ^{\frac{3-\nu }{2}}}{M_f}\), then \(d_{\nu }(x^{\star },x^k)\le \bar{d}_{\nu }\). Next, using the definition of \(d_k\) in (28), we can bound it as
Using this estimate, we can bound \(1-\tau _k\) as follows:
where \(n_{\nu } := \frac{(4 -\nu )M_f}{2(1-\bar{d}_{\nu })(\underline{\sigma }^{\star })^{\frac{3-\nu }{2}}}c_{\nu } > 0\). Substituting this estimate into (69) and noting that \(d_{\nu }(x^{\star }, x^k) \le \frac{1}{m_{\nu }}\Vert x^k - x^{\star }\Vert _{x^{\star }}\), we get
Hence, if \(\Vert x^0 - x^{\star }\Vert _{x^{\star }} \le \min \left\{ m_{\nu }\bar{d}_{\nu }, \frac{1}{c^{\star }_{\nu }}\right\} \), then the last estimate shows that the sequence \(\left\{ \Vert x^k - x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero.
In summary, there exists a neighborhood \(\mathcal {N}(x^{\star })\) of \(x^{\star }\), such that if \(x^0\in \mathcal {N}(x^{\star })\cap \mathrm {dom}(F)\), then the whole sequence \(\left\{ \Vert x^k-x^{\star }\Vert _{x^{\star }}\right\} \) quadratically converges to zero. \(\square \)
1.8 The proof of Theorem 6: locally quadratic convergence of the PN method
Since \(z^k\) is the optimal solution to (35), which satisfies (36), we have \(\nabla ^2 f(x^k)x^k-\nabla f(x^k)\in (\nabla ^2 f(x^k) + \partial g)(z^k)\). Using this optimality condition, we get
Let us define \(\tilde{\lambda }_{k+1}:=\Vert n_{\mathrm {pnt}}^{k+1}\Vert _{x^k}\). Then, by Lemma 3 and the triangle inequality, we have
Let us first bound the term \(\left\| S_{x^k}(x^{k+1})-S_{x^k}(x^k)\right\| _{x^k}^{*}\) as follows:
where \(R_{\nu }(t)\) is defined by (56). Indeed, from the mean-value theorem, we have
where \(H\) is defined by (54). Combining the above inequality and (56) in Lemma 2, we get (71).
Next we bound the term \(\left\| e_{x^k}(x^{k+1},z^{k+1})\right\| _{x^k}^{*}\) as follows:
Note that
where
By Proposition 8, we have
This inequality can be simplified as
Hence, the inequality (72) holds.
Now, we combine (70), (71), and (72). If \(\nu = 2\) and \(d_2^k < \ln 2\), then we get
By Proposition 8, we have \(\lambda _{k+1}^2\le e^{d_{\nu }^k}\tilde{\lambda }_{k+1}^2\). Combining this estimate and the last inequality, we get
Note that \(\lambda _k \ge \frac{\sqrt{\underline{\sigma }_k}d_2^k}{M_f}\) and \(\underline{\sigma }_{k+1}^{-1}\le e^{d_2^k}\underline{\sigma }_k^{-1}\). It follows from (74) that
By a numerical calculation, we can check that if \(d_2^k \le d_2^{\star }\approx 0.35482\), then
Hence, if we choose \(x^0\in \mathrm {dom}(F)\) such that \(\frac{\lambda _0}{\sqrt{\underline{\sigma }_0}} \le \frac{1}{M_f}\min \left\{ d_2^{\star },0.5\right\} = \frac{d_2^{\star }}{M_f}\), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_2^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the local quadratic convergence of the sequence \(\left\{ \frac{\lambda _k}{\sqrt{\underline{\sigma }_k}}\right\} \).
Now, if \(\nu > 2\) and \(d_{\nu }^k < 1- \left( {\frac{1}{2}}\right) ^{\frac{\nu -2}{2}}\), then
By Proposition 8, we have \(\lambda _{k+1}^2 \le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\tilde{\lambda }_{k+1}^2\). Hence, combining these inequalities, we get
Note that \(d_{\nu }^k=\left( \frac{\nu }{2}-1\right) M_f\left\| p^k\right\| _2^{3-\nu }\lambda _k^{\nu -2}\), \(\underline{\sigma }_{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\underline{\sigma }_k^{-1}\) and \(\sigma _{k+1}^{-1}\le (1-d_{\nu }^k)^{\frac{-2}{\nu -2}}\sigma _k^{-1}\). Using these relations and (75), we consider two cases:
(a) If \(\nu = 3\), then \(d_3^k = \frac{M_f}{2}\lambda _k\), and
By a simple numerical calculation, we can show that if \(d_3^k \le d_3^{\star }\approx 0.20943\), then \(\lambda _{k+1}\le 2M_f\lambda _k^2\). Hence, if \(\lambda _0 < \frac{1}{M_f}\min \left\{ 2d_3^{\star },0.5\right\} = \frac{2}{M_f}d_3^{\star }\), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_3^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the quadratic convergence of the sequence \(\left\{ \lambda _k\right\} \).
(b) If \(2< \nu < 3\), then \(\lambda _k \ge \Vert p^k\Vert _2\sqrt{\underline{\sigma }_k}\) which implies that \(d_{\nu }^k\le \left( \frac{\nu }{2}-1\right) M_f\underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\). Hence, we have
If \(d_{\nu }^k < d_{\nu }^{\star }\), then \(\underline{\sigma }_{k+1}^{-\frac{3-\nu }{2}}\lambda _{k+1}\le 2M_f\left( \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k \right) ^2\), where \(d_{\nu }^{\star }\) is the unique solution to the equation
Note that it is straightforward to check that this equation always admits a positive solution. Therefore, if \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 \le \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then we can prove the following two inequalities together by induction:
These inequalities show that \(\left\{ d_{\nu }^k\right\} \) and \(\left\{ \lambda _k\right\} \) are nonincreasing, and they establish the quadratic convergence of the sequence \(\Big \{\frac{\lambda _k}{\underline{\sigma }_k^{\frac{3-\nu }{2}}}\Big \}\).
Finally, to prove the local quadratic convergence of \(\left\{ x^k\right\} \) to \(x^{\star }\), we use the same argument as in the proofs of Theorem 3 and Theorem 5; we omit the details here. \(\square \)
1.9 The proof of Theorem 7: convergence of the quasi-Newton method
The full-step quasi-Newton method for solving (24) can be written as \(x^{k+1} = x^k - B_k\nabla {f}(x^k)\). With \(H_k := B_k^{-1}\), this is equivalent to \(H_k(x^{k+1} - x^k) + \nabla {f}(x^k) = 0\). Using this relation and \(\nabla {f}(x^{\star }_f) = 0\), we can write
We first consider \(T_k := \Vert \nabla ^2{f}(x^{\star }_f)^{-1}\left[ \nabla {f}(x^k) - \nabla {f}(x^{\star }_f) - \nabla ^2{f}(x^{\star }_f)(x^k - x^{\star }_f) \right] \Vert _{x^{\star }_f}\). Similar to the proof of Theorem 3, we can show that
where \(R_{\nu }\) is defined by (56) and \(d_{\nu }^k := M_f^{\nu -2}\left( \frac{\nu }{2} - 1\right) \Vert x^k-x^{\star }_f\Vert _2^{3-\nu }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}^{\nu -2}\). Moreover, we note that
Combining this estimate, (76), and (77), we can derive
First, we prove statement (a). Indeed, from the Dennis–Moré condition (41), we have
where \(\lim _{k\rightarrow \infty }\gamma _k = 0\). Substituting this estimate into (78), and noting that \(\Vert x^k - x^{\star }_f\Vert _2 \le \frac{1}{\sqrt{\underline{\sigma }^{\star }}}\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\), where \(\underline{\sigma }^{\star } := \lambda _{\min }(\nabla ^2{f}(x^{\star }_f)) > 0\), we can show that
provided that \(\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\) and \(R_{\nu }^{\star } := \max \left\{ R_{\nu }(d_{\nu }^k) \mid \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\right\} < +\,\infty \). Here, \(\bar{r} > 0\) is a given value such that \(R_{\nu }^{\star }\) is finite. The estimate (79) shows that if \(\bar{r}\) is sufficiently small, \(\left\{ \Vert x^k - x^{\star }_f\Vert _{x^{\star }_f}\right\} \) superlinearly converges to zero. Finally, the statement (b) is proved similarly by combining statement (a) and [62, Theorem 11]. \(\square \)
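The superlinear mechanism behind (79) can be seen on a scalar model: if the errors obey \(e_{k+1} \le (\gamma _k + c\,e_k)e_k\) with \(\gamma _k \rightarrow 0\), then the ratios \(e_{k+1}/e_k\) tend to zero. A toy simulation, with all constants illustrative rather than taken from the paper:

```python
# Scalar model of the recursion behind (79): e_{k+1} <= (gamma_k + c*e_k)*e_k
# with gamma_k -> 0 forces e_{k+1}/e_k -> 0, i.e. superlinear convergence.
# The constants below are illustrative, not from the paper.
c = 1.0
e = 0.1
errs = [e]
for k in range(10):
    gamma = 1.0 / (k + 2) ** 2          # any nonnegative sequence tending to zero works
    e = (gamma + c * e) * e
    errs.append(e)
ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
```

Here each ratio equals exactly \(\gamma _k + c\,e_k\), so the ratios decrease monotonically toward zero, which is the defining property of superlinear convergence.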
Sun, T., Tran-Dinh, Q. Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2019). https://doi.org/10.1007/s10107-018-1282-4
Keywords
- Generalized self-concordance
- Newton-type method
- Proximal Newton method
- Quadratic convergence
- Local and global convergence
- Convex optimization