Superfast Second-Order Methods for Unconstrained Convex Optimization

In this paper, we present new second-order methods with convergence rate $O\left(k^{-4}\right)$, where $k$ is the iteration counter. This is faster than the existing lower bound for this type of schemes (Agarwal and Hazan in Proceedings of the 31st Conference on Learning Theory, PMLR, pp. 774–792, 2018; Arjevani and Shiff in Math Program 178(1–2):327–360, 2019), which is $O\left(k^{-7/2}\right)$. Our progress can be explained by a finer specification of the problem class. The main idea of this approach consists in the implementation of the third-order scheme from Nesterov (Math Program 186:157–183, 2021) using the second-order oracle. At each iteration of our method, we solve a nontrivial auxiliary problem by a linearly convergent scheme based on the relative non-degeneracy condition (Bauschke et al. in Math Oper Res 42:330–348, 2016; Lu et al. in SIOPT 28(1):333–354, 2018).
During this process, the Hessian of the objective function is computed once, and the gradient is computed $O\left(\ln \frac{1}{\epsilon}\right)$ times, where $\epsilon$ is the desired accuracy of the solution of our problem.


Introduction
Communicated by Anil Aswani.

Yurii Nesterov (Yurii.Nesterov@uclouvain.be), Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), Louvain-la-Neuve, Belgium.

In the last years, the theory of high-order methods in convex optimization has been developed seemingly up to its natural limits. After discovering the simple fact that the auxiliary problem in tensor methods can be posed as a problem of minimizing a convex multivariate polynomial [15], the performance of these methods was very soon increased up to the maximal limits [6,7,9], given by the theoretical lower complexity bounds [1,2].
It is interesting that the first accelerated tensor methods were analyzed in the unpublished paper [3], where the author did not express any hope for their practical implementation in the future. In [3] and [15], it was shown that the pth-order methods can be accelerated up to the level $O\left(k^{-(p+1)}\right)$, where $k$ is the iteration counter. The main advantage of the theory in [15] is that it corresponds to the methods with convex polynomial subproblems.
However, the fastest tensor methods [6,7,9] are based on the trick discovered in [11] for the second-order methods. It allows one to increase the rate of convergence of tensor methods up to the level $O\left(k^{-(3p+1)/2}\right)$, which matches the lower complexity bounds for functions with Lipschitz-continuous pth derivative. Thus, for example, the best possible rate of convergence of the second-order methods on the corresponding problem class is of the order $O\left(k^{-7/2}\right)$.
Unfortunately, this advanced technique requires finding at each iteration a root of a univariate nonlinear non-monotone equation defined by inverse Hessians of the objective function. Hence, from the practical point of view, the methods proposed in [15] remain the most attractive.
The developments of this paper are based on one simple observation. In [15], it was shown that the accelerated tensor method of degree three with the rate of convergence $O\left(k^{-4}\right)$ can be implemented by using at each iteration a simple gradient method based on the relative non-degeneracy condition [4,10]. This auxiliary method has to minimize an augmented Taylor polynomial of degree three, computed at the current test point $x \in \mathbb{R}^n$. At each iteration of this linearly convergent scheme, we need to compute the gradient of the auxiliary objective function in $h$. The only non-trivial part of this gradient comes from the gradient of the third derivative. This is the vector $D^3 f(x)[h]^2$. It is the only place where we need the third-order information. However, it is well known that
$$D^3 f(x)[h]^2 \;=\; \lim_{\tau \to 0} \frac{1}{\tau^2}\left[\nabla f(x + \tau h) + \nabla f(x - \tau h) - 2\nabla f(x)\right].$$
In other words, the vector $D^3 f(x)[h]^2$ can be approximated with any accuracy by the first-order information. This means that we have a chance to implement the third-order method with the convergence rate $O\left(k^{-4}\right)$ using only the second-order information. So, formally, our method will be of order two. However, it will have a rate of convergence which is higher than the formal lower bound $O\left(k^{-7/2}\right)$ for the second-order schemes. Of course, the reason for this is that it will work with the problem class initially reserved for the third-order methods. However, interestingly enough, our method will demonstrate on this class the same rate of convergence as the third-order schemes.
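This limit can be checked numerically. The sketch below is our illustration, not part of the paper: the test function $f(x) = x^4$ and the step size $\tau$ are arbitrary choices, and for a quartic the central difference of gradients recovers $f'''(x)\,h^2$ up to rounding only.

```python
def grad(x):
    # Gradient (derivative) of the test function f(x) = x**4.
    return 4.0 * x**3

def third_derivative_along(x, h, tau=1e-3):
    """Approximate D^3 f(x)[h]^2 = f'''(x) * h**2 from three gradient calls:
    D^3 f(x)[h]^2 = lim_{tau -> 0} (f'(x + tau h) + f'(x - tau h) - 2 f'(x)) / tau^2."""
    return (grad(x + tau * h) + grad(x - tau * h) - 2.0 * grad(x)) / tau**2

x, h = 1.5, 0.7
exact = 24.0 * x * h**2          # f'''(x) = 24 x for f(x) = x^4
approx = third_derivative_along(x, h)
print(abs(approx - exact) < 1e-6)
```

For a general smooth $f$, the approximation error behaves like $O(\tau^2)$, which is why the step $\tau$ must be balanced against rounding in the later sections.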
In order to transform this observation into rigorous statements, we need to introduce into the constructions of Section 5 in [15] some modifications related to the inexactness of the available information. This is the subject of the remaining sections of this paper.
Contents. The paper is organized as follows: In Sect. 2, we introduce a convenient definition of the acceptable neighborhood of the exact tensor step. It differs from the previous ones (e.g., [5,8,13]) since its verification requires calling the oracle of the main objective function. However, we will see that it significantly simplifies the overall complexity analysis. We prove that every point from this neighborhood ensures a good decrease of the objective function, which is sufficient for implementing the Basic Tensor Method and its accelerated version without spoiling their rates of convergence.
In Sect. 3, we analyze the rate of convergence of the gradient method based on the relative smoothness condition [4,10], under the assumption that the gradient of the objective function is computed with a small absolute error. We need this analysis for replacing the exact value of the third derivative along two vectors by a finite difference of the gradients. We show that the perturbed method converges linearly to a small neighborhood of the exact solution.
In Sect. 4, we put all our results together in order to justify a second-order implementation of the accelerated third-order tensor method. The rate of convergence of the resulting algorithm is of the order $O\left(k^{-4}\right)$, where $k$ is the iteration counter. At each iteration, we compute the Hessian once, and the gradient is computed $O\left(\ln \frac{1}{\epsilon}\right)$ times, where $\epsilon$ is the desired accuracy of the solution of the main problem. Recall that this rate of convergence is impossible for the second-order schemes working with the functions with Lipschitz-continuous third derivative (see [1,2]). However, our problem class is smaller (see Lemma 4.1).
In Sect. 5, we show how to ensure boundedness of the constants, essential for our minimization schemes. Finally, we conclude the paper with Sect. 6, containing a discussion of our results and directions for future research.

Notation and generalities.
In what follows, we denote by $E$ a finite-dimensional real vector space and by $E^*$ its dual space composed of linear functions on $E$. For such a function $s \in E^*$, we denote by $\langle s, x \rangle$ its value at $x \in E$.
If it is not mentioned explicitly, we measure distances in $E$ and $E^*$ in a Euclidean norm. For that, using a self-adjoint positive-definite operator $B : E \to E^*$ (notation $B = B^* \succ 0$), we define
$$\|x\| = \langle Bx, x \rangle^{1/2}, \quad x \in E, \qquad \|s\|_* = \langle s, B^{-1} s \rangle^{1/2}, \quad s \in E^*.$$
In the formulas involving products of linear operators, it will be convenient to treat $x \in E$ as a linear operator from $\mathbb{R}$ to $E$, and $x^*$ as a linear operator from $E^*$ to $\mathbb{R}$. In this case, $x x^*$ is a linear operator from $E^*$ to $E$, acting as follows:
$$(x x^*) s = \langle s, x \rangle\, x, \quad s \in E^*.$$
For a smooth function $f : \mathrm{dom}\, f \to \mathbb{R}$ with convex and open domain $\mathrm{dom}\, f \subseteq E$, denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at a point $x \in \mathrm{dom}\, f$. In our analysis, we use the Bregman divergence of function $f(\cdot)$, defined as follows:
$$\beta_f(x, y) = f(y) - f(x) - \langle \nabla f(x), y - x \rangle.$$
We often work with directional derivatives. For $p \ge 1$, denote by $D^p f(x)[h_1, \dots, h_p]$ the pth directional derivative of $f$ at $x$ along directions $h_i \in E$, $i = 1, \dots, p$; it is a symmetric $p$-linear form. Its norm is defined as follows:
$$\|D^p f(x)\| = \max_{h_1, \dots, h_p} \left\{ \left| D^p f(x)[h_1, \dots, h_p] \right| \,:\, \|h_i\| \le 1,\; i = 1, \dots, p \right\}.$$
In terms of our previous notation, for any $x \in \mathrm{dom}\, f$ and $h_1, h_2 \in E$,
$$D f(x)[h_1] = \langle \nabla f(x), h_1 \rangle, \qquad D^2 f(x)[h_1, h_2] = \langle \nabla^2 f(x) h_1, h_2 \rangle.$$
For the Hessian, this gives the spectral norm of a self-adjoint linear operator (the maximal module of all eigenvalues computed with respect to operator $B$). If all directions $h_1, \dots, h_p$ are the same, we apply the notation $D^p f(x)[h]^p$, $h \in E$. Then, the Taylor approximation of function $f(\cdot)$ at $x \in \mathrm{dom}\, f$ can be written as
$$\Phi_{x,p}(y) = f(x) + \sum_{i=1}^{p} \frac{1}{i!}\, D^i f(x)[y - x]^i, \quad y \in E.$$
Note that, in general, we have (see, for example, Appendix 1 in [16])
$$\|D^p f(x)\| = \max_{h} \left\{ \left| D^p f(x)[h]^p \right| \,:\, \|h\| \le 1 \right\}.$$
Similarly, since for $x, y \in \mathrm{dom}\, f$ being fixed, the form $D^p f(x)[\cdot, \dots, \cdot] - D^p f(y)[\cdot, \dots, \cdot]$ is $p$-linear and symmetric, we also have
$$\|D^p f(x) - D^p f(y)\| = \max_{h} \left\{ \left| D^p f(x)[h]^p - D^p f(y)[h]^p \right| \,:\, \|h\| \le 1 \right\}.$$
In this paper, we consider functions from the problem classes $\mathcal{F}_p$, which are convex and $p$ times differentiable on $E$. Denote by $L_p$ the uniform bound for the Lipschitz constant of their pth derivative:
$$\|D^p f(x) - D^p f(y)\| \le L_p \|x - y\|, \quad x, y \in \mathrm{dom}\, f.$$
If an ambiguity can arise, we use the notation $L_p(f)$. Sometimes it is more convenient to work with uniform bounds on the derivatives:
$$M_p(f) = \sup_{x \in \mathrm{dom}\, f} \|D^p f(x)\|.$$
If both values are well defined, we suppose that $L_p(f) \le M_{p+1}(f)$. Let $F(\cdot)$ be a sufficiently smooth vector function, $F : \mathrm{dom}\, F \to E_2$.
Then, by the well-known Taylor formula, we have
$$F(y) = F(x) + \int_0^1 F'\big(x + \tau(y - x)\big)(y - x)\, d\tau.$$
Hence, we can bound the following residual:
$$\|F(y) - F(x) - F'(x)(y - x)\| \le \frac{L}{2}\, \|y - x\|^2,$$
where $L$ is the Lipschitz constant of $F'(\cdot)$. By the same reason, for functions $\nabla f(\cdot)$ and $\nabla^2 f(\cdot)$, we get
$$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \le \frac{L_2}{2}\|y - x\|^2, \qquad \|\nabla^2 f(y) - \nabla^2 f(x) - D^3 f(x)[y - x]\| \le \frac{L_3}{2}\|y - x\|^2,$$
which are valid for all $x, y \in \mathrm{dom}\, f$. Finally, for simplifying long expressions, we often use the trivial inequality
$$(a + b)^p \le 2^{p-1}\left(a^p + b^p\right),$$
which is valid for all $a, b \ge 0$ and $p \ge 1$.
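A quick numeric spot-check (ours, not from the paper) of the inequality $(a+b)^p \le 2^{p-1}(a^p + b^p)$ for $a, b \ge 0$ and $p \ge 1$, which follows from convexity of $t \mapsto t^p$ on the nonnegative half-line:

```python
import random

random.seed(0)
# Spot-check (a + b)**p <= 2**(p - 1) * (a**p + b**p) on random nonnegative
# inputs; a tiny slack absorbs floating-point rounding.
for _ in range(1000):
    a = random.uniform(0.0, 10.0)
    b = random.uniform(0.0, 10.0)
    p = random.uniform(1.0, 5.0)
    assert (a + b)**p <= 2**(p - 1) * (a**p + b**p) + 1e-9
print("ok")
```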

Tensor Methods with Inexact Iteration
Consider the following unconstrained optimization problem:
$$\min_{x \in E} f(x), \qquad (12)$$
where $f(\cdot)$ is a convex function with Lipschitz-continuous pth derivative:
$$\|D^p f(x) - D^p f(y)\| \le L_p \|x - y\|, \quad x, y \in E.$$
In this section, we work only with Euclidean norms.
We are going to solve problem (12) by tensor methods. Their performance crucially depends on the ability to achieve a significant improvement of the objective function at the current test point.

Definition 2.1
We say that point $T \in E$ ensures a pth-order improvement of some point $x \in E$ with factor $c > 0$ if it satisfies the following inequality:
$$f(x) - f(T) \ge c\, \|\nabla f(T)\|_*^{\frac{p+1}{p}}.$$
This terminology has the following justification. Consider the augmented Taylor polynomial of degree $p \ge 1$:
$$\Omega_{x,p,H}(y) = f(x) + \sum_{i=1}^{p} \frac{1}{i!}\, D^i f(x)[y - x]^i + \frac{H}{(p+1)!}\, \|y - x\|^{p+1}, \quad y \in E.$$
By (8), for $H \ge L_p$, this function gives us an upper estimate for the objective. Moreover, for $H \ge p L_p$ this function is convex (see Theorem 1 in [15]). We are going to generate a new test point $T$ as a close approximation to the minimum of function $\Omega_{x,p,H}(\cdot)$. Namely, we are interested in points from the following nested neighborhoods:
$$N^{\gamma}_{p,H}(x) = \left\{ T \in E \,:\, \|\nabla \Omega_{x,p,H}(T)\|_* \le \gamma\, \|\nabla f(T)\|_* \right\},$$
where $\gamma \in [0, 1)$ is an accuracy parameter. The smallest set $N^{0}_{p,H}(x)$ contains only the exact minimizers of the augmented Taylor polynomial. Note that $N^{0}_{p,H}(x) \subseteq N^{\gamma}_{p,H}(x)$ for any $\gamma \in [0, 1)$. These neighborhoods are important for the following reason.
Theorem 2.1 Let $x \in E$ and parameters $\gamma$, $H$ satisfy the following condition: Then, any point $T \in N^{\gamma}_{p,H}(x)$ ensures a pth-order improvement of $x$ with factor Consequently, we have Therefore, In other words, Note that So by convexity of $\kappa(\cdot)$ and $r \ge r_*$, we have $\kappa(r) \ge \kappa(r_*)$. Therefore, Inequality (18) is valid since our function is convex: We have proved that the pth-order improvement at point $x \in E$ can be ensured by inexact minimizers of the augmented Taylor polynomials of degree $p \ge 1$. Let us present the efficiency estimates for the corresponding methods. From now on, let us assume that the constant $L_p$ is known. For the sake of notation, we fix the following values of the parameters: Then, we can use a shorter notation for the following objects: As a consequence of all these specifications, we have the following result.

Corollary 2.1 For any $x \in E$, all points from the neighborhood $N_p(x)$ ensure the pth-order improvement of $x$ with factor $c_p$.
Let us start from the simplest Inexact Basic Tensor Method:
Initialization: choose $x_0 \in E$. Iteration $k \ge 0$: compute $x_{k+1} \in N_p(x_k)$. $\qquad (21)$
Theorem 2.2 Let the sequence $\{x_k\}_{k \ge 0}$ be generated by method (21). Then, for any $k \ge 1$ we have Consequently, Hence, in view of Lemma 11 in [13], we have This is exactly the estimate (22).
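As an illustration (ours, not from the paper), for $p = 1$ the augmented Taylor polynomial is a simple quadratic, its exact minimizer is a gradient step with step size $1/H$, and the basic scheme $x_{k+1} \in N_p(x_k)$ with $\gamma = 0$ reduces to gradient descent:

```python
def basic_tensor_method_p1(grad, x0, H, n_iters=100):
    """Basic tensor method sketched for p = 1: each iteration exactly minimizes
    the augmented Taylor polynomial
        Omega(y) = f(x) + grad(x) * (y - x) + (H / 2) * (y - x)**2,
    whose minimizer is the gradient step y = x - grad(x) / H.  Here x is a
    float and H should upper-bound the Lipschitz constant of the gradient."""
    x = x0
    for _ in range(n_iters):
        x = x - grad(x) / H
    return x

# Example: f(x) = (x - 3)^2 with gradient 2 (x - 3) and L_1 = 2.
g = lambda x: 2.0 * (x - 3.0)
x_end = basic_tensor_method_p1(g, x0=0.0, H=2.0, n_iters=50)
print(abs(x_end - 3.0) < 1e-8)
```

For higher degrees $p$, the subproblem is no longer quadratic, which is exactly why the inexact neighborhoods $N^{\gamma}_{p,H}$ and the auxiliary schemes of Sect. 3 are needed.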
Let us present a convergence analysis for the Inexact Accelerated Tensor Method. We need to choose the degree of the method and define the prox-function This is a uniformly convex function of degree $p + 1$: for all $x, y \in E$ we have (see [14]). Define the sequence Therefore, the elements of sequence $\{A_k\}_{k \ge 0}$ satisfy the following inequality: Inexact pth-Order Accelerated Tensor Method (ATMI$_p$) Initialization. Choose $x_0 \in E$. Define coefficients $A_k$ by (24) and First of all, note that by induction it is easy to see that In particular, for $\psi^*_k$ Let us prove by induction the following relation: For $k = 0$, we have $\psi^*_0 = 0$ and $A_0 = 0$. Hence, (29) is valid. Assume it is valid for some $k \ge 0$. Then, Note that Finally, since $x_{k+1} \in N_p(y_k)$, by Corollary 2.1 we get Putting all these inequalities together, we obtain Thus, we have proved the following theorem.
Proof Indeed, in view of relations (27) and (29), we have

Relative Non-degeneracy and Approximate Gradients
In this section, we measure distances in E by general norms. Consider the following composite minimization problem: where the convex function ϕ(·) is differentiable, and ψ(·) is a simple closed convex function. The most important example of function ψ(·) is an indicator function for a closed convex set. Denote by x * one of the optimal solutions of problem (31), and let F * = F(x * ).
Let $\phi(\cdot)$ be non-degenerate with respect to some scaling function $d(\cdot)$:
$$\mu_d(\phi)\, \beta_d(x, y) \;\le\; \beta_\phi(x, y) \;\le\; L_d(\phi)\, \beta_d(x, y), \quad x, y \in \mathrm{dom}\, \psi. \qquad (32)$$
Denote by $\gamma_d(\phi) = \frac{\mu_d(\phi)}{L_d(\phi)} \le 1$ the condition number of function $\phi(\cdot)$ with respect to the scaling function $d(\cdot)$. Sometimes it is more convenient to work with the second-order variant of the condition (32):
$$\mu_d(\phi)\, \nabla^2 d(x) \;\preceq\; \nabla^2 \phi(x) \;\preceq\; L_d(\phi)\, \nabla^2 d(x), \quad x \in \mathrm{dom}\, \psi. \qquad (33)$$
We are going to solve problem (31) using an approximate gradient of the smooth part of the objective function. Namely, at each point $x \in E$ we use a vector $g_\phi(x)$ such that
$$\|g_\phi(x) - \nabla \phi(x)\|_* \le \delta, \qquad (34)$$
where $\delta \ge 0$ is an accuracy parameter. Our first goal is to describe the influence of parameter $\delta$ on the quality of the computed approximate solutions to problem (31). For this, we need to assume that function $d(\cdot)$ is uniformly convex of degree $p + 1$ with $p \ge 1$:
$$\beta_d(x, y) \ge \frac{\sigma_{p+1}(d)}{p+1}\, \|x - y\|^{p+1}, \quad x, y \in \mathrm{dom}\, \psi. \qquad (35)$$
Consider the following Bregman Distance Gradient Method (BDGM), working with inexact information.
Choose $x_0 \in E$. For $k \ge 0$ iterate: Lemma 3.1 Let the approximate gradient $g_\phi(x_k)$ satisfy condition (34). Then, for any $x \in E$ and $k \ge 0$ we have Proof The first-order optimality condition defining $x_{k+1}$ is as follows: for all $x \in \mathrm{dom}\, \psi$. Therefore, denoting $r_k(x) = \beta_d(x_k, x)$, we have Hence, Since $\|x\| = \|-x\|$ for all $x$ in $E$, the minimum in $x$ of the expression in brackets is attained at some $\bar{x}_k = (1 - \alpha)x_{k+1} + \alpha x$ with $\alpha \in (0, 1)$. On the other hand, the minimum of the function Thus, Applying inequality (37) with $x = x_*$ recursively to all $k = 0, \dots, T - 1$, we get the following relation: where $\gamma = \frac{1}{4}\gamma_d(\phi)$, and $S_T = F(x_k)$, we get the following bound: Note that lim In our main application, presented in Sect. 4, we need to generate points with small norm of the gradient. In order to achieve this goal with method (36), we need one more assumption on the scaling function $d(\cdot)$.
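A minimal runnable sketch of a Bregman-distance gradient step with an inexact gradient. The simplifications here are ours, not the paper's: one dimension, the Euclidean scaling function $d(x) = x^2/2$ and $\psi \equiv 0$, in which case each Bregman step has the closed form of a gradient step.

```python
import random

def bdgm_euclidean(grad, x0, L, n_iters=200, delta=1e-6, seed=0):
    """Bregman Distance Gradient Method specialized to d(x) = x**2 / 2 and
    psi = 0.  Each step minimizes
        g(x_k) * (x - x_k) + L * beta_d(x_k, x),
    where beta_d(x_k, x) = (x - x_k)**2 / 2, giving x_{k+1} = x_k - g(x_k)/L.
    The gradient oracle is perturbed by noise of size <= delta, modeling
    the accuracy condition (34)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_iters):
        g = grad(x) + rng.uniform(-delta, delta)   # inexact gradient
        x = x - g / L
    return x

# Example: minimize phi(x) = (x - 1)^2; iterates settle in a neighborhood
# of x* = 1 whose radius is controlled by delta.
x_end = bdgm_euclidean(lambda x: 2.0 * (x - 1.0), x0=10.0, L=2.0)
print(abs(x_end - 1.0) < 1e-5)
```

Consistent with the discussion above, the method converges linearly only to a $\delta$-dependent neighborhood of the exact minimizer; the non-Euclidean scaling function actually used in Sect. 4 makes the subproblem nontrivial but keeps this qualitative behavior.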
From now on, we consider the unconstrained minimization problems. This means that in (31) we have ψ(x) = 0 for all x ∈ E.

Definition 3.1 We call the scaling function $d(\cdot)$ norm-dominated on a set $S \subseteq E$ by a function $\theta(\cdot)$ if
$$\beta_d(x, y) \le \theta(\|y - x\|)$$
for all $x \in S$ and $y \in E$.
Note that the statement of Lemma 3.2 can be extended to all convex polynomial scaling functions.
Norm-dominated scaling functions are important in view of the following.

Lemma 3.3 Let scaling function $d(\cdot)$ be norm-dominated on the level set
$$\mathcal{L}_\phi(\bar{x}) = \{ x \in E \,:\, \phi(x) \le \phi(\bar{x}) \}$$
by some function $\theta(\cdot)$. Then, for any $x \in \mathcal{L}_\phi(\bar{x})$ we have: Proof Indeed, for any $x \in \mathcal{L}_\phi(\bar{x})$ and $y \in E$ we have Therefore, Thus, for norm-dominated scaling functions, the rate of convergence in function value can be transformed into the rate of decrease of the norm of the gradient of function $\phi(\cdot)$. This feature is very important for practical implementations of the Inexact Tensor Methods presented in Sect. 2. In the next section, we discuss in detail how this works for inexact third-order methods.

Second-Order Implementations of the Third-Order Methods
In this section, we are going to solve the unconstrained minimization problem
$$\min_{x \in E} f(x), \qquad (45)$$
where the objective function is convex and smooth, using second-order implementations of the third-order methods. For pure second-order methods, the standard assumption on the objective function in (45) is Lipschitz continuity of the second derivative (see, for example, [12,17]). We are going to replace it by a stronger assumption, using the following fact.
Proof Let $x \in \mathrm{dom}\, f$. Then, for any direction $h \in E$ and $\tau > 0$ small enough, we have Minimizing this inequality in $\tau > 0$ and taking the supremum of the result in $h \in E$, we get (46). Thus, from now on, we assume that Assumption $M_2(f) < +\infty$ is not really necessary. We will discuss different variants of its replacement in Sect. 5. In our situation, we can apply to (45) the third-order tensor method ATMI$_3$ (see (26)). At each iteration of this method, we need to minimize the augmented third-order Taylor polynomial $\Omega_{x,3,H}(\cdot)$. As it was shown in [15], this can be done by an auxiliary scheme based on the relative smoothness condition. This approach is based on the following matrix inequality (see Lemma 3 in [15]): which is valid for all $x \in \mathrm{dom}\, f$, $h \in E$ and $\xi > 0$. As compared with [15], our situation is more complicated. Firstly, we are not going to use the exact minimum of function $\Omega_{x,3,H}(\cdot)$. And secondly, we are going to minimize this function using its approximate gradients.
Let us start from the discussion of the second issue. Let us fix a parameter $\tau > 0$ and, for all $x, y \in E$, consider the following vector functions, the finite-difference approximations of the third derivative along the direction $[x - y]^2$; the variant employed below is the central difference
$$g^{\tau}_{y}(x) = \frac{1}{\tau^2}\left[\nabla f\big(y + \tau(x - y)\big) + \nabla f\big(y - \tau(x - y)\big) - 2\nabla f(y)\right] \;\approx\; D^3 f(y)[x - y]^2.$$

Lemma 4.2
For any $x, y \in E$, we have Proof Denote $h = \tau(x - y)$. Then, by the Taylor formula we have Applying a uniform upper bound for the fourth derivative to the right-hand side of this representation, we get inequality (49). Further, Adding these two representations, we get and we obtain inequality (50). If the fourth derivative is Lipschitz continuous, then and this is inequality (51).
In this paper, we usually employ the approximation $g^{\tau}_{y}(\cdot)$. Note that its evaluation requires only the gradients $\nabla f(y + \tau h)$, $\nabla f(y - \tau h)$ and $\nabla f(y)$, where $h = x - y$. Thus, we can easily compute approximate gradients of function $\Omega_{y,3,H}(\cdot)$ using the first-order information on function $f(\cdot)$. Let us show that this can help us to minimize the augmented Taylor polynomial of degree three by the machinery presented in Sect. 3. At each iteration $k$ of ATMI$_3$, we need to find a point $x_{k+1} \in N_3(y_k)$. For the sake of notation, let us assume that $y_k = 0$. We need to find a point $x_+ \in N_3(0)$ by minimizing the function Thus, our auxiliary problem is as follows: Therefore, Now it is clear that in our case a good scaling function is as follows: Indeed, applying the relations (56) with $\xi = \sqrt{2}$, we get Thus, we can take and obtain for function $\phi_k(\cdot)$ a condition number bounded by a constant: The second condition for applicability of method (36) is the uniform convexity of the Bregman distance. In our case, this is true since Thus, in terms of inequality (35), we have $\sigma_4(\rho_k) = \frac{1}{4}L_3$. This property is important for bounding the size of the set
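To illustrate how the pieces fit together, here is a one-dimensional sketch (ours, with hypothetical names) of assembling an approximate gradient of the augmented third-order model from one Hessian value and three gradient calls; the constant $H/6$ comes from differentiating the regularization term $(H/24)\,h^4$.

```python
def model_grad(grad_f, hess_f, y, h, H, tau=1e-3):
    """Approximate gradient in h of the augmented third-order Taylor model
        Omega(h) = f(y) + f'(y) h + (1/2) f''(y) h^2
                   + (1/6) f'''(y) h^3 + (H/24) h^4,
    using one Hessian value and gradients only: the term f'''(y) h^2 is
    replaced by the central difference
        (grad_f(y + tau h) + grad_f(y - tau h) - 2 grad_f(y)) / tau^2."""
    d3_h2 = (grad_f(y + tau * h) + grad_f(y - tau * h)
             - 2.0 * grad_f(y)) / tau**2
    return grad_f(y) + hess_f(y) * h + 0.5 * d3_h2 + (H / 6.0) * h**3

# Check on f(x) = x^4: f' = 4x^3, f'' = 12x^2, f''' = 24x.
gf = lambda x: 4.0 * x**3
hf = lambda x: 12.0 * x**2
y, h, H = 1.0, 0.5, 10.0
exact = gf(y) + hf(y) * h + 0.5 * 24.0 * y * h**2 + (H / 6.0) * h**3
print(abs(model_grad(gf, hf, y, h, H) - exact) < 1e-6)
```

Within one inner loop the point $y$ is fixed, so $\nabla f(y)$ and the Hessian value are computed once, and each new trial direction $h$ costs only the two shifted gradients, matching the oracle count discussed later.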

Lemma 4.3
For any x ∈ L k , we have Proof Indeed, Consequently, we have the following bound: Further, for x ∈ L k , we have Thus, x ≤ 16 The third condition is the possibility of approximating the gradient of function ϕ k (·). In our case, in view of Lemma 4.2, we can take In this case, Thus, in order to ensure condition (34) and keep τ separated from zero (this is necessary for stability of the process), we need to guarantee the boundedness of the minimizing sequence for function ϕ k (·). However, since we know an explicit upper bound (60) on the size of the optimal point, it is possible to ensure this by introducing an additional constraint on the size of variables. Let us replace the problem (53) by the following one: In view of Lemma 4.3, the optimal solutions of problems (53) and (64) coincide. Consider a variant of method (36) with ψ ≡ 0 and accuracy δ > 0.
Note that the auxiliary problem in this method has now an additional ball constraint (64). However, this does not increase significantly its complexity since the Euclidean norm is already present in the objective function.
Let us mention the main properties of this minimization process. First of all, since all points $x_i$ belong to $S_k$, for all $i \ge 0$ we have This means, in particular, that the stopping criterion at Step 2 of method (65) is correct: if it is satisfied, then which implies $x_i \in N_3(0)$. Moreover, we can apply Lemma 3.1 to the following objects: Therefore, in our case, inequality (37) with $p = 3$ can be rewritten as In view of (57), where $L_1$ is any upper estimate for the value $\|\nabla^2 f(0)\|$. From this bound, we have a natural limit for the number of iterations of method (65), sufficient for obtaining the following inequality: where $\bar{x}_T = \arg\min$ Indeed, for this it is enough to have Hence, we have the following bound: (71) However, the upper-level method ATMI$_3$ needs a point with small gradient: In order to derive this bound from inequality (70) with an appropriate value of $\delta_+$, we use the fact that our scaling function $\rho_k(\cdot)$ is norm-dominated. Indeed, in view of Lemma 3.2 and representation (57), this function is norm-dominated on any Euclidean ball $B_r$ by the following function: Hence, in view of Lemma 4.3, our scaling function $\rho_k(\cdot)$ is norm-dominated on the set $L_k$ by $\theta_{\bar{r}_k}(\cdot)$ with $\bar{r}_k$ Thus, in order to apply Lemma 3.3, we need to estimate from above the inverse of its conjugate function.

Lemma 4.4
For any $r > 0$, we have Proof Consider the primal function $\theta(\tau) = \frac{a\tau^2}{2} + \frac{b\tau^4}{4}$ with $a, b \ge 0$. Then, its conjugate function is defined as follows:
$$\theta^*(\lambda) = \sup_{\tau \ge 0} \left\{ \lambda \tau - \theta(\tau) \right\}.$$
We need to find $\lambda \ge 0$ from the equation $\xi = \theta^*(\lambda)$. Note that the optimal solution $\tau = \tau(\lambda)$ in the above maximization problem can be found from the equation
$$\lambda = a\tau + b\tau^3.$$
Therefore, Thus, we can write down $\tau(\lambda)$ as a function of $\xi$: Hence, It remains to use the actual values $a = L_1 + 5L_3 r^2$ and $b = 2L_3$.
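The stationarity relation behind this computation can be verified numerically. In the sketch below (ours, with illustrative values of a, b, λ), we recover $\theta^*(\lambda)$ both from the equation $\lambda = a\tau + b\tau^3$ and by direct grid maximization:

```python
def theta(tau, a, b):
    # Primal function theta(tau) = a tau^2 / 2 + b tau^4 / 4.
    return a * tau**2 / 2.0 + b * tau**4 / 4.0

def tau_of_lambda(lam, a, b, lo=0.0, hi=100.0):
    # Solve lam = a tau + b tau^3 by bisection; for a, b >= 0 the
    # left-hand side is increasing in tau, so the root is unique.
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if a * mid + b * mid**3 < lam:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

a, b, lam = 2.0, 0.5, 3.0
t = tau_of_lambda(lam, a, b)
conj_stationary = lam * t - theta(t, a, b)          # theta*(lam) via stationarity
conj_grid = max(lam * (k * 1e-4) - theta(k * 1e-4, a, b)
                for k in range(200000))             # brute-force check on [0, 20)
print(abs(conj_stationary - conj_grid) < 1e-4)
```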
Now we can write down the condition for our parameter $\delta$, which ensures the desired inequality (72). Indeed, in view of inequalities (70) and (44), after $T_k(\delta)$ inner steps (see (71)) we can guarantee that where $L$ is the corresponding constant. In order to stop method (65) at this moment, we need to guarantee that the norm of the approximate gradient is small enough. Hence, our condition for parameter $\delta$ can be derived from the following reasoning. Since where $g > 0$ is a lower bound for the norm of the gradients of the objective function during the whole minimization process. Recall that Hence, this inequality can be rewritten in the following form: $2(1 + 24^{3/4})\delta + 6L\delta^{2/3}$ $2L_1 L_3^{1/3}$ Using the upper integer bounds on the coefficients, it can be strengthened: where we take $L_1 = \|\nabla^2 f(0)\|$ since this corresponds to the actual role of this constant in the complexity analysis of method (65). This means that, in accordance with (78), we need to choose Note that all coefficients in condition (78) are known (provided that we have a good estimate for the Lipschitz constant $L_3$). Thus, we have where $G$ and $H$ are the uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main process. Validity of the assumption on finiteness of these bounds is discussed in Sect. 5. Let us write down our inexact algorithmic schemes (21) and (26), employing the inner procedure (65). These methods have only one parameter $\delta > 0$, which must be chosen in accordance with (78). They also need the constant $L_3$.
We start from the variant of Inexact Basic Tensor Method (21).
Inexact 3rd-Order Tensor Method. Initialization. Given $\delta > 0$, choose $x_0 \in E$. Iteration $k \ge 0$. Compute $x_{k+1} \in N_3(x_k)$ by method (65) with the following settings: At each iteration of this method, we have $O\left(\ln \frac{G+H}{g}\right)$ iterations of the inner scheme. Each of them needs three calls of the oracle of the main objective function (two for computing the approximate gradient of function $\phi_k(\cdot)$ and one for verifying the stopping criterion). In view of Theorem 2.2, the rate of convergence of the main process is $O\left(k^{-3}\right)$.

Bounds for the Derivatives
The complexity analysis in Sect. 4 is valid only if we can guarantee the finiteness of the constants G and H . The simplest way of doing this consists in considering the following class of functions: This is a nontrivial class, but it is quite restrictive. In this section, we show that it is possible to derive the finiteness of G and H from our main assumption (47) and the properties of the minimization schemes.
Indeed, we can easily bound derivatives at test points from a bounded set. Let us present a trivial result, which follows from Taylor formula (7).

Lemma 5.1 For any x
We can use the right-hand sides of inequalities (87) as our constants G and H provided that the distance between x 0 and the test points does not exceed some D < +∞. Note that we do not use D, G, and H in our methods. They appear only in the bounds for the number of inner steps and stay inside the logarithm. The important criterion (78), defining an appropriate value of the parameter δ > 0, is based on the available information about the first and second derivatives at the current test point.
Thus, we need to prove that the sequences of test points in our methods are bounded. Let us start from Inexact Basic Tensor Method (80). For this method, the situation is very simple. We have already assumed that the size of the level set R(x 0 ) is finite. Since the method (80) is monotone, for any x k generated by this scheme, we have Thus, we can take in (87) D = 2R(x 0 ).
Let us look now at the Inexact Accelerated Tensor Method. Actually, for proving the boundedness of the sequence of test points $\{y_k\}_{k \ge 0}$, it is better to consider its monotone variant. The additional Step 4 of this method ensures monotonicity of the sequence $\{f(x_k)\}_{k \ge 0}$.

Discussion

In this way, we attach the 1st-order methods to functions with Lipschitz-continuous gradients. The 2nd-order methods correspond to the functions with Lipschitz-continuous Hessian, etc.
This picture allows us to speak about optimal methods. For example, we say that the Fast Gradient Methods (FGM) with the convergence rate $O\left(k^{-2}\right)$ are the optimal 1st-order methods. However, the only reason why FGM could be called optimal is that they attain the lower bound for a certain problem class, which is considered to be the natural field of application for the 1st-order methods only. Now it is clear that the above over-simplified picture of the world must be replaced by something more elaborate. We have seen that there exist problem classes for which the 2nd- and the 3rd-order methods demonstrate the same rate of convergence. So, the correct classification of problem classes and optimization methods must be at least two-parametric. This is, of course, an interesting topic for further research.
Another interesting question is related to the 1st-order schemes. Indeed, if we managed to accelerate the 2nd-order methods above their "natural" complexity limits, maybe there exists a similar possibility for the 1st-order schemes? In our opinion, the answer is negative. Indeed, the lower complexity bounds for the 1st-order methods are supported by a worst-case quadratic function. Quadratic functions already have zero high-order derivatives. Therefore, any assumptions on the high-order derivatives cannot eliminate this bad function from the problem class. For the 2nd-order methods, the worst-case function has a discontinuous third derivative (see, for example, Section 4.3.1 in [14]). Therefore, assumptions on the fourth derivative can help.