Implementable tensor methods in unconstrained convex optimization

In this paper we develop new tensor methods for unconstrained convex optimization, which solve at each iteration an auxiliary problem of minimizing a convex multivariate polynomial. We analyze the simplest scheme, based on minimization of a regularized local model of the objective function, and its accelerated version obtained in the framework of estimating sequences. Their rates of convergence are compared with the worst-case lower complexity bounds for the corresponding problem classes. Finally, for the third-order methods, we suggest an efficient technique for solving the auxiliary problem, which is based on the recently developed relative smoothness condition (Bauschke et al. in Math Oper Res 42:330–348, 2017; Lu et al. in SIOPT 28(1):333–354, 2018). With this elaboration, the third-order methods become implementable and very fast. The rate of convergence in terms of the function value for the accelerated third-order scheme reaches the level $O\left(\frac{1}{k^4}\right)$, where $k$ is the number of iterations. This is very close to the lower bound of the order $O\left(\frac{1}{k^5}\right)$, which is also justified in this paper. At the same time, in many important cases the computational cost of one iteration of this method remains on the level typical for second-order methods.

For practical implementations, the most important results are included in Sect. 5, where we discuss an efficient scheme for minimizing the regularized Taylor approximation of degree three. This auxiliary convex problem can be treated in the framework of the relative smoothness condition. The first element of this approach was introduced in [4] as a generalization of the Lipschitz condition on the norm of the gradient. In [21] it was shown that the same extension can be applied to the condition of strong convexity. This second step is important since it leads to linearly convergent methods for functions with nonstandard growth properties. The auxiliary problem with the third-order tensor is a good application of this technique. We show that the corresponding method converges linearly, with a rate depending on an absolute constant. At the end of the section, we argue that the complexity of one iteration of the resulting third-order scheme is often of the same order as that of second-order methods.
In the final Sect. 6, we discuss the presented results and mention open problems.

Notations and generalities
In what follows, we denote by E a finite-dimensional real vector space, and by E* its dual space, composed of all linear functions on E. For such a function s ∈ E*, we denote by ⟨s, x⟩ its value at x ∈ E. Using a self-adjoint positive-definite operator B : E → E* (notation B = B* ≻ 0), we can endow these spaces with conjugate Euclidean norms ‖x‖ = ⟨Bx, x⟩^{1/2}, x ∈ E, and ‖s‖_* = ⟨s, B^{-1}s⟩^{1/2}, s ∈ E*. Sometimes, in formulas involving products of linear operators, it is convenient to treat x ∈ E as a linear operator from R to E, and x* as a linear operator from E* to R. In this case, xx* is a linear operator from E* to E, acting as follows: (xx*)s = ⟨s, x⟩x, s ∈ E*. For a smooth function f : dom f → R with convex and open domain dom f ⊆ E, we denote by ∇f(x) its gradient, and by ∇²f(x) its Hessian evaluated at a point x ∈ dom f ⊆ E. In what follows, we often work with directional derivatives. For p ≥ 1, we denote by D^p f(x)[h_1, ..., h_p] the pth directional derivative of f at x along the directions h_1, ..., h_p ∈ E, and by ‖D^p f(x)‖ its norm, the maximum of |D^p f(x)[h_1, ..., h_p]| over all h_i with ‖h_i‖ ≤ 1. For example, for any x ∈ dom f and h_1, h_2 ∈ E, we have Df(x)[h_1] = ⟨∇f(x), h_1⟩ and D²f(x)[h_1, h_2] = ⟨∇²f(x)h_1, h_2⟩. Thus, for the Hessian, our definition corresponds to the spectral norm of a self-adjoint linear operator (the maximal modulus of its eigenvalues, computed with respect to the operator B).
If all the directions h_1, ..., h_p are the same vector h, we use the notation D^p f(x)[h]^p. Then the Taylor approximation of the function f(·) at x ∈ dom f can be written as follows: Φ_{x,p}(y) = f(x) + Σ_{i=1}^p (1/i!) D^i f(x)[y − x]^i, y ∈ E. Note that, in general, the norm ‖D^p f(x)‖ is defined via arbitrary directions h_1, ..., h_p; however, since the form D^p f(x)[·, ..., ·] is p-linear and symmetric, we also have ‖D^p f(x)‖ = max{|D^p f(x)[h]^p| : ‖h‖ ≤ 1} (see, for example, Appendix 1 in [30]). In this paper, we consider functions from the problem classes F_p, which are convex and p times differentiable on E. We denote by L_p the uniform bound for the Lipschitz constant of their pth derivative: ‖D^p f(x) − D^p f(y)‖ ≤ L_p ‖x − y‖, x, y ∈ dom f. Sometimes, if an ambiguity can arise, we use the notation L_p(f).
Assuming that f ∈ F_p and L_p < +∞, by standard integration arguments we can bound the residual between the function value and its Taylor approximation. If p ≥ 2, then applying the same reasoning to the functions ⟨∇f(·), h⟩ and ⟨∇²f(·)h, h⟩, with the direction h ∈ E being fixed, we get analogous guarantees for the gradient and the Hessian, valid for all x, y ∈ dom f.
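For the reader's convenience, the residual bounds implied by the Lipschitz condition on the pth derivative can be written out explicitly (a standard reconstruction):

```latex
\begin{align*}
|f(y) - \Phi_{x,p}(y)| &\le \frac{L_p}{(p+1)!}\,\|y-x\|^{p+1},\\
\|\nabla f(y) - \nabla \Phi_{x,p}(y)\|_* &\le \frac{L_p}{p!}\,\|y-x\|^{p},\\
\|\nabla^2 f(y) - \nabla^2 \Phi_{x,p}(y)\| &\le \frac{L_p}{(p-1)!}\,\|y-x\|^{p-1}.
\end{align*}
```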

Convex tensor approximations
In our methods, we use the following power prox function: d_{p+1}(x) = (1/(p+1)) ‖x‖^{p+1}. Note that ∇d_{p+1}(x) = ‖x‖^{p−1} Bx. All results of this paper are based on the following observation. From now on, for the sake of simplicity, we assume that dom f = E.
Theorem 1 Let f ∈ F_p with p ≥ 2 and L_p < +∞. Then for any x, y ∈ E we have the matrix inequality (2.3). Moreover, for any M ≥ L_p and any x ∈ E, the function Ω_{x,p,M}(·) is convex and majorizes f, that is, f(y) ≤ Ω_{x,p,M}(y) for all y ∈ E (inequality (2.5)). Proof Let us fix arbitrary x and y from dom f. Then for any direction h ∈ E we have ..., and this is (2.3). Further, ... Thus, the function Ω_{x,p,M}(·) is convex. Finally, ... The statements of Theorem 1 explain our interest in the following point: T_{p,M}(x) ∈ Argmin_{y ∈ E} Ω_{x,p,M}(y), with M ≥ L_p. We are going to use such points for solving the problem min_{x ∈ E} f(x), starting from some point x_0 ∈ E, where f ∈ F_p and L_p < +∞. We assume that there exists at least one solution x* of this problem and that the level sets of f are bounded. In this case, in view of (2.5), the level sets of the function Ω_{x,p,M}(·) are also bounded. Therefore, the point T = T_{p,M}(x) is well defined. It satisfies the first-order optimality condition ∇Ω_{x,p,M}(T) = 0. Multiplying this equality by T − x, we get ...

Lemma 1
For any x ∈ E and M ≥ L_p we have the inequalities (2.11)–(2.13). The first two inequalities of this lemma are already known (see, for example, [8]). We provide them with a simple proof for the reader's convenience.
and this is inequality (2.11). Further, we have (2.14). Therefore, we get the following bound: ..., which leads to (2.12). Finally, ... Squaring both sides of this bound, we get ..., and this is (2.13).

Corollary 1
For any x ∈ E and M ≥ L_p, we have ... Proof Indeed, in view of inequality (2.13), we have the following bound: ... It remains to substitute into this bound the values of our parameters a, b, and α.
Let us now estimate the rate of convergence of the following process: x_{t+1} = T_{p,M}(x_t), t ≥ 0 (2.16), where M ≥ L_p. Thus, in view of Theorem 1, the point x_{t+1} is a solution of the auxiliary convex problem (2.6).
Theorem 2 Let the sequence {x_t}_{t≥0} be generated by method (2.16) as applied to problem (2.7). Then for all t ≥ 1 the estimate (2.17) holds. Proof In view of inequality (2.5), method (2.16) is monotone; hence, for all t ≥ 0 we have f(x_{t+1}) ≤ f(x_t). Let us prove the first inequality in (2.17). First of all, let us estimate the difference f(x_1) − f*: ..., and this is (2.17) for t = 1. Further, for any t ≥ 1, inequality (2.11) gives f(x_{t+1}) ≤ min_α {...}. The minimum of the above objective in α ≥ 0 is achieved for ... Thus, we conclude that ... Denoting δ_t = f(x_t) − f*, we get the following estimate: ... Thus, for μ_t = C_p δ_t, the recursive inequality is as follows: ... Then ... This means that ... Therefore, ..., and this is (2.17).
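To make the iteration concrete, here is a minimal numerical sketch of process (2.16) for p = 2, where T_{2,M}(x) minimizes the cubic-regularized quadratic model. The inner minimizer exploits the radial structure of the model: for a fixed step length r = ‖h‖, the optimum is h(r) = −(∇²f(x) + (Mr/2)I)^{-1}∇f(x), and r is pinned down by bisection on ‖h(r)‖ = r. The test problem and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cubic_step(g, H, M):
    """One step T_{2,M}(x): minimize <g,h> + 1/2 h^T H h + (M/6)||h||^3.
    For fixed r = ||h||, the minimizer is h(r) = -(H + (M r / 2) I)^{-1} g;
    ||h(r)|| - r is strictly decreasing in r, so bisection finds the root."""
    n = len(g)
    lo, hi = 0.0, np.linalg.norm(np.linalg.solve(H, g))  # ||h(0)|| bounds r
    for _ in range(100):
        r = 0.5 * (lo + hi)
        h = -np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
        if np.linalg.norm(h) > r:
            lo = r
        else:
            hi = r
    return h

# Illustrative test problem: convex quadratic f(x) = 1/2 x^T Q x - b^T x,
# for which L_2 = 0, so any M > 0 is admissible in (2.16).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -2.0])
x = np.zeros(2)
for _ in range(50):
    x = x + cubic_step(Q @ x - b, Q, M=1.0)   # x_{t+1} = T_{2,M}(x_t)
print(np.linalg.norm(x - np.linalg.solve(Q, b)))  # distance to the minimizer
```

On a quadratic the scheme behaves like a damped Newton method, so the iterates approach the exact minimizer Q^{-1}b rapidly.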

Accelerated tensor methods
In order to accelerate method (2.16), we apply a variant of the estimating sequences technique, which has become a standard tool for accelerating the usual Gradient and Second-Order Methods (see, for example, [25–27]). In our situation, this idea can be applied to tensor methods in the following way. For solving problem (2.7), we choose a constant M ≥ L_p and recursively update the following sequences.
• Sequence of estimating functions ψ_k(x) = ℓ_k(x) + C d_{p+1}(x − x_0), where ℓ_k(x) are linear functions of x ∈ E, and C is a positive parameter.
• Sequence of scaling parameters {A_k}_{k≥1}: A_{k+1} = A_k + a_k. For these objects, we are going to maintain the following relations (3.2): R¹_k and R²_k. Let us ensure that relations (3.2) hold for k = 1. We choose ... On the other hand, in view of definition (3.1), we get ..., and R²_1 follows. Assume now that relations (3.2) hold for some k ≥ 1. Denote ... Let us choose some a_k > 0 and M ≥ L_p. Define ... In view of R²_k and the convexity of f, for any x ∈ E we have ..., and this is R²_{k+1}. Let us now show that, for appropriate choices of a_k, C, and M, relation R¹_{k+1} is also valid.
Indeed, in view of R¹_k and Lemma 4 in [27], for any x ∈ E we have ... Further, if we choose M ≥ L_p, then by inequality (2.15) we have ... Hence, our choice of parameters must ensure the following inequality for all x ∈ E: ... Minimizing this expression in x ∈ E, we come to the following condition: ... Substituting into this inequality the expressions for the corresponding constants, we obtain ... After cancellations, we get ... For the sake of notation, let us choose ... Then inequality (3.6) becomes simpler: ... Let us choose some α > 0 and define (3.9): ... Note that for any k ≥ 0, using a trivial inequality, one can check that inequality (3.8) is satisfied for all k ≥ 0. Now we are ready to write down the accelerated tensor method (3.12). Define ... Iteration k (k ≥ 1): ... The above discussion proves the following theorem.
Theorem 3 Let the sequence {x k } ∞ k=1 be generated by method (3.12) as applied to the problem (2.7). Then for any k ≥ 1 we have: Proof Indeed, we have shown that Thus, (3.13) follows from (3.11).
Note that the point v_k in (3.12) can be found in closed form. Consider ... Since the function ℓ_k(·) is linear, this vector is the same for all x ∈ E. Therefore, the point v_k is the solution of the following minimization problem: ... The first-order optimality condition for this problem is as follows: ... Thus, we get the following closed-form expression for its solution: ...
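Assuming the estimating functions have the form ψ_k(x) = ℓ_k(x) + C d_{p+1}(x − x_0) with linear ℓ_k, and writing g_k = ∇ℓ_k (a constant vector), the optimality condition and the resulting closed-form expression can be reconstructed as:

```latex
C\,\|v_k - x_0\|^{p-1}\, B\,(v_k - x_0) + g_k = 0
\quad\Longrightarrow\quad
v_k = x_0 - \left(\frac{\|g_k\|_*}{C}\right)^{1/p} \frac{B^{-1} g_k}{\|g_k\|_*}.
```

Indeed, taking the dual norm of both sides of the optimality condition gives ‖v_k − x_0‖ = (‖g_k‖_*/C)^{1/p}, after which the displayed expression follows by substitution.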

Lower complexity bounds for tensor methods
For constructing functions which are difficult for all tensor methods, it is convenient to assume that E = E* = R^n and B = I_n, the identity n × n matrix. Thus, in this section we work with the standard Euclidean norm. For an integer parameter p ≥ 1, define the following function: η_{p+1}(x) = (1/(p+1)) Σ_{i=1}^n |x^{(i)}|^{p+1}. Clearly, η_{p+1} ∈ F_p. On the other hand, for any x and h ∈ R^n, we have ... Therefore, for all x, y, and h from R^n, by the Cauchy–Schwarz inequality we have ... For an integer parameter k, 2 ≤ k ≤ n, let us define the k × k upper triangular matrix U_k with two nonzero diagonals: ones on the main diagonal and −1 on the superdiagonal. Now we can introduce the n × n upper triangular matrix A_k with the following block structure: ... Note that ... Our parametric family of difficult functions is defined in the following way: f_k(x) = η_{p+1}(A_k x) − ⟨e_1, x⟩, where e_1 = (1, 0, ..., 0)^T. Let us compute its optimal solution from the first-order optimality condition: ..., with ē_k ∈ R^k being the vector of all ones, and 0_{n−k} being the origin in R^{n−k}. Thus, in coordinate form, the vector y*_k can be found from the following equations: ... In other words, y*_k = ē_k, and the vector x*_k = A_k^{−1} ē_k has the coordinates (x*_k)^{(i)} = (k − i + 1)_+, i = 1, ..., n, where (τ)_+ = max{0, τ}. Consequently, we obtain the optimal value (4.5). Let us now describe the abilities of tensor methods of degree p ≥ 2 in generating new test points. We assume that the response of the oracle at a point x̄ ∈ R^n consists of the following collection of multilinear forms: f(x̄), Df(x̄)[·], ..., D^p f(x̄)[·, ..., ·]. Therefore, we assume that the method is able to compute stationary points of polynomial functions of the form φ_{a,γ,m}(h) = Σ_{i=1}^p a_i D^i f(x̄)[h]^i + γ‖h‖^m, with coefficients a ∈ R^p, γ > 0, and m > 1. Denote by S_f(x̄; a, γ, m) the set of all stationary points of this function. Then we can define the linear subspace S_f(x̄) spanned by the union of these sets over all admissible a, γ, and m. Our assumption about the rules of tensor methods is as follows.

Assumption 1
The method generates a sequence of test points {x_k}_{k≥0} satisfying the recursive condition ... Note that this assumption is satisfied for the vast majority of first-order, second-order, and tensor methods.

Lemma 2 Any tensor method satisfying Assumption 1 and minimizing the function f_t(·), starting from the point x_0 = 0, generates test points {x_k}_{k≥0} satisfying the condition (4.8). Proof Let us first prove that the inclusion x ∈ R^n_k with k ≥ 1 implies S_{f_t}(x) ⊆ R^n_{k+1}; here R^n_k denotes the subspace of vectors in R^n whose last n − k coordinates are zero. Indeed, since the matrix A_t is upper triangular, this inclusion ensures that y := A_t x ∈ R^n_k. Therefore, all derivatives of the function f_t(·) along a direction h ∈ R^n have the following form: ..., with certain coefficients d_{i,k}, i = 1, ..., n, k = 1, ..., p. This means that the gradients of these derivatives in h are as follows: ... Hence, since the regularization term in definition (4.6) is formed by the standard Euclidean norm, all stationary points of this function belong to R^n_{k+1}. Now we can prove the statement of the lemma by induction. For k = 0, we have x_0 = 0, and therefore ... for all h ∈ R^n. Consequently, the stationary points of all possible functions φ_{a,γ,m}(·) belong to R^n_1, implying S_{f_t}(x_0) = R^n_1. Thus, x_1 belongs to R^n_1 by Assumption 1. Assume now that x_i ∈ R^n_k for all i = 1, ..., k, for some k ≥ 1. Then, as we have already seen, S_{f_t}(x_i) ⊆ R^n_{k+1}. Hence, the inclusion (4.8) follows from Assumption 1.
Now we can prove the main statement of this section.
Let {x_k}_{k≥0} be the sequence of test points generated by a method M satisfying Assumption 1, and let x* be the solution of problem (2.7). Then for all t ≥ 1 such that 2t + 1 ≤ n we have ... At the same time, for all x, y, h ∈ R^n we have ... Therefore, L_p(f_{2t+1}) ≤ 2^p p!, and we have ...
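The structure of the difficult functions can be checked numerically. The sketch below (with illustrative sizes and p = 3) builds an upper bidiagonal matrix A_k with ones on the diagonal and −1 on the superdiagonal of its leading k × k block, and verifies that x*_k = A_k^{-1} ē_k has coordinates (k − i + 1)_+ and satisfies the first-order optimality condition of f_k:

```python
import numpy as np

def build_A(n, k):
    """Upper triangular A_k: the leading k x k block has two nonzero
    diagonals (1 on the diagonal, -1 on the superdiagonal); identity
    elsewhere, so that A_k is invertible."""
    A = np.eye(n)
    for i in range(k - 1):
        A[i, i + 1] = -1.0
    return A

n, k, p = 8, 5, 3
A = build_A(n, k)

# y*_k = (1, ..., 1, 0, ..., 0): ones in the first k coordinates.
y_star = np.concatenate([np.ones(k), np.zeros(n - k)])
x_star = np.linalg.solve(A, y_star)

# Coordinates (k - i + 1)_+ in 1-based indexing.
expected = np.maximum(k - np.arange(n), 0.0)
print(x_star, expected)

# Optimality of f_k(x) = eta_{p+1}(A x) - <e_1, x>:
# grad f_k(x) = A^T (|y|^(p-1) * y) - e_1 with y = A x, vanishing at x*_k.
y = A @ x_star
grad = A.T @ (np.abs(y) ** (p - 1) * y)
grad[0] -= 1.0
print(np.linalg.norm(grad))
```

The second printed value is (numerically) zero, confirming that x*_k is indeed the stationary point of the difficult function.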

Third-order methods: implementation details
Tensor optimization methods, presented in Sects. 2 and 3, are based on the solution of the auxiliary optimization problem (2.6). In the existing literature on tensor methods [6,7,20,32], it was solved by standard local techniques of Nonconvex Optimization. However, we now know, by Theorem 1, that this problem is convex. Hence, it is solvable by standard and very efficient methods of Convex Optimization.
Since we need to solve this problem at each iteration of the methods, its complexity significantly affects the total computational time. Since the objective function in problem (2.6) is a convex multivariate polynomial, there could exist special efficient algorithms for finding its solution. Unfortunately, at this moment the authors have not managed to find such methods in the literature. Therefore, in this section we present a special approach for solving problem (2.6) with the third-degree Taylor approximation, which is based on the recently developed optimization framework of relatively smooth functions (see [4,21]).
Let us fix an arbitrary x ∈ E. Consider the following multivariate polynomial of degree three: the Taylor model Φ(h) = f(x) + ⟨∇f(x), h⟩ + ½⟨∇²f(x)h, h⟩ + (1/6) D³f(x)[h]³. (Since in this section we work only with third-order approximations, we drop the corresponding index.) Here D³f(x)[h_1, h_2] is a symmetric bilinear vector function, and D³f(x)[h] is a linear function of h ∈ E, whose values are self-adjoint linear operators from E to E* (like Hessians).
Then we can define its gradient and Hessian as follows: ∇Φ(h) = ∇f(x) + ∇²f(x)h + ½D³f(x)[h]², ∇²Φ(h) = ∇²f(x) + D³f(x)[h]. In this section, our main class of functions is F_3, composed of convex functions which are three times continuously differentiable and for which the constant L_3 is finite. As we have shown in Theorem 1, our assumptions imply interesting relations between the derivatives. Let us derive a consequence of the matrix inequality (2.3).

Lemma 3
For any h ∈ E and τ > 0, we have (5.2). Consequently, for any h, u ∈ E, we get (5.3). Proof Let us fix arbitrary directions u, h ∈ E. Then, in view of relation (2.3), we have ... Thus, replacing h by τh with τ > 0 and dividing the resulting inequality by τ, we get ... This is equivalent to the left-hand side of the matrix inequality (5.2). Its right-hand side can be obtained by replacing h with −h, which gives ... Minimizing the right-hand side of this inequality in τ, we get (5.3).
Let us now look at our auxiliary minimization problem (5.4): ..., where d_4(h) = ¼‖h‖⁴. In view of Theorem 1, the function Ω_{x,M}(·) is convex for M ≥ L_3. Then we have proved that ... On the other hand, ... Thus, we have proved the following lemma.
Then the function Ω_{x,M}(·) satisfies the strong relative smoothness condition (5.7) with respect to the function ρ_x(·), where κ(τ) = (τ + 1)/(τ − 1). As shown in [21], condition (5.7) allows us to solve problem (5.4) very efficiently by a kind of primal gradient method. In accordance with this approach, we need to define the Bregman distance of the function ρ_x(·), β_{ρ_x}(u, v) = ρ_x(v) − ρ_x(u) − ⟨∇ρ_x(u), v − u⟩, and iterate the process ... In our case, this method has the following form (5.8): ... In accordance with Theorem 3.1 in [21], the rate of convergence of this method is as follows: ..., where h* is the unique optimal solution of problem (5.4).
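The structure of process (5.8) can be sketched numerically. In the minimal example below, the reference function is taken as ρ(h) = ½⟨Hh, h⟩ + (β/4)‖h‖⁴, and every Bregman step reduces to a subproblem of the form min_h {⟨c, h⟩ + ½⟨Ah, h⟩ + (γ/4)‖h‖⁴}, solved by bisection over the radius r = ‖h‖. The model coefficients, the constant L_rel, and the choice β = 2M are illustrative assumptions for this sketch, not the paper's exact constants.

```python
import numpy as np

def solve_quartic(A, c, gamma):
    """min_h <c,h> + 1/2 h^T A h + (gamma/4)||h||^4.
    Stationarity: h(r) = -(A + gamma r^2 I)^{-1} c with r = ||h||;
    ||h(r)|| decreases in r, so bisection finds the unique fixed point."""
    n = len(c)
    lo, hi = 0.0, np.linalg.norm(np.linalg.solve(A, c))
    for _ in range(100):
        r = 0.5 * (lo + hi)
        h = -np.linalg.solve(A + gamma * r * r * np.eye(n), c)
        if np.linalg.norm(h) > r:
            lo = r
        else:
            hi = r
    return h

# Synthetic convex third-order model (illustrative coefficients):
# Omega(h) = <g,h> + 1/2 h^T H h + (a/6)(sum h)^3 + (M/4)||h||^4
g = np.array([1.0, -1.0, 0.5])
H = 2.0 * np.eye(3)
a, M = 0.1, 1.0

def grad_Omega(h):
    s = h.sum()
    return g + H @ h + 0.5 * a * s * s * np.ones(3) + M * (h @ h) * h

# Reference rho(h) = 1/2 h^T H h + (beta/4)||h||^4, with a conservative
# relative-smoothness constant L_rel (both are assumptions of this sketch).
beta, L_rel = 2.0 * M, 2.0
grad_rho = lambda h: H @ h + beta * (h @ h) * h

h = np.zeros(3)
for _ in range(200):
    # Bregman step: argmin <grad_Omega(h_k) - L_rel*grad_rho(h_k), h> + L_rel*rho(h)
    c = grad_Omega(h) - L_rel * grad_rho(h)
    h = solve_quartic(L_rel * H, c, L_rel * beta)
print(np.linalg.norm(grad_Omega(h)))  # stationarity residual
```

Each fixed point of the iteration satisfies ∇Ω(h) = 0, and on this tame example the residual decreases linearly, illustrating the behavior predicted by the relative smoothness framework.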
As we can see, algorithm (5.8) is very fast. Its linear rate of convergence depends only on the absolute constant τ > 1, which can be chosen reasonably close to one in order to allow faster convergence of the main tensor methods (2.16) and (3.12). Let us discuss two potentially expensive operations in the implementation of method (5.8).

Computation of the gradient ∇Ω_{x,M}(h). Note that
In this formula, only the computation of the third derivative may be expensive. However, this difficulty can be resolved using the technique of automatic differentiation (see, for example, [19]). Indeed, assume we have a sequence of operations for computing the function value f(x) with computational complexity T. Let us fix a direction h ∈ E. Then, by forward differentiation, we can automatically generate a sequence of operations for computing the value ⟨∇²f(x)h, h⟩ with computational complexity O(T). Now, by backward differentiation in x, we can compute the gradient of this function, which is exactly D³f(x)[h]², also with computational complexity O(T). Thus, the oracle complexity of method (5.8) is proportional to the complexity of computing the function value f(x). Another example of a simple computation of the third derivative is provided by a separable objective function. Assume that E = R^n and f(x) = Σ_{i=1}^N f_i(⟨a_i, x⟩), where a_i ∈ R^n and the univariate functions f_i(·) are three times continuously differentiable, i = 1, ..., N. Then the vector D³f(x)[h]² has the following representation: D³f(x)[h]² = Σ_{i=1}^N f_i'''(⟨a_i, x⟩) ⟨a_i, h⟩² a_i. Thus, for solving problem (5.4), we need to compute in advance all values ⟨a_i, x⟩, i = 1, ..., N. Each step of method (5.8) then reduces to an auxiliary subproblem of the form (5.10), with some A ⪰ 0 and γ > 0. Note that at all these iterations only the vector c and the coefficient γ change, while the matrix A = ∇²f(x) remains the same. Therefore, before algorithm (5.8) starts working, it is reasonable to transform this matrix into tri-diagonal form: A = U T U^T, where U ∈ R^{n×n} is an orthogonal matrix (UU^T = I), and T ∈ R^{n×n} is tridiagonal.
Denoting now c̄ = U^T c, we have: ... Thus, the solution of the primal problem can be retrieved from a solution of the univariate dual problem. The complexity of computing its function value and derivative is linear in n. Moreover, since its objective function is strongly convex and infinitely differentiable, all reasonable one-dimensional methods have a global linear rate of convergence and quadratic convergence in the end. Let us estimate the total computational complexity of method (5.8), assuming that the computation of the value of the objective function takes time T_f. Assume also that its gradient, the product of its Hessian with a vector, and the value of its third derivative on two identical vectors can be computed by fast backward differentiation (so that the complexity of all these operations is O(T_f)). Then the most expensive operations in this method are as follows.
• Computation of the Hessian ∇²f(x) and its tri-diagonal factorization: O(nT_f + n³) operations. Thus, we come to the following estimate of the total complexity: $O\left(n T_f + n^3 + \left(T_f + n^2 + n \ln \frac{1}{\delta}\right) \ln \frac{1}{\epsilon}\right)$.
This is the same order of complexity as that of one iteration of Trust Region Methods [15] and of the usual Cubic Regularization [27,31]. However, we can expect that the third-order methods converge much faster. For readers who are not interested in all these computational details, we just mention that the Galahad Optimization Library [16] has special subroutines for solving auxiliary problems of the form (5.10).
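Returning to the separable example above: the representation of D³f(x)[h]² is easy to check numerically. In this sketch (illustrative data, with f_i(t) = t⁴ so that f_i'''(t) = 24t), the separable formula is verified against a full contraction of the explicit third-derivative tensor:

```python
import numpy as np

# f(x) = sum_i f_i(<a_i, x>) with f_i(t) = t^4, so f_i'''(t) = 24 t.
A = np.array([[1.0, 0.0, 2.0],
              [0.5, -1.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])   # rows are the vectors a_i
x = np.array([0.3, -0.2, 0.1])
h = np.array([1.0, 2.0, -1.0])

t = A @ x                # inner products <a_i, x>
f3 = 24.0 * t            # third derivatives f_i'''(<a_i, x>)

# Separable representation: D^3 f(x)[h]^2 = sum_i f_i'''(t_i) <a_i,h>^2 a_i
d3_sep = (f3 * (A @ h) ** 2) @ A

# Check against the explicit 3-tensor T = sum_i f_i'''(t_i) a_i ⊗ a_i ⊗ a_i,
# contracted twice with h.
T = np.einsum('i,ij,ik,il->jkl', f3, A, A, A)
d3_full = np.einsum('jkl,k,l->j', T, h, h)
print(np.allclose(d3_sep, d3_full))
```

The separable formula costs O(Nn) per evaluation, in contrast to the O(n³) storage and contraction cost of the explicit tensor.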

Discussion
In this paper, we took an important step towards the practical implementation of tensor methods in unconstrained convex optimization. We have shown that the auxiliary optimization problems in these schemes can be reduced to minimization of a convex multivariate polynomial. In the important case of the third-order tensor, we have proved that this problem can be efficiently solved by a special optimization scheme derived from the relative smoothness condition.
Our results highlight several interesting questions. One of the direct consequences of our approach is a systematic way of generating convex multivariate polynomials. Is it possible to minimize them by tools of Algebraic Geometry (see [23] for related techniques such as sums of squares), or do we need to treat them using an appropriate technique from Convex Optimization? The results of Sect. 5 demonstrate a probably unbeatable superiority of the optimization technique for third-order polynomials. But what happens with polynomials of higher degree?
One of the difficult unsolved problems in our approach is related to dynamic adjustment of the Lipschitz constant for the highest derivative. This dynamic estimate should not be much bigger than the actual Lipschitz constant. On the other hand, it must ensure convexity of the auxiliary problem solved at each iteration of the tensor methods. This question is clearly crucial for the practical efficiency of the high-order schemes.
Simple comparison of the complexity bounds in Sects. 3 and 4 shows that we failed to develop an optimal tensor scheme. The missing factor in the complexity estimates is of the order of $O\left(1/\epsilon^{\frac{p-1}{(p+1)(3p+1)}}\right)$. For p = 3, this factor is of the order of $O\left(1/\epsilon^{1/20}\right)$. This means that, from the viewpoint of practical efficiency, the cost of one iteration of the hypothetical optimal scheme must be of the same order as that of the accelerated tensor method (3.12). Any additional logarithmic factor in the complexity bound of this "optimal" method would definitely kill its tiny superiority in the convergence rate.
In recent years, we have seen an increasing interest in universal methods, which adjust to the best possible Hölder condition, instead of the Lipschitz one, in the course of the optimization process (see [14,17,29]). Of course, it would be very interesting to extend this philosophy to the tensor minimization schemes. Another important extension could be the treatment of constraints, either in functional form or in the framework of composite minimization [28]. The main difficulty here is related to the complexity of the auxiliary optimization problems.
One of the main restrictions for the practical implementation of our results is the necessity of knowing the Lipschitz constant of the corresponding derivative. If our estimate is too small, then the auxiliary problem (2.6) may lose convexity. Consequently, we would lose the fast convergence of the auxiliary process (5.8). However, this observation gives us a clue for tuning this constant: if we see that this process is too slow, it means that our estimate is too small. But, of course, it would be very interesting to find a recipe with better theoretical justification.