Local convergence of tensor methods

In this paper, we study local convergence of high-order Tensor Methods for solving convex optimization problems with composite objective. We justify local superlinear convergence under the assumption of uniform convexity of the smooth component, having Lipschitz-continuous high-order derivative. The convergence both in function value and in the norm of minimal subgradient is established. Global complexity bounds for the Composite Tensor Method in convex and uniformly convex cases are also discussed. Lastly, we show how local convergence of the methods can be globalized using the inexact proximal iterations.


Introduction
Motivation. In Nonlinear Optimization, it seems to be a natural idea to increase the performance of numerical methods by employing high-order oracles. However, the main obstacle to this approach is the prohibitive complexity of the corresponding Taylor approximations formed by high-order multidimensional polynomials, which are difficult to store, handle, and minimize. If we go just one step above the commonly used quadratic approximation, we get a multidimensional polynomial of degree three, which is never convex. Consequently, its usefulness for optimization methods is questionable.
However, recently in [18] it was shown that the Taylor polynomials of convex functions have a very interesting structure. It appears that their augmentation by a power of the Euclidean norm with a reasonably big coefficient gives us a global upper convex model of the objective function, which keeps all advantages of the local high-order approximation.
One of the classical and well-known results in Nonlinear Optimization is related to the local quadratic convergence of Newton's method [13,19]. Later on, it was generalized to the case of composite optimization problems [14], where the objective is represented as a sum of two convex components: smooth, and possibly nonsmooth but simple. Local superlinear convergence of the Incremental Newton method for finite-sum minimization problems was established in [24].
The study of high-order numerical methods for solving nonlinear equations dates back to the work of Chebyshev in 1838, where scalar methods of order three and four were proposed [2]. Methods of arbitrary order for solving nonlinear equations were studied in [6].
A big step in second-order optimization theory was made in [22], where Cubic regularization of the Newton method with its global complexity estimates was proposed. Additionally, the local superlinear convergence was justified. See also [1] for the local analysis of the Adaptive cubic regularization methods.
Our paper aims to study local convergence of high-order methods, generalizing the corresponding results from [22] in several ways. We establish local superlinear convergence of the Tensor Method [18] of degree p ≥ 2, in the case when the objective is composite and its smooth part is uniformly convex of arbitrary degree q from the interval 2 ≤ q < p + 1. For strongly convex functions (q = 2), this gives local convergence of degree p.
Contents. We formulate our problem of interest and define a step of the Regularized Composite Tensor Method in Sect. 2. Then, we declare some of its properties, which are required for our analysis.
In Sect. 3, we prove local superlinear convergence of the Tensor Method in function value, and in the norm of minimal subgradient, under the assumption of uniform convexity of the objective.
In Sect. 4, we discuss global behavior of the method and justify sublinear and linear global rates of convergence for convex and uniformly convex cases, respectively.
One application of our developments is provided in Sect. 5. We show how local convergence can be applied for computing an inexact step in proximal methods. A global sublinear rate of convergence for the resulting scheme is also given.

Notations and generalities.
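In what follows, we denote by $E$ a finite-dimensional real vector space, and by $E^*$ its dual space, composed of linear functions on $E$. For such a function $s \in E^*$, we denote by $\langle s, x \rangle$ its value at $x \in E$. Using a self-adjoint positive-definite operator $B : E \to E^*$ (notation $B = B^* \succ 0$), we can endow these spaces with mutually conjugate Euclidean norms:
$$\|x\| = \langle Bx, x \rangle^{1/2}, \quad x \in E, \qquad \|g\|_* = \langle g, B^{-1} g \rangle^{1/2}, \quad g \in E^*.$$
For a smooth function $f : \operatorname{dom} f \to \mathbb{R}$ with convex and open domain $\operatorname{dom} f \subseteq E$, denote by $\nabla f(x)$ its gradient, and by $\nabla^2 f(x)$ its Hessian evaluated at point $x \in \operatorname{dom} f$.
In what follows, we often work with directional derivatives. For $p \geq 1$, denote by
$$D^p f(x)[h_1, \dots, h_p]$$
the directional derivative of function $f$ at $x$ along directions $h_i \in E$, $i = 1, \dots, p$. If all directions $h_1, \dots, h_p$ are the same, we apply a simpler notation
$$D^p f(x)[h]^p, \quad h \in E.$$
Its norm is defined in the standard way:
$$\|D^p f(x)\| = \max_{h_1, \dots, h_p} \Big\{ \big| D^p f(x)[h_1, \dots, h_p] \big| : \|h_i\| \leq 1, \; i = 1, \dots, p \Big\} = \max_{h} \Big\{ \big| D^p f(x)[h]^p \big| : \|h\| \leq 1 \Big\}$$
(for the last equation see, for example, Appendix 1 in [21]). Similarly, we define
$$\|D^p f(x) - D^p f(y)\| = \max_{h} \Big\{ \big| D^p f(x)[h]^p - D^p f(y)[h]^p \big| : \|h\| \leq 1 \Big\}.$$
In particular, for any $x \in \operatorname{dom} f$ and $h_1, h_2 \in E$, we have
$$D f(x)[h_1] = \langle \nabla f(x), h_1 \rangle, \qquad D^2 f(x)[h_1, h_2] = \langle \nabla^2 f(x) h_1, h_2 \rangle.$$
Thus, for the Hessian, our definition corresponds to the spectral norm of the self-adjoint linear operator (maximal modulus of all eigenvalues computed with respect to $B \succ 0$).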
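To make the pair of conjugate norms concrete, here is a minimal sketch in two dimensions. The particular matrix `B`, its hand-computed inverse `B_inv`, and the helper names are illustrative assumptions, not objects from the paper. The sketch evaluates both norms and the Cauchy-Schwarz-type inequality $|\langle s, x \rangle| \leq \|s\|_* \|x\|$, which becomes an equality for $s = Bx$.

```python
import math

# Hypothetical 2x2 self-adjoint positive-definite operator B and its inverse.
B = [[2.0, 1.0], [1.0, 2.0]]
B_inv = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]

def apply(M, v):
    """Matrix-vector product for 2x2 matrices stored as nested lists."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def pairing(s, x):
    """Value <s, x> of the linear function s at the point x."""
    return s[0] * x[0] + s[1] * x[1]

def primal_norm(x):
    """||x|| = <Bx, x>^(1/2)."""
    return math.sqrt(pairing(apply(B, x), x))

def dual_norm(s):
    """||s||_* = <s, B^{-1} s>^(1/2)."""
    return math.sqrt(pairing(s, apply(B_inv, s)))

x = [1.0, -2.0]
s = [3.0, 0.5]
cauchy_schwarz_holds = abs(pairing(s, x)) <= dual_norm(s) * primal_norm(x)
# For s = Bx the inequality is tight: <Bx, x> = ||Bx||_* * ||x|| = ||x||^2.
tight_gap = abs(pairing(apply(B, x), x) - dual_norm(apply(B, x)) * primal_norm(x))
```

With $B$ equal to the identity, both functions reduce to the ordinary Euclidean norm.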
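Finally, the Taylor approximation of function $f(\cdot)$ at $x \in \operatorname{dom} f$ is defined as follows:
$$\Omega_{x,p}(y) \stackrel{\mathrm{def}}{=} f(x) + \sum_{k=1}^{p} \frac{1}{k!} D^k f(x)[y - x]^k, \quad y \in E.$$
Consequently, for all $y \in E$ we have
$$\nabla \Omega_{x,p}(y) = \sum_{k=1}^{p} \frac{1}{(k-1)!} D^k f(x)[y - x]^{k-1},$$
$$\nabla^2 \Omega_{x,p}(y) = \sum_{k=2}^{p} \frac{1}{(k-2)!} D^k f(x)[y - x]^{k-2}. \tag{1.4}$$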

Main inequalities
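In this paper, we consider the following composite convex minimization problem
$$\min_{x \in \operatorname{dom} h} \Big\{ F(x) = f(x) + h(x) \Big\}, \tag{2.1}$$
where $h : E \to \mathbb{R} \cup \{+\infty\}$ is a simple proper closed convex function and $f \in C^{p,p}(\operatorname{dom} h)$ for a certain $p \geq 2$. In other words, we assume that the $p$th derivative of function $f$ is Lipschitz continuous:
$$\|D^p f(x) - D^p f(y)\| \leq L_p \|x - y\|, \quad x, y \in \operatorname{dom} h. \tag{2.2}$$
Assuming that $L_p < +\infty$, by the standard integration arguments we can bound the residual between the function value and its Taylor approximation $\Omega_{x,p}(y)$ of degree $p$ at $x$:
$$|f(y) - \Omega_{x,p}(y)| \leq \frac{L_p}{(p+1)!} \|y - x\|^{p+1}. \tag{2.3}$$
Applying the same reasoning to functions $\langle \nabla f(\cdot), h \rangle$ and $\langle \nabla^2 f(\cdot) h, h \rangle$ with direction $h \in E$ being fixed, we get the following guarantees:
$$\|\nabla f(y) - \nabla \Omega_{x,p}(y)\|_* \leq \frac{L_p}{p!} \|y - x\|^{p}, \tag{2.4}$$
$$\|\nabla^2 f(y) - \nabla^2 \Omega_{x,p}(y)\| \leq \frac{L_p}{(p-1)!} \|y - x\|^{p-1}, \tag{2.5}$$
which are valid for all $x, y \in \operatorname{dom} h$.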
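The Taylor residual bound $|f(y) - \Omega_{x,p}(y)| \leq \frac{L_p}{(p+1)!}\|y - x\|^{p+1}$ can be checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: the test function $f(x) = \log(1 + e^x)$, the degree $p = 2$, the grid, and all helper names are chosen here and do not come from the paper. For this function $|f'''| \leq 1/(6\sqrt{3})$, which is taken as $L_2$.

```python
import math

def softplus(x):
    """f(x) = log(1 + exp(x)), a smooth convex function on the real line."""
    return math.log1p(math.exp(x))

def sigmoid(x):
    """First derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-x))

def taylor2(x, y):
    """Second-order Taylor approximation of softplus at x, evaluated at y."""
    s = sigmoid(x)
    d = y - x
    return softplus(x) + s * d + 0.5 * s * (1.0 - s) * d * d

# Lipschitz constant of f'' for softplus: the maximum of |f'''| equals 1 / (6 * sqrt(3)).
L2 = 1.0 / (6.0 * math.sqrt(3.0))

def residual_bound_holds(x, y, p=2):
    """Check |f(y) - Taylor(y)| <= L_p / (p+1)! * |y - x|^(p+1), up to rounding."""
    residual = abs(softplus(y) - taylor2(x, y))
    bound = L2 / math.factorial(p + 1) * abs(y - x) ** (p + 1)
    return residual <= bound + 1e-12

grid = [i / 4.0 for i in range(-20, 21)]
all_ok = all(residual_bound_holds(x, y) for x in grid for y in grid)
```

The small additive tolerance only absorbs floating-point rounding; the inequality itself is the Lagrange form of the Taylor remainder.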
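Let us define now one step of the Regularized Composite Tensor Method (RCTM) of degree $p \geq 2$:
$$T \equiv T_H(x) \in \operatorname*{Argmin}_{y \in E} \Big\{ \Omega_{x,p}(y) + \frac{H}{(p+1)!} \|y - x\|^{p+1} + h(y) \Big\}, \tag{2.6}$$
where $\Omega_{x,p}$ is the Taylor approximation of $f$ of degree $p$ at $x$. It can be shown that for $H \geq pL_p$ the auxiliary optimization problem in (2.6) is convex (see Theorem 1 in [18]). This condition is crucial for implementability of our methods, and we always assume it to be satisfied. Let us write down the first-order optimality condition for the auxiliary optimization problem in (2.6):
$$\Big\langle \nabla \Omega_{x,p}(T) + \frac{H}{p!} \|T - x\|^{p-1} B(T - x), \; y - T \Big\rangle + h(y) \geq h(T)$$
for all $y \in \operatorname{dom} h$. In other words, for vector
$$h'(T) \stackrel{\mathrm{def}}{=} -\Big( \nabla \Omega_{x,p}(T) + \frac{H}{p!} \|T - x\|^{p-1} B(T - x) \Big)$$
we have $h'(T) \in \partial h(T)$. This fact explains our notation
$$F'(T) \stackrel{\mathrm{def}}{=} \nabla f(T) + h'(T) \in \partial F(T).$$
Let us present some properties of the point $T = T_H(x)$. First of all, we need some bounds for the norm of vector $F'(T)$. Note that
$$\Big\| F'(T) + \frac{H}{p!} \|T - x\|^{p-1} B(T - x) \Big\|_* = \|\nabla f(T) - \nabla \Omega_{x,p}(T)\|_* \leq \frac{L_p}{p!} \|T - x\|^{p}. \tag{2.11}$$
Consequently,
$$\|F'(T)\|_* \leq \frac{L_p + H}{p!} \|T - x\|^{p}. \tag{2.12}$$
Secondly, we use the following lemma.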
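For intuition, one step of the regularized method and the subgradient bound (2.12) can be tried out in the simplest setting. The following is a minimal sketch under assumptions made here for illustration: one dimension, $p = 2$, no nonsmooth part ($h \equiv 0$), the hypothetical objective $f(x) = \log(1 + e^x) + x^2/2$, and $H = pL_p$. In that setting the auxiliary problem has a closed-form solution, and the bound reads $|f'(T)| \leq \frac{L_2 + H}{2}|T - x|^2$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical smooth objective (h = 0): f(x) = log(1 + exp(x)) + x**2 / 2.
def f(x):
    return math.log1p(math.exp(x)) + 0.5 * x * x

def grad(x):
    return sigmoid(x) + x

def hess(x):
    s = sigmoid(x)
    return s * (1.0 - s) + 1.0

L2 = 1.0 / (6.0 * math.sqrt(3.0))  # Lipschitz constant of f'' (maximum of |f'''|)
H = 2.0 * L2                       # regularization parameter H = p * L_p with p = 2

def tensor_step_p2(x):
    """Minimize g*d + c*d**2/2 + (H/6)*|d|**3 over d and return x + d."""
    g, c = grad(x), hess(x)
    if g == 0.0:
        return x
    # The step length r >= 0 solves c*r + (H/2)*r**2 = |g|;
    # this algebraically equivalent form avoids cancellation for small |g|.
    r = 2.0 * abs(g) / (c + math.sqrt(c * c + 2.0 * H * abs(g)))
    return x - math.copysign(r, g)

x = 2.0
T = tensor_step_p2(x)
# Right-hand side of the bound on the gradient at the new point.
bound = (L2 + H) / math.factorial(2) * abs(T - x) ** 2
```

In higher dimensions or with a nontrivial $h$, the auxiliary problem generally has no closed form and needs its own solver; the sketch does not cover that case.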
Lemma 1 Let β > 1 and H = βL p .Then In particular, if β = p, then This means that Then inequality (2.15) can be rewritten as follows: .

It remains to note that
The main goal of this paper consists in analyzing the local behavior of the Regularized Composite Tensor Method (RCTM):
$$x_0 \in \operatorname{dom} h, \qquad x_{k+1} = T_H(x_k), \quad k \geq 0, \tag{3.1}$$
as applied to the problem (2.1). In order to prove local superlinear convergence of this scheme, we need one more assumption.

Assumption 1
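The objective in problem (2.1) is uniformly convex of degree $q \geq 2$. Thus, for all $x, y \in \operatorname{dom} h$ and for all $G_x \in \partial F(x)$, $G_y \in \partial F(y)$, it holds:
$$\langle G_x - G_y, x - y \rangle \geq \sigma_q \|x - y\|^q, \tag{3.2}$$
for certain $\sigma_q > 0$.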
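It is well known that this assumption guarantees the uniform convexity of the objective function (see, for example, Lemma 4.2.1 in [19]):
$$F(y) \geq F(x) + \langle G_x, y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^q, \quad x, y \in \operatorname{dom} h, \tag{3.3}$$
where $G_x$ is an arbitrary subgradient from $\partial F(x)$. Therefore, minimizing both sides in $y$ and denoting $F^* = \min_{y} F(y)$, we get
$$F(x) - F^* \leq \frac{q-1}{q} \Big( \frac{1}{\sigma_q} \Big)^{\frac{1}{q-1}} \|G_x\|_*^{\frac{q}{q-1}}. \tag{3.4}$$
This simple inequality gives us the following local convergence rate for RCTM.

Theorem 1 For any $k \geq 0$ we have
$$F(x_{k+1}) - F^* \leq (q-1) \, q^{\frac{p-q+1}{q-1}} \Big( \frac{1}{\sigma_q} \Big)^{\frac{p+1}{q-1}} \Big( \frac{L_p + H}{p!} \Big)^{\frac{q}{q-1}} \big[ F(x_k) - F^* \big]^{\frac{p}{q-1}}. \tag{3.5}$$

Proof Indeed, for any $k \geq 0$ we have
$$\begin{aligned} F(x_k) - F^* &\geq F(x_k) - F(x_{k+1}) \geq \frac{\sigma_q}{q} \|x_k - x_{k+1}\|^q + \langle F'(x_{k+1}), x_k - x_{k+1} \rangle \\ &\geq \frac{\sigma_q}{q} \|x_k - x_{k+1}\|^q \geq \frac{\sigma_q}{q} \Big( \frac{p!}{L_p + H} \|F'(x_{k+1})\|_* \Big)^{\frac{q}{p}} \\ &\geq \frac{\sigma_q}{q} \Big( \frac{p!}{L_p + H} \Big)^{\frac{q}{p}} \Big( \frac{q \, \sigma_q^{\frac{1}{q-1}}}{q-1} \big( F(x_{k+1}) - F^* \big) \Big)^{\frac{q-1}{p}}, \end{aligned}$$
where we used consecutively (3.3), the nonnegativity of $\langle F'(x_{k+1}), x_k - x_{k+1} \rangle$ (Lemma 1), (2.12), and (3.4). And this is exactly inequality (3.5). ⊓ ⊔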

Corollary 1 If p > q − 1, then method (3.1) has local superlinear rate of convergence for problem (2.1).
Proof Indeed, in this case p/(q − 1) > 1. ⊓ ⊔
For example, if q = 2 (strongly convex function) and p = 2 (Cubic Regularization of the Newton Method), then the rate of convergence is quadratic. If q = 2 and p = 3, then the local rate of convergence is cubic, etc.
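Let us study now the local convergence of the method (3.1) in terms of the norm of the gradient. For any $x \in \operatorname{dom} h$ denote
$$\eta(x) \stackrel{\mathrm{def}}{=} \min_{g \in \partial h(x)} \|\nabla f(x) + g\|_*. \tag{3.6}$$

Theorem 2 For any $k \geq 0$ we have
$$\eta(x_{k+1}) \leq \frac{L_p + H}{p!} \Big[ \frac{1}{\sigma_q} \eta(x_k) \Big]^{\frac{p}{q-1}}. \tag{3.7}$$

Proof Indeed, in view of inequality (3.2), we have
$$\langle \nabla f(x_k) + g_k, x_k - x_{k+1} \rangle \geq \langle F'(x_{k+1}), x_k - x_{k+1} \rangle + \sigma_q \|x_k - x_{k+1}\|^q \geq \sigma_q \|x_k - x_{k+1}\|^q,$$
where $g_k$ is an arbitrary vector from $\partial h(x_k)$, and the last step uses the nonnegativity of $\langle F'(x_{k+1}), x_k - x_{k+1} \rangle$ (Lemma 1). Therefore, we conclude that
$$\eta(x_k) \geq \sigma_q \|x_k - x_{k+1}\|^{q-1}.$$
It remains to use inequality (2.12). ⊓ ⊔

As we can see, the condition for superlinear convergence of the method (3.1) in terms of the norm of the gradient is the same as in Corollary 1: we need to have $\frac{p}{q-1} > 1$, that is $p > q - 1$. Moreover, the local rate of convergence has the same order as that for the residual of the function value.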
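The superlinear decrease of the gradient norm can be observed on a toy problem. The sketch below is an illustration under assumptions chosen here, not an experiment from the paper: one dimension, $h \equiv 0$, the hypothetical objective $F(x) = \log(1 + e^x) + x^2/2$ (so $q = 2$ and $\sigma_q = 1$), $p = 2$, and $H = pL_p$. At every iteration it compares the new gradient norm with a bound of the form $\frac{L_p + H}{p!}\big[\eta(x_k)/\sigma_q\big]^{p/(q-1)}$, which is the shape of the estimate discussed above; the specific constant is treated as an assumption of this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical strongly convex objective (h = 0): F(x) = log(1 + exp(x)) + x**2 / 2.
def F_grad(x):
    return sigmoid(x) + x

def F_hess(x):
    s = sigmoid(x)
    return s * (1.0 - s) + 1.0

P, Q = 2, 2                        # degree of the method, degree of uniform convexity
SIGMA = 1.0                        # F'' >= 1, so F is strongly convex with sigma_2 = 1
L2 = 1.0 / (6.0 * math.sqrt(3.0))  # Lipschitz constant of F''
H = P * L2                         # regularization parameter

def rctm_iterates(x0, steps):
    """Run the degree-2 regularized method in 1-D using the closed-form step."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        g, c = F_grad(x), F_hess(x)
        if g != 0.0:
            # Step length r solves c*r + (H/2)*r**2 = |g|.
            r = 2.0 * abs(g) / (c + math.sqrt(c * c + 2.0 * H * abs(g)))
            x = x - math.copysign(r, g)
        xs.append(x)
    return xs

xs = rctm_iterates(2.0, 6)
etas = [abs(F_grad(x)) for x in xs]  # gradient norms along the iterations
rate = (L2 + H) / math.factorial(P)
predicted = [rate * (eta / SIGMA) ** (P / (Q - 1)) for eta in etas[:-1]]
```

Printing `etas` next to `predicted` shows the gradient norm being roughly squared at every step once the iterates are close to the minimizer, until floating-point precision is reached.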
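According to Theorem 1, the region of superlinear convergence of RCTM in terms of the function value is as follows:
$$Q = \Big\{ x \in \operatorname{dom} h : \; F(x) - F^* \leq \frac{1}{q} \Big( \frac{\sigma_q^{p+1}}{(q-1)^{q-1}} \Big( \frac{p!}{L_p + H} \Big)^{q} \Big)^{\frac{1}{p-q+1}} \Big\}. \tag{3.8}$$
Alternatively, by Theorem 2, in terms of the norm of minimal subgradient (3.6), the region of superlinear convergence looks as follows:
$$G = \Big\{ x \in \operatorname{dom} h : \; \eta(x) \leq \Big( \frac{p!}{L_p + H} \Big)^{\frac{q-1}{p-q+1}} \sigma_q^{\frac{p}{p-q+1}} \Big\}. \tag{3.9}$$
Note that these sets can be very different. Indeed, set $Q$ is a closed and convex neighborhood of the point $x^*$. At the same time, the structure of the set $G$ can be very complex since in general the function $\eta(x)$ is discontinuous. Let us look at a simple example where $h(x) = \operatorname{Ind}_Q(x)$, the indicator function of a closed convex set $Q$.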
Clearly, in this problem $x^* = (0, -1)$, and it can be written in the composite form (2.1) with $h(x) = +\infty$ if $\|x\| > 1$, and $h(x) = 0$ otherwise. Note that for $x \in \operatorname{dom} h \equiv \{x : \|x\| \leq 1\}$, we have Thus, in any neighbourhood of $x^*$, $\eta(x)$ vanishes only along the boundary of the feasible set.
⊓ ⊔ So, the question arises how the Tensor Method (3.1) could come to the region $G$. The answer follows from the inequalities derived in Section 2. Indeed,
Thus, at some moment the norm $\|F'(x_k)\|_*$ will be small enough to enter $G$.

Global complexity bounds
Let us briefly discuss the global complexity bounds of the method (3.1), namely the number of iterations required for coming from an arbitrary initial point $x_0 \in \operatorname{dom} h$ to the region $Q$. First, note that for every step $T = T_H(x)$ of the method with parameter $H \geq pL_p$, we have Therefore, with $x^* \stackrel{\mathrm{def}}{=} \arg\min_{y \in E} F(y)$, which exists by our assumption. Denote by $D$ the maximal radius of the initial level set of the objective, which we assume to be finite: Then, by monotonicity of the method (3.1) and by convexity we conclude In the general convex case, we can prove the global sublinear rate of convergence of the Tensor Method of the order $O(1/k^p)$ [18]. For completeness of presentation, let us prove an extension of this result to the composite case.
Theorem 3 For the method (3.1) with H = pL p we have Proof Indeed, in view of (2.14) and (4.2), we have for every k ≥ 0 ≤ 1, as follows: Then, Lemma 1.1 from [8] provides us with the following guarantee: Therefore, ⊓ ⊔ For a given degree q ≥ 2 of uniform convexity with σ q > 0, and for RCTM of order p ≥ q − 1, let us denote by ω p,q the following condition number : .
Corollary 2 In order to achieve the region Q it is enough to perform 2p • q q (q − 1) q−1 • ω p+1 p p,q iterations of the method.
Proof Plugging (3.8) into (4.3). ⊓ ⊔ We can improve this estimate, knowing that the objective is globally uniformly convex (3.2). Then the linear rate of convergence arises at the first stage, until the method enters the region Q.
Theorem 4 Let σ q > 0 with q ≤ p + 1.Then for the method (3.1) with H = pL p , we have Therefore, for a given ε > 0 to achieve Proof Indeed, for every k ≥ 0

⊓ ⊔
We see that, for RCTM with $p \geq 2$ minimizing the uniformly convex objective of degree $q \leq p + 1$, the condition number $\omega_{p,q}^{1/p}$ is the main factor in the global complexity estimates (4.5) and (4.7). Since in general this number may be arbitrarily big, the complexity estimate $\tilde{O}(\omega_{p,q}^{1/p})$ in (4.7) is much better than the estimate in (4.5), because of the relation $\frac{p+1}{p-q+1} \geq 1$. These global bounds can be improved by using the universal [3,10] and the accelerated [17,9,10,7,28] high-order schemes.
High-order tensor methods for minimizing the gradient norm were developed in [4]. These methods achieve near-optimal global convergence rates, and can be used for coming into the region G (3.9). Note that for composite minimization problems, some modification of these methods is required, which ensures minimization of the subgradient norm.
Finally, let us mention some recent results [20,12], where it was shown that a proper implementation of the third-order schemes by a second-order oracle may lead to a significant acceleration of the methods. However, the relation of these techniques to the local convergence needs further investigation.

Application to proximal methods
Let us discuss now a general approach, which uses the local convergence of the methods for justifying the global performance of proximal iterations.
The proximal method [23] is one of the classical methods in theoretical optimization. Every step of the method for solving problem (2.1) is a minimization of the regularized objective:
$$x_{k+1} = \operatorname*{argmin}_{x \in E} \Big\{ a_{k+1} F(x) + \frac{1}{2} \|x - x_k\|^2 \Big\}, \quad k \geq 0, \tag{5.1}$$
where $\{a_k\}_{k \geq 1}$ is a sequence of positive coefficients, related to the iteration counter.
Of course, in general, we can hope only to solve subproblem (5.1) inexactly. The questions of practical implementations and possible generalizations of the proximal method are still in the area of intensive research (see, for example, [11,27,26,25]).
One simple observation on the subproblem (5.1) is that it is 1-strongly convex. Therefore, if we were able to pick an initial point from the region of superlinear convergence (3.8) or (3.9), we could minimize it very quickly by RCTM of degree $p \geq 2$ up to arbitrary accuracy. In this section, we are going to investigate this approach. For the resulting scheme, we will prove the global rate of convergence of the order $\tilde{O}(1/k^{\frac{p+1}{2}})$. Denote by $\Phi_{k+1}$ the regularized objective from (5.1):
$$\Phi_{k+1}(x) \stackrel{\mathrm{def}}{=} a_{k+1} F(x) + \frac{1}{2} \|x - x_k\|^2.$$
We fix a sequence of accuracies $\{\delta_k\}_{k \geq 1}$ and relax the assumption on exact minimization in (5.1). Now, at every step we need to find a point $x_{k+1}$ and a corresponding subgradient vector $g_{k+1} \in \partial \Phi_{k+1}(x_{k+1})$ with bounded norm:
$$\|g_{k+1}\|_* \leq \delta_{k+1}. \tag{5.2}$$
The following global convergence result holds for the general proximal method with inexact minimization criterion (5.2).
Theorem 5 Assume that there exists a minimizer $x^* \in \operatorname{dom} h$ of the problem (2.1). Then, for any $k \geq 1$, we have where Proof First, let us prove that for all $k \geq 0$ and for every $x \in \operatorname{dom} h$, we have where This is obviously true for $k = 0$. Let it hold for some $k \geq 0$. Consider the step number $k + 1$ of the inexact proximal method. By condition (5.2), we have Equivalently, (5.5) Therefore, using the inductive assumption and strong convexity of $\Phi_{k+1}(\cdot)$, we conclude Thus, inequality (5.4) is valid for all $k \geq 0$. Now, plugging $x \equiv x^*$ into (5.4), we have (5.6) In order to finish the proof, it is enough to show that $\alpha_k \leq R_k(\delta)$. Indeed, Therefore, ⊓ ⊔
Now, we are ready to use the result on the local superlinear convergence of RCTM in the norm of subgradient (Theorem 2), in order to minimize $\Phi_{k+1}(\cdot)$ at every step of the inexact proximal method.
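Note that
$$\partial \Phi_{k+1}(x) = a_{k+1} \partial F(x) + B(x - x_k),$$
and it is natural to start the minimization process from the previous point $x_k$, for which $\partial \Phi_{k+1}(x_k) = a_{k+1} \partial F(x_k)$. Let us also notice that the Lipschitz constant of the $p$th derivative ($p \geq 2$) of the smooth part of $\Phi_{k+1}$ is $a_{k+1} L_p$. Using our previous notation, one step of RCTM can be written as follows: where $H = a_{k+1} p L_p$. Then, a sufficient condition for $z = x_k$ to be in the region of superlinear convergence (3.9) is
$$a_{k+1} \|F'(x_k)\|_* \leq \Big( \frac{p!}{(p+1) \, a_{k+1} L_p} \Big)^{\frac{1}{p-1}},$$
or, equivalently,
$$a_{k+1} \leq \Big( \frac{p!}{(p+1) L_p} \Big)^{\frac{1}{p}} \frac{1}{\|F'(x_k)\|_*^{\frac{p-1}{p}}}.$$
To be sure that $x_k$ is strictly inside the region, we can pick: Note that this rule requires fixing an initial subgradient $F'(x_0) \in \partial F(x_0)$, in order to choose $a_1$.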
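The overall structure, an outer proximal loop whose subproblems are solved inexactly by a regularized second-order inner method, can be sketched in one dimension. Everything specific below is an assumption made for illustration and differs from the scheme analyzed in the text in at least two ways: the coefficient `a` is a fixed constant rather than being chosen from the subgradient norm, and the accuracies `delta = 1e-3 / k**2` are an arbitrary summable sequence. The objective $F(x) = \log(1 + e^x) - 0.3x$ is a hypothetical convex test function with $h \equiv 0$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical convex test objective (h = 0): F(x) = log(1 + exp(x)) - 0.3 * x.
def F_grad(x):
    return sigmoid(x) - 0.3

def F_hess(x):
    s = sigmoid(x)
    return s * (1.0 - s)

L2 = 1.0 / (6.0 * math.sqrt(3.0))  # Lipschitz constant of F''

def inexact_prox_step(x_k, a, delta, max_inner=100):
    """Approximately minimize Phi(z) = a*F(z) + (z - x_k)**2 / 2 until |Phi'(z)| <= delta."""
    H = 2.0 * a * L2  # H = p * (a * L_p) with p = 2
    z = x_k           # the inner method starts from the previous outer point
    for _ in range(max_inner):
        g = a * F_grad(z) + (z - x_k)  # Phi'(z)
        if abs(g) <= delta:
            break
        c = a * F_hess(z) + 1.0        # Phi''(z) >= 1, so Phi is 1-strongly convex
        # Closed-form degree-2 regularized step in 1-D: r solves c*r + (H/2)*r**2 = |g|.
        r = 2.0 * abs(g) / (c + math.sqrt(c * c + 2.0 * H * abs(g)))
        z -= math.copysign(r, g)
    return z

def inexact_proximal_method(x0, a=1.0, outer=60):
    """Outer loop: each subproblem is solved up to the accuracy delta_k = 1e-3 / k**2."""
    x = x0
    for k in range(1, outer + 1):
        x = inexact_prox_step(x, a, delta=1e-3 / k ** 2)
    return x

x_final = inexact_proximal_method(3.0)
x_star = math.log(0.3 / 0.7)  # exact minimizer: sigmoid(x) = 0.3
```

A fixed `a` keeps the sketch short; choosing the coefficient adaptively, so that each inner run starts inside the region of superlinear convergence, is what the analysis in this section relies on.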
Finally, we apply the following steps: We can estimate the required number of these iterations as follows.
Lemma 2 At every iteration $k \geq 0$ of the inexact proximal method, in order to achieve $\|\Phi'_{k+1}(z_t)\|_* \leq \delta_{k+1}$, it is enough to perform (5.9) steps of RCTM (5.8), where Proof According to (3.7), one step of RCTM (5.8) provides us with the following guarantee in terms of the subgradients of our objective $\Phi_{k+1}(\cdot)$: where we used in (3.7) the values $q = 2$, $\sigma_q = 1$, $a_{k+1} L_p$ for the Lipschitz constant of the $p$th derivative of the smooth part of $\Phi_{k+1}$, and $H = a_{k+1} p L_p$.
Denote β ≡ it holds $\|\Phi'_{k+1}(z_t)\|_* \leq \delta_{k+1}$. To finish the proof, let us estimate $\|F'(x_k)\|_*$ from above. We have (5.13) Thus, for every $1 \leq i \leq k$ it holds , and $\rho \equiv \frac{p-1}{p}$. Therefore, Substitution of this bound into (5.12) gives (5.9). ⊓ ⊔
Let us prove now the rate of convergence for the outer iterations. This is a direct consequence of Theorem 5 and the choice (5.7) of the coefficients $\{a_k\}_{k \geq 1}$.

Lemma 3 Let for a given
Then for every $1 \leq k \leq K$, we have where Proof Using the inequality between the arithmetic and geometric means, we obtain , where the first inequality holds by convexity. At the same time, we have Since the local convergence of RCTM is very fast (5.9), we can choose the inner accuracies $\{\delta_i\}_{i \geq 1}$ small enough to have the right-hand side of (5.16) be of the order $\tilde{O}(1/k^{\frac{p+1}{2}})$. Let us present a precise statement.
⊓ ⊔ Note that we were able to justify the global performance of the scheme, using only the local convergence results for the inner method. It is interesting to compare our approach with the recent results on the path-following second-order methods [5].
We can drop the logarithmic components in the complexity bounds by using the hybrid proximal methods (see [16] and [15]), where at each iteration only one step of RCTM is performed. The resulting rate of convergence there is $O(1/k^{\frac{p+1}{2}})$, without any extra logarithmic factors. However, this rate is worse than the rate $O(1/k^p)$ provided by Theorem 3 for the primal iterations of RCTM (3.1).