Minimizing Uniformly Convex Functions by Cubic Regularization of Newton Method

In this paper, we study the iteration complexity of cubic regularization of Newton method for solving composite minimization problems with uniformly convex objective. We introduce the notion of second-order condition number of a certain degree and justify the linear rate of convergence in a nondegenerate case for the method with an adaptive estimate of the regularization parameter. The algorithm automatically achieves the best possible global complexity bound among different problem classes of uniformly convex objective functions with Hölder continuous Hessian of the smooth part of the objective. As a byproduct of our developments, we justify an intuitively plausible result that the global iteration complexity of the Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

method. The following results provided a good perspective for the development of this approach, including the discovery of accelerated [14], adaptive [4,5] and universal [10] schemes. The latter methods can automatically adjust to the smoothness properties of the particular objective function. In the same vein, second-order algorithms for solving systems of nonlinear equations were developed in [13], and randomized variants for solving large-scale optimization problems were proposed in [7-9,12,18].
Despite a number of nice properties, the global complexity bounds of the cubically regularized Newton method for strongly convex and uniformly convex objectives are still not fully investigated, and neither is the notion of second-order nondegeneracy (see the discussion in Sect. 5 of [14]). We address this issue in the current paper.
The rest of the paper is organized as follows. Section 2 contains all necessary definitions and main properties of the classes of uniformly convex functions and twice-differentiable functions with Hölder continuous Hessian. We introduce the notion of the condition number $\gamma_f(\nu)$ of a certain degree $\nu \in [0, 1]$ and present some basic examples.
In Sect. 3, we describe a general regularized Newton scheme and show the linear rate of convergence for this method on the class of uniformly convex functions with a known degree ν ∈ [0, 1] of nondegeneracy. Then, we introduce the adaptive cubically regularized Newton method and collect useful inequalities and properties, which are related to this algorithm.
In Sect. 4, we study the global iteration complexity of the cubically regularized Newton method on the classes of uniformly convex functions with Hölder continuous Hessian. We show that for nondegeneracy of any degree $\nu \in [0, 1]$, which is formalized by the condition $\gamma_f(\nu) > 0$, the algorithm automatically achieves the linear rate of convergence with the value $\gamma_f(\nu)$ being the main complexity factor.
Finally, in Sect. 5 we compare our complexity bounds with the known bounds for other methods and discuss the results. In particular, we justify an intuitively plausible (but quite delayed) result that the global complexity of the cubically regularized Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

Uniformly Convex Functions with Hölder Continuous Hessian
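Let us start with some notation. In what follows, we denote by $E$ a finite-dimensional real vector space and by $E^*$ its dual space, which is the space of linear functions on $E$. The value of function $s \in E^*$ at point $x \in E$ is denoted by $\langle s, x \rangle$. Let us fix some linear self-adjoint positive-definite operator $B : E \to E^*$ and introduce the following Euclidean norms in the primal and dual spaces:
$$\|x\| := \langle Bx, x \rangle^{1/2}, \quad x \in E, \qquad \|s\|_* := \langle s, B^{-1} s \rangle^{1/2}, \quad s \in E^*.$$
For any linear operator $A : E \to E^*$, its norm is induced in a standard way:
$$\|A\| := \max_{x \in E} \{ \|Ax\|_* : \|x\| \le 1 \}.$$
Our goal is to solve the convex optimization problem in the composite form:
$$\min_{x \in \operatorname{dom} F} F(x) := f(x) + h(x), \tag{1}$$
where $f$ is a uniformly convex function, twice differentiable on its open domain, and $h$ is a simple closed convex function with $\operatorname{dom} h \subseteq \operatorname{dom} f$. Simple means that all auxiliary subproblems with an explicit presence of $h$ are easily solvable.
For a smooth function $f$, its gradient at point $x$ is denoted by $\nabla f(x) \in E^*$, and its Hessian is denoted by $\nabla^2 f(x) : E \to E^*$. For a convex but not necessarily differentiable function $h$, we denote by $\partial h(x) \subset E^*$ its subdifferential at the point $x \in \operatorname{dom} h$.
We say that a differentiable function $f$ is uniformly convex of degree $p \ge 2$ on a convex set $C \subseteq \operatorname{dom} f$ if for some constant $\sigma > 0$ it satisfies the inequality
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{p} \|y - x\|^p, \quad x, y \in C. \tag{2}$$
Uniformly convex functions of degree $p = 2$ are known as strongly convex. If inequality (2) holds with $\sigma = 0$, the function $f$ is called just convex. The following convenient condition is sufficient for function $f$ to be uniformly convex on a convex set $C \subseteq \operatorname{dom} f$:
Lemma 2.1 (Lemma 1 in [14]) Let for some $\sigma > 0$ and $p \ge 2$ the following inequality hold:
$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \sigma \|x - y\|^p, \quad x, y \in C. \tag{3}$$
Then, function $f$ is uniformly convex of degree $p$ on set $C$ with parameter $\sigma$.
From now on, we assume $C := \operatorname{dom} F \subseteq \operatorname{dom} f$. By the composite representation (1), we have for every $x \in \operatorname{dom} F$ and for all $F'(x) \in \partial F(x)$:
$$F(y) \ge F(x) + \langle F'(x), y - x \rangle + \frac{\sigma}{p} \|y - x\|^p, \quad y \in \operatorname{dom} F. \tag{4}$$
Therefore, if $\sigma > 0$, then we can have only one point $x^* \in \operatorname{dom} F$ with $F(x^*) = F^*$, which always exists for $F$ being uniformly convex and closed. A useful consequence of uniform convexity is the following upper bound for the residual, valid for all $x \in \operatorname{dom} F$ and $F'(x) \in \partial F(x)$:
$$F(x) - F^* \le \frac{p-1}{p} \Big( \frac{1}{\sigma} \Big)^{\frac{1}{p-1}} \|F'(x)\|_*^{\frac{p}{p-1}}. \tag{5}$$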
Proof In view of (4), bound (5) follows as in the proof of Lemma 3 in [14].
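It is reasonable to define the best possible constant $\sigma$ in inequality (3) for a certain degree $p$. This leads us to a system of constants:
$$\sigma_f(p) := \inf_{x, y \in \operatorname{dom} F,\ x \ne y} \frac{\langle \nabla f(x) - \nabla f(y), x - y \rangle}{\|x - y\|^p}, \quad p \ge 2. \tag{6}$$
We prefer to use inequality (3) for the definition of $\sigma_f(p)$, instead of (2), because of its symmetry in $x$ and $y$. Note that the value $\sigma_f(p)$ also depends on the domain of $F$. However, we omit this dependence in our notation since it is always clear from the context. It is easy to see that the univariate function $\sigma_f(\cdot)$ is log-concave. Thus, for all $p_2 > p_1 \ge 2$ we have:
$$\sigma_f(p) \ge \big[\sigma_f(p_1)\big]^{\frac{p_2 - p}{p_2 - p_1}} \big[\sigma_f(p_2)\big]^{\frac{p - p_1}{p_2 - p_1}}, \quad p \in [p_1, p_2]. \tag{7}$$
For a twice-differentiable function $f$, we say that it has Hölder continuous Hessian of degree $\nu \in [0, 1]$ on a convex set $C \subseteq \operatorname{dom} f$, if for some constant $H$, it holds:
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H \|x - y\|^{\nu}, \quad x, y \in C. \tag{8}$$
Two simple consequences of (8) are as follows:
$$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \le \frac{H \|x - y\|^{1+\nu}}{1 + \nu}, \tag{9}$$
$$|f(y) - Q(x; y)| \le \frac{H \|x - y\|^{2+\nu}}{(1 + \nu)(2 + \nu)}, \tag{10}$$
where $Q(x; y)$ is the quadratic model of $f$ at the point $x$:
$$Q(x; y) := f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \langle \nabla^2 f(x)(y - x), y - x \rangle.$$
In order to characterize the level of smoothness of function $f$ on the set $C := \operatorname{dom} F$, let us define the system of Hölder constants (see [10]):
$$H_f(\nu) := \sup_{x, y \in \operatorname{dom} F,\ x \ne y} \frac{\|\nabla^2 f(x) - \nabla^2 f(y)\|}{\|x - y\|^{\nu}}, \quad \nu \in [0, 1]. \tag{11}$$
We allow $H_f(\nu)$ to be equal to $+\infty$ for some $\nu$. Note that function $H_f(\cdot)$ is log-convex. Thus, any $0 \le \nu_1 < \nu_2 \le 1$ such that $H_f(\nu_i) < +\infty$, $i = 1, 2$, provide us with the following upper bounds for the whole interval:
$$H_f(\nu) \le \big[H_f(\nu_1)\big]^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \big[H_f(\nu_2)\big]^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2]. \tag{12}$$
Let us give an example of a function which has Hölder continuous Hessian for all $\nu \in [0, 1]$.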

Example 2.1
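For given $a_i \in E^*$, $1 \le i \le m$, consider the following convex function:
$$f(x) = \ln \Big( \sum_{i=1}^m e^{\langle a_i, x \rangle} \Big), \quad x \in E.$$
Let us fix the Euclidean norm $\|x\| = \langle Bx, x \rangle^{1/2}$, $x \in E$, with operator $B := \sum_{i=1}^m a_i a_i^*$. Without loss of generality, we assume that $B \succ 0$ (otherwise we can reduce the dimension of the problem). Then,
$$H_f(0) \le 1, \qquad H_f(1) \le 2.$$
Therefore, by (12) we get, for any $\nu \in [0, 1]$:
$$H_f(\nu) \le 2^{\nu}.$$
Indeed, denote $\pi_i(x) := e^{\langle a_i, x \rangle} / \sum_{j=1}^m e^{\langle a_j, x \rangle}$. Let us fix arbitrary $x, y \in E$ and direction $h \in E$. Then, straightforward computation gives:
$$\langle \nabla f(x), h \rangle = \sum_{i=1}^m \pi_i(x) \langle a_i, h \rangle, \qquad \langle \nabla^2 f(x) h, h \rangle = \sum_{i=1}^m \pi_i(x) \langle a_i, h \rangle^2 - \langle \nabla f(x), h \rangle^2 \le \sum_{i=1}^m \langle a_i, h \rangle^2 = \|h\|^2.$$
Hence, we get $\|\nabla^2 f(x)\| \le 1$. Since all Hessians of function $f$ are positive semidefinite, we conclude that $H_f(0) \le 1$. Inequality $H_f(1) \le 2$ can be easily obtained from the following representation of the third derivative:
$$D^3 f(x)[h, h, h] = \sum_{i=1}^m \pi_i(x) \big( \langle a_i, h \rangle - \langle \nabla f(x), h \rangle \big)^3.$$
Let us imagine now that we want to describe the iteration complexity of some method, which solves the composite optimization problem (1) up to an absolute accuracy $\varepsilon > 0$ in the function value. We assume that the smooth part $f$ of its objective is uniformly convex and has Hölder continuous Hessian. Which degrees $p$ and $\nu$ should be used in our analysis? Suppose that, for the number of calls of the oracle, we are interested in obtaining a polynomial-time bound of the form:
$$O\Big( \frac{(H_f(\nu))^{\alpha}}{(\sigma_f(p))^{\beta}} \cdot \log \frac{F(x_0) - F^*}{\varepsilon} \Big), \quad \alpha, \beta \ne 0.$$
While $x$ and $f(x)$ can be measured in arbitrary physical quantities, the value "number of iterations" cannot have physical dimension. Since $H_f(\nu)$ is measured in units of $[f]/[x]^{2+\nu}$ and $\sigma_f(p)$ in units of $[f]/[x]^{p}$, this leads to the following relations:
$$\alpha = \beta, \qquad \alpha (2 + \nu) = \beta p.$$
Therefore, despite the fact that our function can belong to several problem classes simultaneously, from the physical point of view only one option is available:
$$p = 2 + \nu.$$
Hence, for a twice-differentiable convex function $f$ with $\inf_{\nu \in [0,1]} H_f(\nu) > 0$, we can define only one meaningful condition number of degree $\nu \in [0, 1]$:
$$\gamma_f(\nu) := \frac{\sigma_f(2 + \nu)}{H_f(\nu)}. \tag{13}$$
If for some particular $\nu$ we have $H_f(\nu) = +\infty$, then by our definition $\gamma_f(\nu) = 0$. It will be shown that the condition number $\gamma_f(\nu)$ serves as the main factor in the global iteration complexity bounds for the regularized Newton method as applied to problem (1). Let us prove that this number cannot be big.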
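In the case when $\operatorname{dom} F$ is unbounded, $\sup_{x \in \operatorname{dom} F} \|x\| = +\infty$, we have
$$\gamma_f(\nu) \le \frac{1}{1 + \nu}, \quad \nu \in (0, 1]. \tag{15}$$
Proof Indeed, for any $x, y \in \operatorname{dom} F$, $x \ne y$, we have:
$$\sigma_f(2 + \nu) \le \frac{\langle \nabla f(y) - \nabla f(x), y - x \rangle}{\|y - x\|^{2+\nu}} \le \frac{\|\nabla^2 f(x)\|}{\|y - x\|^{\nu}} + \int_0^1 \tau^{\nu} \frac{\|\nabla^2 f(x + \tau(y - x)) - \nabla^2 f(x)\|}{\|\tau (y - x)\|^{\nu}} \, d\tau.$$
Now, dividing both sides of this inequality by $H_f(\nu)$, we get inequality (14) from the definition of $H_f(\nu)$ (11). Inequality (15) can be obtained by taking the limit $\|y\| \to +\infty$.
From inequalities (7) and (12), we can get the following lower bound:
$$\gamma_f(\nu) \ge \big[\gamma_f(\nu_1)\big]^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \big[\gamma_f(\nu_2)\big]^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2],$$
where $0 \le \nu_1 < \nu_2 \le 1$. However, it turns out that in the unbounded case we can have a nonzero condition number $\gamma_f(\nu)$ only for a single degree.
Proof Consider first the case $\alpha > \nu$. From the condition $\gamma_f(\nu) > 0$, we conclude that $H_f(\nu) < +\infty$. Then, for any $x, y \in \operatorname{dom} F$ we have:
$$\sigma_f(2 + \alpha) \|y - x\|^{2+\alpha} \le \langle \nabla f(x) - \nabla f(y), x - y \rangle \le \|\nabla^2 f(y)\| \cdot \|y - x\|^2 + \frac{H_f(\nu)}{1 + \nu} \|y - x\|^{2+\nu}.$$
Dividing both sides of this inequality by $\|y - x\|^{2+\alpha}$ and letting $\|x\| \to +\infty$, we get $\sigma_f(2 + \alpha) = 0$.
Let us look now at an important example of a uniformly convex function with Hölder continuous Hessian. It is convenient to start with some properties of powers of the Euclidean norm.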
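Returning to Example 2.1, the Hölder bounds can be sanity-checked numerically. The sketch below assumes that the function of the example is the log-sum-exp function $f(x) = \ln \sum_i e^{\langle a_i, x \rangle}$ and that interpolating $H_f(0) \le 1$ and $H_f(1) \le 2$ by (12) gives $H_f(\nu) \le 2^{\nu}$. The helper names (`logsumexp_hessian`, `b_operator_norm`, `max_holder_ratio`) and the random sampling are illustrative choices only; sampling gives a lower estimate of the true supremum, so the check can refute the bound but not prove it.

```python
import numpy as np

def logsumexp_hessian(A, x):
    """Hessian of f(x) = log(sum_i exp(<a_i, x>)); the rows of A are the a_i."""
    z = A @ x
    w = np.exp(z - z.max())
    w /= w.sum()                          # softmax weights pi_i(x)
    g = A.T @ w                           # gradient of f at x
    return A.T @ (w[:, None] * A) - np.outer(g, g)

def b_operator_norm(M, B_inv_sqrt):
    """Norm of a symmetric M: E -> E* induced by ||x|| = <Bx, x>^{1/2}."""
    return np.abs(np.linalg.eigvalsh(B_inv_sqrt @ M @ B_inv_sqrt)).max()

def max_holder_ratio(A, nu, trials=200, seed=0):
    """Largest sampled value of ||H(x) - H(y)|| / ||x - y||^nu, with B = sum_i a_i a_i^T."""
    rng = np.random.default_rng(seed)
    B = A.T @ A
    evals, evecs = np.linalg.eigh(B)      # assumes B is positive definite
    B_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    worst = 0.0
    for _ in range(trials):
        x, y = rng.normal(size=(2, A.shape[1]))
        d = x - y
        dist = np.sqrt(d @ B @ d)
        diff = logsumexp_hessian(A, x) - logsumexp_hessian(A, y)
        worst = max(worst, b_operator_norm(diff, B_inv_sqrt) / dist ** nu)
    return worst

A = np.random.default_rng(1).normal(size=(6, 3))
for nu in (0.0, 0.5, 1.0):
    print(nu, max_holder_ratio(A, nu), "<=", 2.0 ** nu)
```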

Lemma 2.5
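For fixed real $p \ge 1$, consider the following function:
$$f_p(x) = \frac{1}{p} \|x\|^p, \quad x \in E.$$
1. If $p \ge 2$, then function $f_p(\cdot)$ is uniformly convex of degree $p$:
$$\langle \nabla f_p(x) - \nabla f_p(y), x - y \rangle \ge 2^{2-p} \|x - y\|^p, \quad x, y \in E. \tag{16}$$
2. If $1 \le p \le 2$, then function $f_p(\cdot)$ has $\nu$-Hölder continuous gradient with $\nu = p - 1$:
$$\|\nabla f_p(x) - \nabla f_p(y)\|_* \le 2^{1-\nu} \|x - y\|^{\nu}, \quad x, y \in E. \tag{17}$$
Proof Firstly, recall two useful inequalities, which are valid for all $a, b \ge 0$:
$$|a^{\alpha} - b^{\alpha}| \le |a - b|^{\alpha} \ \text{ for } 0 \le \alpha \le 1, \qquad |a^{\alpha} - b^{\alpha}| \ge |a - b|^{\alpha} \ \text{ for } \alpha \ge 1.$$
Let us fix arbitrary $x, y \in E$. The left-hand side of inequality (16) equals
$$\|x\|^p + \|y\|^p - \langle Bx, y \rangle \big( \|x\|^{p-2} + \|y\|^{p-2} \big),$$
and we need to verify that it is bigger than
$$2^{2-p} \big[ \|x\|^2 + \|y\|^2 - 2 \langle Bx, y \rangle \big]^{p/2}.$$
The case $x = 0$ or $y = 0$ is trivial. Therefore, assume $x \ne 0$ and $y \ne 0$. Denoting $\tau := \frac{\|y\|}{\|x\|}$, $r := \frac{\langle Bx, y \rangle}{\|x\| \cdot \|y\|}$, we have the following statement to prove:
$$1 + \tau^p \ge r \tau (1 + \tau^{p-2}) + 2^{2-p} \big[ 1 + \tau^2 - 2 r \tau \big]^{p/2}, \quad \tau > 0, \ |r| \le 1.$$
Since the function in the right-hand side is convex in $r$, we need to check only two marginal cases:
$$r = 1: \quad (1 - \tau)(1 - \tau^{p-1}) \ge 2^{2-p} |1 - \tau|^p, \qquad r = -1: \quad (1 + \tau)(1 + \tau^{p-1}) \ge 2^{2-p} (1 + \tau)^p.$$
The first one follows from the second of the two inequalities above with $\alpha = p - 1$, and the second one follows from the convexity of the function $t \mapsto t^{p-1}$, $t \ge 0$. Thus, we have proved (16). Let us prove the second statement for $1 < p \le 2$. Consider the function
$$\hat{f}_q(s) := \frac{1}{q} \|s\|_*^q, \quad s \in E^*, \qquad q := \frac{p}{p-1} \ge 2.$$
In view of our first statement, we have:
$$\langle s_1 - s_2, \nabla \hat{f}_q(s_1) - \nabla \hat{f}_q(s_2) \rangle \ge 2^{2-q} \|s_1 - s_2\|_*^q, \quad s_1, s_2 \in E^*.$$
For arbitrary $x_1, x_2 \in E$, denote $s_i := \nabla f_p(x_i) = \frac{B x_i}{\|x_i\|^{2-p}}$, $i = 1, 2$. Then $\|s_i\|_* = \|x_i\|^{p-1}$, and consequently, $x_i = \|s_i\|_*^{q-2} B^{-1} s_i = \nabla \hat{f}_q(s_i)$. Therefore,
$$\|s_1 - s_2\|_* \cdot \|x_1 - x_2\| \ge \langle s_1 - s_2, x_1 - x_2 \rangle \ge 2^{2-q} \|s_1 - s_2\|_*^q,$$
which is (17) since $\frac{1}{q-1} = p - 1$ and $\frac{q-2}{q-1} = 2 - p$.
For the integer values of $p$, this inequality was proved in [14].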
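Let us prove now that $H_f(\nu) \le (1 + \nu) 2^{1-\nu}$ for $p = 2 + \nu$ with some $\nu \in (0, 1]$. This is
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le (1 + \nu) 2^{1-\nu} \|x - y\|^{\nu}, \quad x, y \in E. \tag{21}$$
The corresponding Hessians can be represented as follows:
$$\nabla^2 f(x) = \|x\|^{\nu} B + \frac{\nu B x x^* B}{\|x\|^{2-\nu}}, \quad x \ne 0, \qquad \nabla^2 f(0) = 0.$$
For the case $x = y = 0$, inequality (21) is trivial. Assume now that $x \ne 0$. If $0 \in [x, y]$, then $y = -\beta x$ for some $\beta \ge 0$ and we have:
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| = |1 - \beta^{\nu}| \cdot \|\nabla^2 f(x)\| = |1 - \beta^{\nu}| (1 + \nu) \|x\|^{\nu} \le (1 + \nu)(1 + \beta)^{\nu} \|x\|^{\nu} = (1 + \nu) \|x - y\|^{\nu},$$
which is (21). Let $0 \notin [x, y]$. For an arbitrary fixed direction $h \in E$, we get:
$$\langle (\nabla^2 f(x) - \nabla^2 f(y)) h, h \rangle = \big( \|x\|^{\nu} - \|y\|^{\nu} \big) \|h\|^2 + \nu \Big( \frac{\langle Bx, h \rangle^2}{\|x\|^{2-\nu}} - \frac{\langle By, h \rangle^2}{\|y\|^{2-\nu}} \Big).$$
Consider the points $u = \frac{Bx}{\|x\|^{1-\nu}} = \nabla f_q(x)$ and $v = \frac{By}{\|y\|^{1-\nu}} = \nabla f_q(y)$ with $q = 1 + \nu$. Then, Therefore, Let us estimate the right-hand side of (22) from above. Consider a continuously differentiable univariate function: Note that Thus, we have: It remains to use the definition of $u$ and $v$ and apply inequality (17) with $p = q$. Thus, we have proved that for $p = 2 + \nu$ the Hessian of $f$ is Hölder continuous of degree $\nu$. At the same time, taking $y = 0$, we get
$$\|\nabla^2 f(x) - \nabla^2 f(0)\| = \|\nabla^2 f(x)\| = (1 + \nu) \|x\|^{\nu}.$$
These values cannot be uniformly bounded in $x \in E$ by any multiple of $\|x\|^{\alpha}$ with $\alpha \ne \nu$. So, the Hessian of $f$ is not Hölder continuous for any degree different from $\nu$.

Remark 2.1 Inequalities (16) and (17) have the following symmetric consequences:
$$p \ge 2 \ \Rightarrow \ \|\nabla f_p(x) - \nabla f_p(y)\|_* \ge 2^{2-p} \|x - y\|^{p-1}, \qquad 1 \le p \le 2 \ \Rightarrow \ \langle \nabla f_p(x) - \nabla f_p(y), x - y \rangle \le 2^{2-p} \|x - y\|^p,$$
which are valid for all $x, y \in E$.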
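The two inequalities for powers of the Euclidean norm are easy to probe numerically. The following sketch assumes $B = I$ (the standard Euclidean norm) and the constant $2^{2-p}$ in both (16) and (17); the function names and the sampling scheme are illustrative, and a passing run is only a sanity check, not a proof.

```python
import numpy as np

def grad_power_norm(x, p):
    """Gradient of f_p(x) = ||x||^p / p for the standard Euclidean norm (B = I)."""
    r = np.linalg.norm(x)
    return np.zeros_like(x) if r == 0.0 else r ** (p - 2) * x

def check_power_norm_inequalities(p, dim=5, trials=1000, seed=0):
    """Return True if no sampled pair violates (16) for p >= 2 or (17) for p <= 2."""
    rng = np.random.default_rng(seed)
    c = 2.0 ** (2 - p)
    for _ in range(trials):
        x, y = rng.normal(size=(2, dim))
        d = x - y
        dg = grad_power_norm(x, p) - grad_power_norm(y, p)
        dist = np.linalg.norm(d)
        # Uniform convexity of degree p (assumed form of inequality (16)).
        if p >= 2 and dg @ d < c * dist ** p * (1 - 1e-12):
            return False
        # Hoelder continuity of the gradient (assumed form of inequality (17)).
        if p <= 2 and np.linalg.norm(dg) > c * dist ** (p - 1) * (1 + 1e-12):
            return False
    return True

for p in (1.0, 1.5, 2.0, 2.5, 3.0, 4.0):
    print(p, check_power_norm_inequalities(p))
```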

Regularized Newton Method
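Let us start from the case when we know that for a specific $\nu \in [0, 1]$ function $f$ has Hölder continuous Hessian: $H_f(\nu) < +\infty$. Then, from (10), we have the global upper bound for the objective function:
$$F(y) \le Q(x; y) + \frac{H_f(\nu) \|y - x\|^{2+\nu}}{(1 + \nu)(2 + \nu)} + h(y), \quad x, y \in \operatorname{dom} F.$$
Thus, it is natural to employ the minimum of a regularized quadratic model:
$$M_{\nu,H}(x; y) := Q(x; y) + \frac{H \|y - x\|^{2+\nu}}{(1 + \nu)(2 + \nu)} + h(y), \qquad M^*_{\nu,H}(x) := \min_{y} M_{\nu,H}(x; y), \qquad T_{\nu,H}(x) := \operatorname*{argmin}_{y} M_{\nu,H}(x; y),$$
and define the following general iteration process [10]:
$$x_{k+1} := T_{\nu,H_k}(x_k), \quad k \ge 0, \tag{24}$$
where the value $H_k$ is chosen either to be a constant from the interval $[0, 2 H_f(\nu)]$ or by some adaptive procedure. For the class of uniformly convex functions of degree $p = 2 + \nu$, we can justify the following global convergence result for this process.
Theorem 3.1 Assume that for some $\nu \in [0, 1]$ we have $0 < H_f(\nu) < +\infty$ and $\sigma_f(2 + \nu) > 0$. Let the coefficients $\{H_k\}_{k \ge 0}$ in the process (24) satisfy the following conditions:
$$0 \le H_k \le \beta H_f(\nu), \qquad F(x_{k+1}) \le M^*_{\nu,H_k}(x_k), \tag{25}$$
with some constant $\beta \ge 0$. Then, for the sequence $\{x_k\}_{k \ge 0}$ generated by the process we have:
$$F(x_{k+1}) - F^* \le \Big( 1 - \frac{1 + \nu}{2 + \nu} \cdot \min \Big\{ \Big( \frac{(1 + \nu) \gamma_f(\nu)}{(1 + \beta)(2 + \nu)} \Big)^{\frac{1}{1+\nu}}, 1 \Big\} \Big) \big( F(x_k) - F^* \big). \tag{26}$$
Thus, the rate of convergence is linear and for reaching the gap $F(x_K) - F^* \le \varepsilon$ it is enough to perform
$$K = \Big\lceil \frac{2 + \nu}{1 + \nu} \cdot \max \Big\{ \Big( \frac{(1 + \beta)(2 + \nu)}{(1 + \nu) \gamma_f(\nu)} \Big)^{\frac{1}{1+\nu}}, 1 \Big\} \cdot \log \frac{F(x_0) - F^*}{\varepsilon} \Big\rceil$$
iterations.
Proof As in the proof of Theorem 3.1 in [10], from (25) one can see that
$$F(x_{k+1}) \le F(x_k) - \alpha \big( F(x_k) - F^* \big) + \alpha^{2+\nu} \frac{(1 + \beta) H_f(\nu) \|x_k - x^*\|^{2+\nu}}{(1 + \nu)(2 + \nu)}$$
for any $\alpha \in [0, 1]$. Then, taking into account the uniform convexity (4), we get
$$F(x_{k+1}) \le F(x_k) - \Big( \alpha - \alpha^{2+\nu} \frac{1 + \beta}{(1 + \nu) \gamma_f(\nu)} \Big) \big( F(x_k) - F^* \big).$$
The minimum of the right-hand side is attained at $\alpha^* = \min \Big\{ \Big( \frac{(1 + \nu) \gamma_f(\nu)}{(1 + \beta)(2 + \nu)} \Big)^{\frac{1}{1+\nu}}, 1 \Big\}$. Plugging this value into the bound above, we get inequality (26).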
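Unfortunately, in practice it is difficult to decide on an appropriate value of $\nu \in [0, 1]$ with $H_f(\nu) < +\infty$. Therefore, it is interesting to develop universal methods which are not based on some particular parameters. Recently, it was shown [10] that one good choice for such a universal scheme is the cubic regularization of the Newton method [17]. This is actually the process (24) with the fixed parameter $\nu = 1$, combined with an adaptive search for the regularization parameter.

Algorithm 1
Initialization: choose $x_0 \in \operatorname{dom} F$ and $H_0 > 0$.
Iteration $k \ge 0$:
1: Find the minimal integer $i_k \ge 0$ such that $F(T_{H_k 2^{i_k}}(x_k)) \le M^*_{H_k 2^{i_k}}(x_k)$.
2: Perform the Cubic Step: $x_{k+1} = T_{H_k 2^{i_k}}(x_k)$.
3: Set $H_{k+1} := 2^{i_k - 1} H_k$.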
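The adaptive scheme can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions that the text does not make: there is no composite part ($h \equiv 0$), $B = I$, and the Hessian is positive semidefinite and available as a dense matrix. The helper names `cubic_step` and `adaptive_cubic_newton`, the bisection used for the cubic subproblem, the stopping tolerance, the rounding slack in the acceptance test and the cap on the number of doublings are all illustrative choices.

```python
import numpy as np

def cubic_step(g, A, H):
    """Minimize <g, s> + 0.5 <A s, s> + (H / 6) ||s||^3 for symmetric PSD A and H > 0."""
    if np.linalg.norm(g) == 0.0:
        return np.zeros_like(g)
    lam, V = np.linalg.eigh(A)
    lam = np.maximum(lam, 0.0)            # clip rounding noise in the spectrum
    gt = V.T @ g
    # Stationarity gives s = -(A + (H r / 2) I)^{-1} g with r = ||s||;
    # r is the root of phi(r) = ||s(r)|| - r, and ||s(r)|| decreases in r.
    step_norm = lambda r: np.linalg.norm(gt / (lam + 0.5 * H * r))
    lo, hi = 0.0, 1.0
    while step_norm(hi) > hi:
        hi *= 2.0
    for _ in range(100):                  # bisection on r
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > mid:
            lo = mid
        else:
            hi = mid
    return -V @ (gt / (lam + 0.5 * H * hi))

def adaptive_cubic_newton(f, grad, hess, x0, H0=1.0, max_iters=100, tol=1e-10):
    x, H = np.asarray(x0, dtype=float), H0
    for _ in range(max_iters):
        g, A, fx = grad(x), hess(x), f(x)
        if np.linalg.norm(g) <= tol:
            break
        slack = 1e-12 * max(1.0, abs(fx))     # tolerance for rounding errors
        Hi = H
        for _ in range(60):                   # line 1: try H_k, 2 H_k, 4 H_k, ...
            s = cubic_step(g, A, Hi)
            model = fx + g @ s + 0.5 * s @ (A @ s) + Hi / 6.0 * np.linalg.norm(s) ** 3
            if f(x + s) <= model + slack:     # acceptance test F(T) <= M*
                break
            Hi *= 2.0
        x, H = x + s, 0.5 * Hi                # lines 2 and 3: step, then halve
    return x
```

For instance, applied to $f(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2 - \langle b, x \rangle$ with $b = (2, 0, 0)$, whose minimizer is $(1, 0, 0)$, the iterates approach the minimizer from an arbitrary starting point without any prior knowledge of the smoothness constants.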
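Let us present the main properties of the composite Cubic Newton step $x \mapsto T_H(x)$. Denote
$$M_H(x; y) := Q(x; y) + \frac{H}{6} \|y - x\|^3 + h(y), \qquad T_H(x) := \operatorname*{argmin}_{y} M_H(x; y), \qquad M^*_H(x) := M_H(x; T_H(x)).$$
Since point $T = T_H(x)$ is a minimum of the strictly convex function $M_H(x; \cdot)$, it satisfies the following first-order optimality condition:
$$\Big\langle \nabla f(x) + \nabla^2 f(x)(T - x) + \frac{H}{2} \|T - x\| B (T - x), y - T \Big\rangle + h(y) \ge h(T), \quad y \in \operatorname{dom} F. \tag{27}$$
In other words, the vector
$$h'(T) := -\nabla f(x) - \nabla^2 f(x)(T - x) - \frac{H}{2} \|T - x\| B (T - x) \tag{28}$$
belongs to the subdifferential of $h$: $h'(T) \in \partial h(T)$. Computation of a point $T = T_H(x)$, satisfying condition (28), requires some standard techniques of Convex Optimization and Linear Algebra (see [1,3,16,17]). The arithmetical complexity of such a procedure is usually similar to that of the standard Newton step. Plugging $y := x \in \operatorname{dom} F$ into (27), we get:
$$\langle \nabla f(x), x - T \rangle \ge \langle \nabla^2 f(x)(T - x), T - x \rangle + \frac{H}{2} \|T - x\|^3 + h(T) - h(x).$$
Thus, we obtain the following bound for the minimal value $M^*_H(x)$ of the cubic model:
$$M^*_H(x) \le F(x) - \frac{1}{2} \langle \nabla^2 f(x)(T - x), T - x \rangle - \frac{H}{3} \|T - x\|^3. \tag{29}$$
If for some value $\nu \in [0, 1]$ the Hessian is Hölder continuous, $H_f(\nu) < +\infty$, then by (9) and (28) we get the following bound for the subgradient $F'(T) := \nabla f(T) + h'(T)$ at the new point:
$$\|F'(T)\|_* \le \frac{H_f(\nu) \|T - x\|^{1+\nu}}{1 + \nu} + \frac{H \|T - x\|^2}{2}. \tag{30}$$
One of the main strengths of the classical Newton method is its local quadratic convergence for the class of strongly convex functions with Lipschitz continuous Hessian: $\sigma_f(2) > 0$ and $0 < H_f(1) < +\infty$ (see, for example, [15]). This property holds for the cubically regularized Newton method as well [14,17]. Indeed, ensuring $F(T_H(x)) \le M^*_H(x)$ as in Algorithm 1, and having $H \le \beta H_f(1)$ with some $\beta \ge 0$, we get from (5), (30) and (29):
$$F(T_H(x)) - F^* \le \frac{1}{2 \sigma_f(2)} \|F'(T_H(x))\|_*^2 \le \frac{(1 + \beta)^2 H_f^2(1)}{8 \sigma_f(2)} \|T_H(x) - x\|^4 \le \frac{(1 + \beta)^2 H_f^2(1)}{2 \sigma_f^3(2)} \big( F(x) - F^* \big)^2.$$
And the region of quadratic convergence is as follows:
$$\Big\{ x \in \operatorname{dom} F : F(x) - F^* \le \frac{2 \sigma_f^3(2)}{(1 + \beta)^2 H_f^2(1)} \Big\}.$$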
After reaching it, the method starts to double the right digits of the answer at every step, and this cannot last for a long time. Therefore, from now on we are mainly interested in the global complexity bounds of Algorithm 1, which work for an arbitrary starting point $x_0$. For the noncomposite case, as it was shown in [10], if for some $\nu \in [0, 1]$ we have $0 < H_f(\nu) < +\infty$ and the objective is just convex, then Algorithm 1 with a small initial parameter $H_0$ generates a solution $\hat{x}$ with $f(\hat{x}) - f^* \le \varepsilon$ in at most
$$O\Big( \Big( \frac{H_f(\nu) D_0^{2+\nu}}{\varepsilon} \Big)^{\frac{1}{1+\nu}} \Big)$$
iterations, where $D_0 := \max_{x} \{ \|x - x^*\| : f(x) \le f(x_0) \}$. Thus, the method in [10] has a sublinear rate of convergence on the class of convex functions with Hölder continuous Hessian. It can automatically adapt to the actual level of smoothness. In what follows we show that the same algorithm achieves a linear rate of convergence for the class of uniformly convex functions of degree $p = 2 + \nu$, namely for functions with strictly positive condition number: $\sup_{\nu \in [0, 1]} \gamma_f(\nu) > 0$.
In the remaining part of the paper, we usually assume that the smooth part of our objective is not purely quadratic. This is equivalent to the condition $\inf_{\nu \in [0,1]} H_f(\nu) > 0$. However, to conclude this section, let us briefly discuss the case $\min_{\nu \in [0,1]} H_f(\nu) = 0$. If we knew in advance that $f$ is a convex quadratic function, then no regularization would be needed, since a single step $x \mapsto T_H(x)$ with $H := 0$ solves the problem. However, if our function is given by a black-box oracle and we do not know a priori that its smooth part is quadratic, then we can still use Algorithm 1. For this case, we prove the following simple result.
iterations of Algorithm 1.
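Proof In our case, the quadratic model coincides with the smooth part of the objective: $Q(x; y) \equiv f(y)$, $x, y \in E$. Therefore, at every iteration $k \ge 0$ of Algorithm 1 we have $i_k = 0$, $H_k = 2^{-k} H_0$ and $M_{H_k}(x_k; y) = F(y) + \frac{H_k}{6} \|y - x_k\|^3$, and
$$F(x_{k+1}) \le \min_{y} \Big\{ F(y) + \frac{2^{-k} H_0}{6} \|y - x_k\|^3 \Big\}. \tag{33}$$
Let us prove that $\|x_{k+1} - x^*\| \le \|x_k - x^*\|$ for all $k \ge 0$. If this is true, then plugging $y \equiv x^*$ into (33), we get $F(x_{k+1}) - F^* \le \frac{2^{-k} H_0}{6} \|x_0 - x^*\|^3$, which results in the estimate (32). Indeed,
$$\|x_k - x^*\|^2 = \|x_k - x_{k+1}\|^2 + 2 \langle B(x_k - x_{k+1}), x_{k+1} - x^* \rangle + \|x_{k+1} - x^*\|^2,$$
and it is enough to show that $\langle B(x_k - x_{k+1}), x^* - x_{k+1} \rangle \le 0$. Since $x_{k+1}$ satisfies the first-order optimality condition
$$F'(x_{k+1}) + \frac{H_k}{2} \|x_{k+1} - x_k\| B (x_{k+1} - x_k) = 0 \quad \text{for some } F'(x_{k+1}) \in \partial F(x_{k+1}),$$
we have:
$$\frac{H_k}{2} \|x_{k+1} - x_k\| \cdot \langle B(x_k - x_{k+1}), x^* - x_{k+1} \rangle = \langle F'(x_{k+1}), x^* - x_{k+1} \rangle \le 0,$$
where the last inequality follows from the convexity of the objective and the optimality of $x^*$.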

Complexity Results for Uniformly Convex Functions
In this section, we are going to justify the global linear rate of convergence of Algorithm 1 for a class of twice differentiable uniformly convex functions with Hölder continuous Hessian. Universality of this method is ensured by the adaptive estimation of the parameter H over the whole sequence of iterations. It is important to distinguish two cases: H k+1 < H k and H k+1 ≥ H k . First, we need to estimate the progress in the objective function after minimizing the cubic model. There are two different situations here: either Then, for arbitrary x ∈ dom F and H > 0 we have: Finally, by the smoothness and the uniform convexity, we obtain: We are ready to prove the main result of this paper.
where $\kappa_f(\nu)$ is defined by (37). Let the sequence $\{x_k\}_{k=0}^{K}$ generated by the method satisfy condition: Then, for every $0 \le k \le K - 1$, we have: Therefore, the rate of convergence is linear, and Moreover, we have the following bound for the total number of oracle calls $N_K$ during the first $K$ iterations:
Proof The proof is based on Lemmas 4.1 and 4.2, and monotonicity of the sequence $\{F(x_k)\}_{k \ge 0}$. Firstly, we need to show that every iteration of the method is well-defined. Namely, we are going to verify that for a fixed $0 \le k \le K - 1$, there exists a finite integer $\ell \ge 0$ such that either $F(T_{H_k 2^{\ell}}(x_k)) \le M^*_{H_k 2^{\ell}}(x_k)$ or $F(T_{H_k 2^{\ell+1}}(x_k)) - F^* < \varepsilon$. Indeed, let us set Then, if we have both $F(T_H(x_k)) > M^*_H(x_k)$ and $F(T_{2H}(x_k)) - F^* \ge \varepsilon$, we get by Lemma 4.2: which contradicts (46). Therefore, if we are unable to find the value $0 \le i_k \le \ell$ (see line 1 of Algorithm 1) in a finite number of steps, that only means we have already solved the problem up to accuracy $\varepsilon$. Now, let us show that for every $0 \le k \le K$ it holds: This inequality is obviously valid for $k = 0$. Assume it is also valid for some $k \ge 0$. Then, by definition of $H_{k+1}$ (see line 3 of Algorithm 1), we have $H_{k+1} = 2^{i_k - 1} H_k$. There are two cases. 1) $i_k = 0$. Then, $H_{k+1} < H_k$. By monotonicity of $\{F(x_k)\}_{k \ge 0}$ and by induction, we get:
Finally, let us estimate the total number of oracle calls $N_K$ during the first $K$ iterations. At each iteration, the oracle is called $i_k + 1$ times, and we have $H_{k+1} = 2^{i_k - 1} H_k$. Therefore,
$$N_K = \sum_{k=0}^{K-1} (i_k + 1) = \sum_{k=0}^{K-1} \Big( \log_2 \frac{H_{k+1}}{H_k} + 2 \Big) = 2K + \log_2 H_K - \log_2 H_0.$$
Note that condition (42) for the initial choice of $H_0$ can be seen as a definition of the moment after which we can guarantee the linear rate of convergence (44). In practice, we can launch Algorithm 1 with arbitrary $H_0 > 0$. There are two possible options: either the method halves $H_k$ at every step in the beginning, so $H_k$ becomes small very quickly, or this value is increased at least once, and the required bound is guaranteed by Lemma 4.2. It can be easily proved that this initial phase requires no more than $K_0 = \log_2 \frac{H_0 \varepsilon^{(1-\nu)/(1+\nu)}}{\kappa_f(\nu)}$ oracle calls.
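The bookkeeping behind the oracle-call count can be illustrated with a few lines of Python. The sketch only replays the update rule stated above ($i_k + 1$ oracle calls per iteration and $H_{k+1} = 2^{i_k - 1} H_k$); the function name `count_oracle_calls` and the sample sequence of $i_k$ values are hypothetical.

```python
import math

def count_oracle_calls(H0, doublings):
    """Replay the parameter updates for a given sequence i_0, ..., i_{K-1}."""
    H, calls = H0, 0
    for i_k in doublings:
        calls += i_k + 1              # i_k rejected trials plus one accepted step
        H *= 2.0 ** (i_k - 1)         # H_{k+1} = 2^{i_k - 1} H_k
    return calls, H

doublings = [0, 3, 1, 0, 2]           # a made-up sequence of i_k values
K = len(doublings)
calls, H_K = count_oracle_calls(1.0, doublings)
# The total equals 2K + log2(H_K / H_0): the number of calls is controlled
# by the final value of the regularization parameter.
print(calls, 2 * K + math.log2(H_K / 1.0))
```

The identity shows why the adaptive search is cheap: on average, each iteration costs about two oracle calls, plus a logarithmic term that depends only on the ratio $H_K / H_0$.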

Discussion
Let us discuss the global complexity results provided by Theorem 4.1 for the Cubic Regularization of the Newton Method with the adaptive adjustment of the regularization parameter. For the class of twice continuously differentiable strongly convex functions with Lipschitz continuous gradients, $f \in S^{2,1}_{\mu,L}(\operatorname{dom} F)$, it is well known that the classical gradient descent method needs
$$O\Big( \frac{L}{\mu} \log \frac{F(x_0) - F^*}{\varepsilon} \Big)$$
iterations for computing an $\varepsilon$-solution of the problem (e.g., [15]). As it was shown in [6], this result is shared by a variant of Cubic Regularization of the Newton method. This is much better than the bound $O\big( \big( \frac{L}{\mu} \big)^2 \log \frac{F(x_0) - F^*}{\varepsilon} \big)$, known for the damped Newton method (e.g., [2]).
For the class of uniformly convex functions of degree p = 2 + ν having Hölder continuous Hessian of degree ν ∈ [0, 1], we have proved the following parametric estimates: O max γ f (ν)

Conclusions
In this work, we have introduced the second-order condition number of a certain degree, which serves as the main complexity factor for solving uniformly convex minimization problems with Hölder continuous Hessian of the objective by second-order optimization schemes.
We have proved that the cubically regularized Newton method with an adaptive estimation of the regularization parameter achieves a global linear rate of convergence on this class of functions. The algorithm does not require knowledge of any parameters of the problem class and automatically adapts to the best possible degree of nondegeneracy.
Using this technique, we have justified that the global iteration complexity of the cubic Newton method is always better than that of the gradient method for the standard class of strongly convex functions with uniformly bounded second derivative.
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.