Gradient regularization of Newton method with Bregman distances

In this paper, we propose the first second-order scheme based on arbitrary (possibly non-Euclidean) norms, incorporated through Bregman distances. They are introduced directly into the Newton iterate with a regularization parameter proportional to the square root of the norm of the current gradient. For the basic scheme, as applied to the composite convex optimization problem, we establish a global convergence rate of the order $O(k^{-2})$ both in terms of the functional residual and in terms of the norm of subgradients. Our main assumption on the smooth part of the objective is Lipschitz continuity of its Hessian. For uniformly convex functions of degree three, we justify a global linear rate, and for strongly convex functions we prove a local superlinear rate of convergence. Our approach can be seen as a relaxation of the Cubic Regularization of the Newton method (Nesterov and Polyak in Math Program 108(1):177–205, 2006) for convex minimization problems. This relaxation preserves the convergence properties and global complexities of the Cubic Newton method in the convex case, while the auxiliary subproblem at each iteration is simpler. We equip our method with an adaptive search procedure for choosing the regularization parameter. We also propose an accelerated scheme with convergence rate $O(k^{-3})$, where $k$ is the iteration counter.


Introduction
The classical Newton's method is a powerful tool for solving various optimization problems and for dealing with ill-conditioning. The practical implementation of this method for solving the unconstrained minimization problem $\min_x f(x)$ can be written as follows:
$$ x_{k+1} \;=\; x_k - \alpha_k \big[\nabla^2 f(x_k)\big]^{-1} \nabla f(x_k), $$
where $0 < \alpha_k \le 1$ is a damping parameter. However, this approach has two serious drawbacks. Firstly, the next point is not well defined when the Hessian is not strictly positive definite. And secondly, while the method has a very fast local quadratic convergence, it is difficult to establish any global properties for this process. Indeed, for $\alpha_k = 1$ (the classical pure Newton method), there are known examples of problems for which the method does not converge globally (see, e.g., Example 1.4.3 in [1]). For the damped Newton method with line search, it is possible to prove some global convergence rates. But, typically, they are worse than the rates of the classical Gradient Method [4].
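For illustration, here is a minimal sketch of one damped Newton step in the Euclidean coordinate setting (the function and variable names are ours, not from the paper):

```python
import numpy as np

def damped_newton_step(grad_x, hess_x, x, alpha):
    """One damped Newton step: x_plus = x - alpha * [Hessian]^{-1} * gradient,
    with damping parameter 0 < alpha <= 1.  The linear solve breaks down when
    the Hessian is singular (and is meaningless when it is indefinite), which
    is precisely the first drawback discussed above."""
    return x - alpha * np.linalg.solve(hess_x, grad_x)
```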
A breakthrough in second-order optimization theory was made in [5], where the Cubic Regularization of the Newton method was presented together with its global convergence properties. The main standard assumption is that the Hessian of the objective is Lipschitz continuous with some parameter $L_2 \ge 0$:
$$ \|\nabla^2 f(x) - \nabla^2 f(y)\| \;\le\; L_2 \|x - y\|, $$
ensuring a global upper approximation of our function formed by the second-order Taylor polynomial augmented by the third power of the norm. The next point is then defined as a minimum of this upper model:
$$ x_{k+1} \;\in\; \mathop{\mathrm{Argmin}}_{y} \Big\{ f(x_k) + \langle \nabla f(x_k), y - x_k\rangle + \tfrac{1}{2}\langle \nabla^2 f(x_k)(y - x_k), y - x_k\rangle + \tfrac{L_2}{6}\|y - x_k\|^3 \Big\}. \qquad (1.1) $$
Till now, this idea has a full theoretical justification only for the Euclidean norm $\|\cdot\|$. In this case, the solution to the auxiliary minimization problem (1.1) does not have a closed-form expression, but it can be found by solving a one-dimensional nonlinear equation and by using the standard factorization tools of Linear Algebra. However, even in the Euclidean case, the presence of the cubic term in the objective prevents the usage of gradient-type methods (like conjugate gradients, etc.). This drawback does not allow the application of method (1.1) to large-scale problems.
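The following sketch illustrates how the Euclidean subproblem (1.1) can be reduced to a one-dimensional equation in the convex case; the bisection-based solver and all names are ours and only indicate the idea, not the exact procedure used in [5]:

```python
import numpy as np

def cubic_newton_step(grad_x, hess_x, x, M, tol=1e-10):
    """Euclidean cubic-regularized Newton step (convex case, hess_x >= 0).
    The minimizer of the cubic model has the form h = -(hess + (M*r/2) I)^{-1} grad
    with r = ||h||, so it suffices to find the root r of the scalar equation
    ||h(r)|| = r; since ||h(r)|| is nonincreasing in r, bisection applies."""
    n = x.size
    h_of = lambda r: -np.linalg.solve(hess_x + 0.5 * M * r * np.eye(n), grad_x)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(h_of(hi)) > hi:      # bracket the root from above
        hi *= 2.0
    for _ in range(200):                      # bisection on phi(r) = ||h(r)|| - r
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(h_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return x + h_of(0.5 * (lo + hi))
```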
In this paper, we show how to avoid these restrictions. Namely, we show that it is possible to use a quadratic regularization of the Taylor polynomial with a properly chosen coefficient that depends only on the current iterate. In the simplest form, one iteration of our method is as follows:
$$ x_{k+1} \;=\; x_k - \big(\nabla^2 f(x_k) + \lambda_k B\big)^{-1} \nabla f(x_k), \qquad (1.2) $$
where $B \succ 0$ is the operator defining the Euclidean norm and the regularization coefficient $\lambda_k$ is proportional to the square root of the norm of the current gradient (1.3). We see that it is very easy to implement, since it requires only one matrix inversion, a very standard operation of Linear Algebra. At the same time, this subproblem is now suitable for the classical Conjugate Gradient method as well. 1) It appears that for the optimization process (1.2), (1.3), we can establish global convergence guarantees of the same type as for the Cubic Newton method (1.1). Namely, we prove a global rate of the order $O(1/k^2)$ in terms of the functional residual and in terms of the subgradient norm for general convex functions. This is much faster than the standard $O(1/k)$ rate of the Gradient Method. Moreover, for uniformly convex functions of degree three, we prove a global linear rate. For strongly convex functions, we establish a local superlinear convergence.
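A minimal sketch of one such step in coordinates, assuming the Euclidean norm ($B = I$) and taking $\lambda_k = \sqrt{H\,\|\nabla f(x_k)\|}$ with an estimate $H$ of the Hessian Lipschitz constant (the exact normalization of (1.3) is not reproduced here):

```python
import numpy as np

def gradient_regularized_newton_step(grad_x, hess_x, x, H):
    """One step of the scheme (1.2)-(1.3) in the Euclidean, unconstrained case:
    the regularization coefficient is proportional to the square root of the
    current gradient norm, and the step costs a single linear solve."""
    lam = np.sqrt(H * np.linalg.norm(grad_x))
    return x - np.linalg.solve(hess_x + lam * np.eye(x.size), grad_x)
```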
Contents. In this paper, we consider optimization problems in a general composite form. We can work with arbitrary (possibly non-Euclidean) norms using the framework of Bregman distances.
In Section 2, we present the main properties of one iteration of the scheme. We study the convergence properties of the basic process in Section 3. In Section 4, we establish convergence rates for the norm of the gradient. A line search procedure for our scheme is discussed in Section 5. In Section 6, we consider an accelerated method based on the iterations of the basic process and justify its global complexity of the order $\tilde{O}(\epsilon^{-1/3})$, assuming Lipschitz continuity of the Hessian of the smooth part of the objective function.
Notation. Let us fix a finite-dimensional real vector space E. Our goal is to solve the following Composite Minimization Problem
$$ \min_{x \in \mathrm{dom}\,\psi} \Big\{ F(x) \;:=\; f(x) + \psi(x) \Big\}, \qquad (1.4) $$
where $\psi(\cdot)$ is a simple closed convex function with dom ψ ⊆ E, and $f(\cdot)$ is a convex and twice continuously differentiable function.
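As a concrete (hypothetical) instance of problem (1.4), one can think of a smooth logistic-type loss for the part $f$ and the $\ell_1$-norm for the part $\psi$; the data and weights below are purely illustrative:

```python
import numpy as np

A_data = np.array([[1.0, -0.5], [0.2, 2.0], [-1.5, 0.7]])   # illustrative data
b_data = np.array([1.0, -1.0, 1.0])                          # illustrative labels

def f(x):
    """Smooth part: convex, twice differentiable, with Lipschitz Hessian."""
    return np.sum(np.log1p(np.exp(-b_data * (A_data @ x))))

def psi(x, reg=0.1):
    """Simple closed convex (possibly nonsmooth) part: an l1 penalty."""
    return reg * np.linalg.norm(x, 1)

def F(x):
    """Composite objective F = f + psi of problem (1.4)."""
    return f(x) + psi(x)
```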
We measure distances in E by a general norm $\|\cdot\|$. Its dual space is denoted by $E^*$; it is the space of all linear functions on E, for which we define the norm in the standard way:
$$ \|s\|_* \;=\; \max_{\|x\| \le 1} \langle s, x\rangle, \qquad s \in E^*. $$
Using this norm, we can define an induced norm for a self-adjoint linear operator $B : E \to E^*$ as follows:
$$ \|B\| \;=\; \max_{\|x\| \le 1} |\langle Bx, x\rangle|. $$
We can also define the bounds of its spectrum as the best values $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ satisfying the conditions
$$ \lambda_{\min}(B)\,\|x\|^2 \;\le\; \langle Bx, x\rangle \;\le\; \lambda_{\max}(B)\,\|x\|^2, \qquad x \in E. $$
Our optimization schemes will be based on some scaling function $d(\cdot)$, which we assume to be a strongly convex function with Lipschitz-continuous gradients:
$$ \langle \nabla d(x) - \nabla d(y), x - y\rangle \;\ge\; \sigma \|x - y\|^2, \qquad (1.5) $$
$$ \|\nabla d(x) - \nabla d(y)\|_* \;\le\; \|x - y\|, \qquad (1.6) $$
where $\sigma \in (0, 1]$ and the points $x, y \in \mathrm{dom}\,\psi$ are arbitrary. For twice-differentiable scaling functions, this condition can be characterized by the following bounds on the Hessian:
$$ \sigma \|h\|^2 \;\le\; \langle \nabla^2 d(x)\, h, h\rangle \;\le\; \|h\|^2, \qquad h \in E. $$
Using this function, we define the following Bregman distance:
$$ \beta_d(x; y) \;=\; d(y) - d(x) - \langle \nabla d(x), y - x\rangle. \qquad (1.7) $$
The standard condition for the smooth part of the objective function in problem (1.4) is Lipschitz continuity of the Hessians:
$$ \|\nabla^2 f(x) - \nabla^2 f(y)\| \;\le\; L_2 \|x - y\|, \qquad x, y \in \mathrm{dom}\,\psi. \qquad (1.8) $$
This inequality has the following consequences, which are valid for all $x, y \in \mathrm{dom}\,\psi$:
$$ \|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \;\le\; \tfrac{L_2}{2}\|y - x\|^2, \qquad (1.9) $$
$$ \big| f(y) - f(x) - \langle \nabla f(x), y - x\rangle - \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle \big| \;\le\; \tfrac{L_2}{6}\|y - x\|^3. \qquad (1.10) $$

1) When this paper was already finished, we discovered that this idea was recently proposed by K. Mishchenko [3] for solving an unconstrained minimization problem with a smooth objective. As compared to his work, our main advances consist in the usage of Bregman distances, the composite form of the optimization problem, the linear rate of convergence for uniformly convex functions, and the development of an accelerated variant of the method.
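A small sketch of the Bregman distance (1.7), together with a check that the Euclidean scaling function $d(x) = \tfrac12\|x\|^2$ (for which $\sigma = 1$) reduces it to half of the squared distance; the helper names are ours:

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance beta_d(x; y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x).dot(y - x)

# Euclidean scaling function d(x) = ||x||^2 / 2, so beta_d(x; y) = ||y - x||^2 / 2.
d = lambda z: 0.5 * z.dot(z)
grad_d = lambda z: z
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
assert abs(bregman(d, grad_d, x, y) - 0.5 * np.linalg.norm(y - x) ** 2) < 1e-12
```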

Gradient regularization
Our main iteration at some point $x \in \mathrm{dom}\,\psi$ with a step-size $A > 0$ is defined as follows:
$$ T_A(x) \;\in\; \mathop{\mathrm{Argmin}}_{y \in \mathrm{dom}\,\psi} \Big\{ M_A(x, y) \;:=\; f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle + A\,\beta_d(x; y) + \psi(y) \Big\}. \qquad (2.1) $$
In what follows, we denote by $M_A(x)$ the optimal value of this auxiliary problem. The solution to this problem, $T = T_A(x)$, is characterized by the following variational principle: for all $y \in \mathrm{dom}\,\psi$,
$$ \langle \nabla f(x) + \nabla^2 f(x)(T - x) + A(\nabla d(T) - \nabla d(x)), \, y - T\rangle + \psi(y) \;\ge\; \psi(T). \qquad (2.2) $$
In particular, the vector
$$ \psi'(T) \;:=\; -\nabla f(x) - \nabla^2 f(x)(T - x) - A(\nabla d(T) - \nabla d(x)) \qquad (2.3) $$
belongs to $\partial \psi(T)$, and hence $F'(T) := \nabla f(T) + \psi'(T) \in \partial F(T)$. Note that this is a very special way of selecting a subgradient of the possibly nonsmooth function $F(\cdot)$, which allows us to control its norm. Let us prove the following fact.

Proof:
For the optimization problem in (2.1), define the corresponding scaling function $\xi(\cdot)$. Note that the objective function in this problem is strongly convex relative to $\xi(\cdot)$ with constant one. Therefore, the relative strong convexity inequality (2.4) holds for all $y \in \mathrm{dom}\,\psi$. In order to prove (2.5), note that $M_A(x, x) = F(x)$; hence we get (2.5) from (2.4) with $y = x$.

✷
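To make the auxiliary step (2.1) more concrete, here is a minimal sketch of one possible inner solver for the Euclidean scaling function $d(x) = \tfrac12\|x\|^2$ (so the Bregman term is $\tfrac{A}{2}\|y - x\|^2$) and $\psi$ equal to a weighted $\ell_1$-norm; the proximal-gradient loop and all names are ours, and in the unconstrained case ($\psi \equiv 0$) the step is just a single linear solve:

```python
import numpy as np

def solve_subproblem_l1(grad_x, hess_x, x, A_reg, reg, iters=500):
    """Minimize the model of (2.1) with Euclidean scaling and l1 composite part:
       <g, y-x> + 0.5 <H (y-x), y-x> + (A/2) ||y-x||^2 + reg * ||y||_1
    by proximal gradient (ISTA) with step 1/L, where L = lambda_max(H) + A."""
    soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    L = np.linalg.eigvalsh(hess_x).max() + A_reg
    y = x.copy()
    for _ in range(iters):
        smooth_grad = grad_x + hess_x @ (y - x) + A_reg * (y - x)
        y = soft(y - smooth_grad / L, reg / L)
    return y
```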
In what follows, the parameter A in the optimization problem (2.1) is chosen as $A = A_H(x)$, which is proportional to the square root of the norm of the current subgradient of $F$, where $H > 0$ is an estimate of the Lipschitz constant $L_2$ in (1.8). This choice is explained by the following result.
Corollary 1 For $A = A_H(x)$, we have inequality (2.7).

Proof: Indeed, this is a simple consequence of inequality (2.5) and definition (2.1). ✷

Let us now relate the optimal value of the auxiliary problem (2.1) with the cubic overapproximation (1.10).
Lemma 2 Let $A = A_H(x)$ and $T = T_A(x)$, and assume that for some $H > 0$ condition (2.8) is satisfied. Then estimate (2.9) holds.

Proof: Under condition (2.8), the cubic overapproximation (1.10) gives $F(T) \le M_A(x)$, and (2.9) follows from (2.4) with $y = x$. ✷

Finally, we need to estimate the norm of the subgradient at the new point. Combining the rule (2.3) with the properties of the scaling function gives the first inequality in (2.10); for the second one, we can continue by applying (1.9). Now we can prove the main theorem of this section.
Theorem 1 Let $A = A_H(x)$ and $T = T_A(x)$. If for this point relation (2.8) is valid, then

Properties of the minimization process
Now we can analyze the following minimization process.
Initialization. Choose $H \ge L_2$ and $x_0 \in \mathrm{dom}\,\psi$. Iteration $k \ge 0$: 1) compute $A_k = A_H(x_k)$; 2) compute $x_{k+1} = T_{A_k}(x_k)$ and define the corresponding subgradient of $\psi$ by the rule (2.3). (3.1)

Let us introduce the distance to the initial level set,
$$ D \;:=\; \sup\big\{ \|x - x^*\| : \; x \in \mathrm{dom}\,\psi, \; F(x) \le F(x_0) \big\}, $$
where $x^*$ is an optimal solution of problem (1.4), and which we assume to be bounded: $D < +\infty$. We can prove the following convergence rate for method (3.1) (Theorem 2).

Proof: Since for all $k \ge 1$ the subgradients of $\psi(\cdot)$ are defined by the rule (2.3), we can use the results of Section 2. Summing up the resulting per-iteration inequalities, we obtain inequality (3.2). ✷
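A minimal sketch of the whole process (3.1) in the unconstrained Euclidean case ($\psi \equiv 0$, $d(x) = \tfrac12\|x\|^2$), with $A_k$ taken proportional to the square root of the current gradient norm as in (1.3); the toy objective at the end is ours and serves only as a smoke test:

```python
import numpy as np

def grad_reg_newton(grad, hess, x0, H, tol=1e-8, max_iters=100):
    """Basic process (3.1), Euclidean unconstrained case: repeat the regularized
    Newton step with coefficient sqrt(H * ||grad||) until the gradient is small.
    H plays the role of the estimate of the Hessian Lipschitz constant."""
    x = x0.copy()
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        lam = np.sqrt(H * np.linalg.norm(g))
        x = x - np.linalg.solve(hess(x) + lam * np.eye(x.size), g)
    return x

# toy smooth convex objective f(x) = sum_i log(1 + exp(a_i^T x))
A = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad = lambda x: A.T @ sigmoid(A @ x)
hess = lambda x: A.T @ np.diag(sigmoid(A @ x) * (1.0 - sigmoid(A @ x))) @ A
x_approx = grad_reg_newton(grad, hess, np.ones(2), H=1.0)
```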

Corollary 2
The second condition of Theorem 2 can be valid only for a number of iterations bounded as in (3.4).

Remark 1 The right-hand side of inequality (3.4) can be used for defining the optimal value of the parameter H. Indeed, it can be chosen as a minimizer of the following function:

This gives us the corresponding optimal value of the parameter H and the resulting form of the complexity bound. Let us now estimate the performance of method (3.1) on uniformly convex functions. Consider the case when the function $F(\cdot)$ is uniformly convex of degree three, i.e., condition (3.7) holds. For the composite $F(\cdot)$, this property can be ensured either by its smooth component $f(\cdot)$ or by the general component $\psi(\cdot)$. In the latter case, it is not necessary to coordinate this assumption with the smoothness condition (1.8).
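For reference, the standard form of uniform convexity of degree three, which is the type of condition meant by (3.7) (the paper's own constant and its scaling may differ; we denote it here by $\sigma_3$), is
$$ F(y) \;\ge\; F(x) + \langle F'(x), y - x\rangle + \tfrac{\sigma_3}{3}\,\|y - x\|^3, \qquad x, y \in \mathrm{dom}\,\psi, \; F'(x) \in \partial F(x). $$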
In our analysis, we need the following straightforward consequence of definition (3.7), stated as inequality (3.8).

Theorem 3 Let $F(\cdot)$ satisfy condition (3.7). Then for all $k \ge 0$ we have the linear rate (3.9), where $S$ denotes the corresponding convergence factor. Proof: Combining the per-iteration guarantees of Section 2 with (3.8), we obtain exactly inequality (3.9). ✷

Remark 2 In accordance with the estimate (3.9), the highest rate of convergence corresponds to the maximal value of $S$. This means that we need to minimize the factor $c^{3/2} H^{1/2}$ in $H$. The optimal value is given by $H_{\#} = 3\sigma L_2$. With this choice, the estimate (3.9) takes its best form. Finally, let us prove a superlinear rate of local convergence for the scheme (3.1).
Theorem 4 Let the function $f(\cdot)$ be strongly convex on dom ψ with parameter $\mu > 0$. If $H \ge L_2$, then for any $k \ge 0$ we have the corresponding superlinear estimate. Proof: Indeed, for any $k \ge 0$, combining strong convexity with the per-iteration bounds of Section 2 yields the required estimate. ✷

Thus, the region of superlinear convergence of method (3.1) is described by (3.12). Note that outside this region, the constant of strong convexity of the objective function in problem (2.1) with $A = A_H(x)$ satisfies a corresponding lower bound.

Estimating the norm of the gradient

Let us estimate the efficiency of method (3.1) in decreasing the norm of the gradients. For that, we are going to derive an upper bound for the number of steps $N$ of method (3.1) for which the norm of the subgradient still remains above a given level. In this section, we use the notation of Section 3. Firstly, consider the case when the smooth component $f(\cdot)$ in the objective function of problem (1.4) satisfies condition (1.8). It is convenient to assume that the number of iterations $N$ of the method is a multiple of three: $N = 3m$. Then for the last $m$ iterations of the scheme we have estimate (4.1), while for the first $2m$ iterations we obtain the complementary bounds. Hence, using inequality (4.4) and the squared inequality (4.5), we obtain the required bound on $N$. Thus, we can prove the following theorem.
In other words, the method needs at most $O(\epsilon^{-1/2})$ iterations to make the norm of the subgradient smaller than $\epsilon$. Thus, we have proved the corresponding complexity theorem.

Adaptive line search
The main advantage of the method (3.1) is its ease of implementation. Indeed, in the case $\psi(\cdot) \equiv 0$ with dom ψ = E, the iteration (2.1) reduces mainly to a matrix inversion, a very standard operation of Linear Algebra, which is available in the majority of software packages. However, for a better performance of this scheme, it is necessary to apply a dynamic strategy for updating the step-size coefficient H. Let us show how this can be done. Consider the following optimization method.

Gradient Regularization of Newton Method with Line Search
Initialization. Choose $H_0 \le L_2$ and $x_0 \in \mathrm{dom}\,\psi$. Note that this scheme does not depend on any particular value of the Lipschitz constant. By the definitions of the updates and from inequality (1.10), we conclude that the required inequalities of Section 2 remain valid at every accepted step. Hence, from Theorem 1, we have the corresponding progress established for each iteration $k \ge 0$. Repeating the reasoning of Theorem 2, we obtain an analogous complexity result for this adaptive scheme.
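One standard way to implement such a dynamic strategy (in the unconstrained Euclidean case) is to double the current estimate $H$ until an acceptance test holds and to halve it after each successful step. The sketch below uses a test in the spirit of (1.9) as a stand-in; the acceptance condition actually used in this section may differ:

```python
import numpy as np

def adaptive_grad_reg_newton(grad, hess, x0, H0, tol=1e-8, max_iters=100):
    """Adaptive version of the basic process: try the current estimate H,
    double it until the trial point passes an acceptance test, then accept
    the step and halve H for the next iteration."""
    x, H = x0.copy(), H0
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        Hx = hess(x)
        while True:
            lam = np.sqrt(H * np.linalg.norm(g))
            T = x - np.linalg.solve(Hx + lam * np.eye(x.size), g)
            # stand-in acceptance test: the new gradient is well predicted
            # by the quadratic model (cf. (1.9)); it holds once H >= L_2
            if np.linalg.norm(grad(T) - g - Hx @ (T - x)) <= 0.5 * H * np.linalg.norm(T - x) ** 2:
                break
            H *= 2.0
        x, H = T, max(H / 2.0, 1e-12)
    return x
```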

Acceleration
Let us present a conceptual acceleration of our method, based on the contracting proximal iterations [2]. First, we fix an auxiliary prox-function $\varphi(\cdot)$ that we assume to be uniformly convex of degree three with respect to the initial norm. At each iteration $k \ge 0$ of the accelerated scheme, we form the following functions:
$$ g_{k+1}(x) \;\stackrel{\mathrm{def}}{=}\; B_{k+1}\, f\Big(\tfrac{b_{k+1} x + B_k x_k}{B_{k+1}}\Big), \qquad h_{k+1}(x) \;\stackrel{\mathrm{def}}{=}\; g_{k+1}(x) + b_{k+1}\,\psi(x) + \beta_{\varphi}(v_k; x), $$
where $\{b_k\}_{k \ge 1}$ is a sequence of positive numbers, $B_k \stackrel{\mathrm{def}}{=} \sum_{i=1}^{k} b_i$, $B_0 \stackrel{\mathrm{def}}{=} 0$, and $\{x_k\}_{k \ge 0}$, $\{v_k\}_{k \ge 0}$, $x_0 = v_0$, are sequences of trial points that belong to dom ψ. For simplicity of the presentation, we assume that $f$ is three times differentiable on an open set containing dom ψ. Note that the derivatives of $g_{k+1}(\cdot)$ and $f(\cdot)$ are related as follows: for any $h \in E$,
$$ D^3 g_{k+1}(x)[h]^3 \;\equiv\; \tfrac{b_{k+1}^3}{B_{k+1}^2}\, D^3 f\Big(\tfrac{b_{k+1} x + B_k x_k}{B_{k+1}}\Big)[h]^3. $$
Let us choose $b_k := \tfrac{k^2}{9 L_2(f)}$. Then $L_2(g_{k+1}) \le 1$, and we can minimize the objective $h_{k+1}$ very efficiently by using our method (3.1). Namely, in order to find a point $v$ with a small norm of a subgradient, $\|g\|_* \le \delta$ for some $g \in \partial h_{k+1}(v)$, the method needs to do no more than a logarithmic (in the required accuracy) number $N$ of iterations, due to the uniform convexity of $h_{k+1}$. Each iteration $k \ge 0$ of the accelerated scheme then consists of the following steps: form the auxiliary objective $h_{k+1}(\cdot)$; find a point $v_{k+1}$ by method (3.1) such that $\|g\|_* \le \delta$ for some $g \in \partial h_{k+1}(v_{k+1})$; and set $x_{k+1} = \tfrac{b_{k+1} v_{k+1} + B_k x_k}{B_{k+1}}$.
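A conceptual sketch of this outer accelerated loop, with the inner minimization of $h_{k+1}$ delegated to a generic solver as a stand-in for method (3.1); the cubic prox-function, the solver choice, and all parameter names here are our illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize  # stand-in for the inner solver, i.e. method (3.1)

def accelerated_contracting_proximal(f, psi, x0, b_seq, inner_tol=1e-8):
    """Outer loop of the accelerated scheme: contract the smooth part, add the
    scaled composite part and the Bregman distance of a cubic prox-function,
    approximately minimize, and update the convex combination of trial points."""
    x, v, B = x0.copy(), x0.copy(), 0.0
    phi = lambda z: np.linalg.norm(z - x0) ** 3 / 3.0          # cubic prox-function
    grad_phi = lambda z: np.linalg.norm(z - x0) * (z - x0)
    for b in b_seq:                                            # b_1, b_2, ... > 0
        B_new = B + b
        beta = lambda y, v=v: phi(y) - phi(v) - grad_phi(v).dot(y - v)
        h = lambda y, b=b, B=B, B_new=B_new, x=x: (
            B_new * f((b * y + B * x) / B_new) + b * psi(y) + beta(y))
        v = minimize(h, v, method="Nelder-Mead", options={"xatol": inner_tol}).x
        x = (b * v + B * x) / B_new
        B = B_new
    return x
```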