Abstract
In this paper, we propose a first second-order scheme based on arbitrary non-Euclidean norms, incorporated by Bregman distances. They are introduced directly in the Newton iterate with a regularization parameter proportional to the square root of the norm of the current gradient. For the basic scheme, as applied to the composite convex optimization problem, we establish a global convergence rate of the order \(O(k^{-2})\) both in terms of the functional residual and in the norm of subgradients. Our main assumption on the smooth part of the objective is Lipschitz continuity of its Hessian. For uniformly convex functions of degree three, we justify a global linear rate, and for strongly convex functions we prove a local superlinear rate of convergence. Our approach can be seen as a relaxation of the Cubic Regularization of the Newton method (Nesterov and Polyak in Math Program 108(1):177–205, 2006) for convex minimization problems. This relaxation preserves the convergence properties and global complexities of the Cubic Newton in the convex case, while the auxiliary subproblem at each iteration is simpler. We equip our method with an adaptive search procedure for choosing the regularization parameter. We also propose an accelerated scheme with convergence rate \(O(k^{-3})\), where k is the iteration counter.
1 Introduction
The classical Newton’s method is a powerful tool for solving various optimization problems and for dealing with ill-conditioning. The practical implementation of this method for solving the unconstrained minimization problem \(\min \nolimits _{x} f(x)\) can be written as follows:
\[x_{k+1} = x_k - \alpha _k \bigl [ \nabla ^2 f(x_k) \bigr ]^{-1} \nabla f(x_k), \qquad k \ge 0,\]
where \(0 < \alpha _k \le 1\) is a damping parameter. However, this approach has two serious drawbacks. Firstly, the next point is not well-defined when the Hessian is a degenerate matrix. And secondly, while the method has a very fast local quadratic convergence, it is difficult to establish any global properties for this process. Indeed, for \(\alpha _k = 1\) (the classical pure Newton method), there are known examples of problems for which the method does not converge globally [5]. The pure Newton step might not work even if the objective is strongly convex (see, e.g., Example 1.4.3 in [6]). For the damped Newton method with line search, it is possible to prove some global convergence rates. But, typically, they are worse than the rates of the classical Gradient Method [18].
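A breakthrough in the second-order optimization theory was made after [19], where the Cubic Regularization of the Newton method was presented together with its global convergence properties. The main standard assumption is that the Hessian of the objective is Lipschitz continuous with some parameter \(L_2 \ge 0\):
\[\Vert \nabla ^2 f(x) - \nabla ^2 f(y) \Vert \le L_2 \Vert x - y \Vert , \qquad \forall x, y,\]
ensuring the global upper approximation of our function formed by the second-order Taylor polynomial augmented by the third power of the norm. The next point is then defined as the minimum of the upper model:
\[x_{k+1} \in \mathop {\textrm{Argmin}}\limits _{y} \Bigl \{ f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{1}{2} \langle \nabla ^2 f(x_k)(y - x_k), y - x_k \rangle + \frac{L_2}{6} \Vert y - x_k \Vert ^3 \Bigr \}. \tag{1}\]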
Initially, this idea had a full theoretical justification only for the Euclidean norm \(\Vert \cdot \Vert \). In this case, the solution to the auxiliary minimization problem (1) does not have a closed form expression, but it can be found by solving a one-dimensional nonlinear equation and by using the standard factorization tools of Linear Algebra. The use of general and even variable norms with cubic regularization in second-order methods was considered recently in [11, 15], which can be useful for solving optimization problems with non-Euclidean geometry.
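To make the cost of the cubic subproblem concrete, the following is a minimal sketch (not the implementation of [19]) of one Euclidean cubic step for a positive semidefinite Hessian. It assumes the standard optimality condition \(g + Hh + \frac{L_2}{2}\Vert h \Vert h = 0\), so that the step length \(r = \Vert h \Vert \) solves a one-dimensional equation; the function name `cubic_newton_step` and the use of bisection are illustrative choices.

```python
import numpy as np

def cubic_newton_step(g, H, L2, tol=1e-12, max_iter=200):
    """Minimize <g,h> + 0.5<Hh,h> + (L2/6)||h||^3 for symmetric PSD H (Euclidean norm)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    # One eigendecomposition, reused for every trial value of r.
    lam, Q = np.linalg.eigh(H)
    lam = np.maximum(lam, 0.0)          # convex case: clip rounding noise
    gq = Q.T @ g

    def step_norm(r):
        # Norm of h(r) = -(H + (L2/2) r I)^{-1} g.
        return np.linalg.norm(gq / (lam + 0.5 * L2 * r))

    # The root of ||h(r)|| = r lies in (0, sqrt(2||g||/L2)]; bisect on it.
    lo, hi = 0.0, np.sqrt(2.0 * gnorm / L2)
    for _ in range(max_iter):
        r = 0.5 * (lo + hi)
        if step_norm(r) > r:
            lo = r
        else:
            hi = r
        if hi - lo <= tol * hi:
            break
    return -Q @ (gq / (lam + 0.5 * L2 * hi))
```

Even in this simple form, the step needs a factorization plus an inner root-finding loop, which is the overhead that a purely quadratic regularization avoids.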
However, even in the Euclidean case, the presence of the cubic term in the objective makes it more difficult to use the classical gradient-type methods with their developed complexity theory. While it is possible to apply gradient descent [2], the cubic subproblem prevents the use of the standard accelerated and conjugate gradient methods. This drawback restricts the application of method (1) to large-scale problems.
In this paper, we show how to avoid these restrictions. Namely, we will show that it is possible to use a quadratic regularization of the Taylor polynomial with a properly chosen coefficient that depends only on the current iterate. In the simplest form, one iteration of our method is as follows:
\[x_{k+1} = x_k - \bigl ( \nabla ^2 f(x_k) + A_k I \bigr )^{-1} \nabla f(x_k), \qquad k \ge 0, \tag{2}\]
where
\[A_k = \sqrt{\frac{L_2}{3} \Vert \nabla f(x_k) \Vert }. \tag{3}\]
We see that it is very easy to implement, since it requires only one matrix inversion, a very standard operation of Linear Algebra. At the same time, this subproblem is now suitable for the classical Conjugate Gradient method as well (see the Notes section).
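As an illustration, here is a minimal Euclidean sketch of the iteration \(x_{k+1} = x_k - (\nabla ^2 f(x_k) + A_k I)^{-1} \nabla f(x_k)\) with \(A_k = \sqrt{L_2 \Vert \nabla f(x_k) \Vert / 3}\). The test function \(f(x) = \sum _i \ln \cosh x^{(i)}\), the value \(L_2 = 1\) (an upper estimate for this function), and all names are assumptions made for the example; they are not taken from the experiments of the paper.

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = sum_i log(cosh(x_i)).
    return np.tanh(x)

def hess(x):
    # Hessian of f: diagonal with entries 1 - tanh(x_i)^2.
    return np.diag(1.0 - np.tanh(x) ** 2)

def grad_reg_newton(grad, hess, x0, L2, iters=100, tol=1e-12):
    """Newton steps regularized by A_k = sqrt(L2 * ||grad f(x_k)|| / 3)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        A = np.sqrt(L2 * gnorm / 3.0)
        # One linear solve per iteration; no inner root-finding is needed.
        x = x - np.linalg.solve(hess(x) + A * np.eye(x.size), g)
    return x

x_final = grad_reg_newton(grad, hess, [3.0, -2.0, 1.5], L2=1.0)

# For comparison: a pure Newton step in one dimension from x = 3 overshoots badly.
pure_newton_from_3 = 3.0 - np.tanh(3.0) / (1.0 - np.tanh(3.0) ** 2)
```

On this function the pure Newton step from \(x = 3\) moves far away from the minimizer at the origin, while the regularized iteration is expected to converge from the same starting region.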
For the class of Trust Region methods as applied to unconstrained minimization problems, the trust region radius proportional to the gradient norm was proposed in [9]. The use of the gradient norm as a regularizer for the Newton method was considered in the work [20]. Then, the method has a local quadratic convergence. However, to ensure some global rate for such regularization, one needs to use damping steps, which makes the rate slower.
It appears that for the optimization process (2), (3), we can establish the global convergence guarantees of the same type as for the Cubic Newton method (1). Namely, we prove the global rate of the order \(O(1/k^2)\) in terms of the functional residual and in terms of the subgradient norm for the general convex functions. This is much faster than the standard O(1/k)-rate of the Gradient Method. Moreover, for the uniformly convex functions of degree three, we prove the global linear rate. For the strongly convex functions we establish a local superlinear convergence.
In this paper, we consider convex optimization problems in a general composite form. Recently, globally convergent Newton methods for nonsmooth optimization were proposed in [13]. They are based on the damping steps and regularization by the gradient norm, which is different from the rule (3).
We also work with arbitrary (possibly non-Euclidean) norms by employing the technique of Bregman distances. An alternative approach of using general norms in the cubically regularized Newton scheme was proposed in [11], that uses the adaptive regularization framework of [3].
Contents. The rest of the paper is organized as follows. In Sect. 2, we present the main properties of one iteration of the scheme.
We study the convergence rate of the basic process in Sect. 3. In Sect. 4, we establish convergence for the norm of the gradient. An adaptive search procedure for our method is discussed in Sect. 5.
In Sect. 6, we consider an accelerated scheme based on the iterations of the basic method and justify its global complexity of the order \({\tilde{O}}(\epsilon ^{-1/3})\) assuming Lipschitz continuity of the Hessian of the smooth part of the objective function. Section 7 contains numerical experiments. Some concluding remarks are in Sect. 8.
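Notation. Let us fix a finite-dimensional real vector space \({\mathbb {E}}\). Our goal is to solve the following Composite Minimization Problem
\[\min _{x \in \textrm{dom}\,\psi } \bigl \{ F(x) {\mathop {=}\limits ^{\textrm{def}}} f(x) + \psi (x) \bigr \}, \tag{4}\]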
where \(\psi (\cdot )\) is a simple closed convex function with \(\textrm{dom}\,\psi \subseteq {\mathbb {E}}\), and \(f(\cdot )\) is a convex and two times continuously differentiable function.
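We measure distances in \({\mathbb {E}}\) by a general norm \(\Vert \cdot \Vert \). The dual space of \({\mathbb {E}}\) is denoted by \({\mathbb {E}}^{*}\). It is the space of all linear functions on \({\mathbb {E}}\), for which we define the norm in the standard way:
\[\Vert s \Vert _{*} = \max _{h \in {\mathbb {E}}} \bigl \{ \langle s, h \rangle : \Vert h \Vert \le 1 \bigr \}, \qquad s \in {\mathbb {E}}^{*}.\]
Using this norm, we can define an induced norm for a self-adjoint linear operator \(B: {\mathbb {E}}\rightarrow {\mathbb {E}}^*\) as follows:
\[\Vert B \Vert = \max _{h \in {\mathbb {E}}} \bigl \{ \Vert B h \Vert _{*} : \Vert h \Vert \le 1 \bigr \}.\]
We can also define the bounds of its spectrum as the best values \(\lambda _{\min }(B)\) and \(\lambda _{\max }(B)\) satisfying conditions
\[\lambda _{\min }(B) \Vert h \Vert ^2 \le \langle B h, h \rangle \le \lambda _{\max }(B) \Vert h \Vert ^2, \qquad h \in {\mathbb {E}}.\]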
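Our optimization schemes will be based on some scaling function \(d(\cdot )\), which we assume to be a strongly convex function with Lipschitz-continuous gradients:
\[\langle \nabla d(x) - \nabla d(y), x - y \rangle \ge \sigma \Vert x - y \Vert ^2, \qquad \Vert \nabla d(x) - \nabla d(y) \Vert _{*} \le \Vert x - y \Vert ,\]
where \(\sigma \in (0,1]\) and the points \(x, y \in \textrm{dom}\,\psi \) are arbitrary. For twice-differentiable scaling functions, this condition can be characterized by the following bounds on the Hessian:
\[\sigma \Vert h \Vert ^2 \le \langle \nabla ^2 d(x) h, h \rangle \le \Vert h \Vert ^2, \qquad x \in \textrm{dom}\,\psi , \; h \in {\mathbb {E}}.\]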
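Using this function, we define the following Bregman distance:
\[\rho (x, y) = d(y) - d(x) - \langle \nabla d(x), y - x \rangle , \qquad x, y \in \textrm{dom}\,\psi .\]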
We will employ this object to regularize the second-order model of the objective.
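As a small illustration of this object, the sketch below evaluates the Bregman distance \(\rho (x, y) = d(y) - d(x) - \langle \nabla d(x), y - x \rangle \) for two scaling functions. The argument order and the helper names are assumptions of this example; the entropy function is used only to show the well-known link to the Kullback-Leibler divergence and is not the scaling function analysed in the paper.

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """rho(x, y) = d(y) - d(x) - <grad d(x), y - x>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(d(y) - d(x) - grad_d(x) @ (y - x))

# Quadratic scaling function d(x) = 0.5 <Bx, x>: rho(x, y) = 0.5 <B(y-x), y-x>.
B = np.array([[2.0, 1.0], [1.0, 3.0]])
d_quad = lambda x: 0.5 * x @ B @ x
grad_quad = lambda x: B @ x

# Negative entropy d(x) = sum_i x_i log x_i: on the simplex, rho(x, y) = KL(y || x).
d_ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0

rho_quad = bregman(d_quad, grad_quad, [1.0, 0.0], [0.0, 2.0])
p, q = np.array([0.5, 0.5]), np.array([0.25, 0.75])
rho_ent = bregman(d_ent, grad_ent, p, q)
```

For the quadratic choice the distance is symmetric in its arguments; for a general \(d(\cdot )\) it is not, so the order of arguments matters.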
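The standard condition for the smooth part of the objective function in problem (4) is Lipschitz continuity of the Hessians:
\[\Vert \nabla ^2 f(x) - \nabla ^2 f(y) \Vert \le L_2 \Vert x - y \Vert , \qquad x, y \in \textrm{dom}\,\psi , \tag{8}\]
that we always assume to be satisfied. This inequality has the following consequences, which are valid for all \(x,y \in \textrm{dom}\,\psi \):
\[\Vert \nabla f(y) - \nabla f(x) - \nabla ^2 f(x)(y - x) \Vert _{*} \le \frac{L_2}{2} \Vert y - x \Vert ^2, \tag{9}\]
and
\[\Bigl | f(y) - f(x) - \langle \nabla f(x), y - x \rangle - \frac{1}{2} \langle \nabla ^2 f(x)(y - x), y - x \rangle \Bigr | \le \frac{L_2}{6} \Vert y - x \Vert ^3. \tag{10}\]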
2 Gradient regularization
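Our main iteration at some point \({\bar{x}} \in \textrm{dom}\,\psi \) with a step-size \(A>0\) is defined as follows:
\[T_A({\bar{x}}) = \mathop {\textrm{argmin}}\limits _{y \in \textrm{dom}\,\psi } \Bigl \{ M_A({\bar{x}}, y) {\mathop {=}\limits ^{\textrm{def}}} f({\bar{x}}) + \langle \nabla f({\bar{x}}), y - {\bar{x}} \rangle + \frac{1}{2} \langle \nabla ^2 f({\bar{x}})(y - {\bar{x}}), y - {\bar{x}} \rangle + A \rho ({\bar{x}}, y) + \psi (y) \Bigr \}. \tag{11}\]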
This is minimization of a convex quadratic function augmented by Bregman distance and the composite part. Our main structural assumption is that both \(\rho (\bar{x}, \cdot )\) and \(\psi (\cdot )\) are simple, meaning that problem (11) is efficiently solvable.
The use of the general scaling function \(d( \cdot )\) can be beneficial in practice for solving problems with some specific non-Euclidean geometry.
Example 1
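Let \(\psi (x) \equiv 0\) and let the scaling function be \(d(x):= \frac{1}{2}\langle B x, x \rangle \) for a fixed positive definite self-adjoint operator \(B = B^{*} \succ 0\). Then,
\[\rho (x, y) = \frac{1}{2} \langle B(y - x), y - x \rangle ,\]
and one iteration (11) can be written in an explicit form, as follows:
\[T_A({\bar{x}}) = {\bar{x}} - \bigl ( \nabla ^2 f({\bar{x}}) + A B \bigr )^{-1} \nabla f({\bar{x}}).\]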
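For the quadratic scaling function of Example 1, the step amounts to one linear solve with the matrix \(\nabla ^2 f({\bar{x}}) + A B\). The sketch below is a hedged illustration: the helper names, the data, and the choice \(A = \sqrt{H \Vert \nabla f({\bar{x}}) \Vert _{*} / 3}\) with \(H = 1\) (valid as written only under the assumption \(\sigma = 1\)) are assumptions of the example. It also checks numerically the step-length bound \(\Vert T - {\bar{x}} \Vert \le \Vert \nabla f({\bar{x}}) \Vert _{*} / A\), which holds whenever the Hessian is positive semidefinite.

```python
import numpy as np

def primal_norm(h, B):
    """||h|| = <B h, h>^{1/2}."""
    return float(np.sqrt(h @ B @ h))

def dual_norm(s, B):
    """||s||_* = <s, B^{-1} s>^{1/2}, the norm dual to primal_norm."""
    return float(np.sqrt(s @ np.linalg.solve(B, s)))

def scaled_step(x, g, H, B, A):
    """Minimizer of <g, y-x> + 0.5 <H(y-x), y-x> + 0.5 * A * <B(y-x), y-x> over y."""
    return x - np.linalg.solve(H + A * B, g)

B = np.array([[2.0, 0.5], [0.5, 1.0]])      # positive definite scaling matrix
H = np.array([[1.0, 0.0], [0.0, 0.0]])      # degenerate (singular) Hessian
g = np.array([1.0, -1.0])
x = np.zeros(2)

A = np.sqrt(1.0 * dual_norm(g, B) / 3.0)    # assumed estimate H = 1, sigma = 1
T = scaled_step(x, g, H, B, A)
```

Note that the step is well defined even though the Hessian is singular, since \(A B \succ 0\).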
Example 2
Consider the unconstrained minimization problem \(\min _{x \in {\mathbb {R}}^n} f(x)\), with
\[f(x) = g(Cx), \qquad C \in {\mathbb {R}}^{m \times n},\]
where \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) is a convex smooth function. Let us fix the standard Euclidean norm \(\Vert \cdot \Vert _2\) in \({\mathbb {R}}^m\) and assume that the Hessian of g is Lipschitz continuous w.r.t. this norm with constant \(L_g\). Then, if we use the standard Euclidean norm \(\Vert \cdot \Vert _2\) for our primal space \({\mathbb {R}}^n\), the corresponding Lipschitz constant of \(\nabla ^2 f(\cdot )\) is
\[L_2 = L_g \Vert C \Vert ^3,\]
where \(\Vert C \Vert \) is the spectral norm of C. At the same time, using the following scaled norm \(\Vert x \Vert := \langle Bx, x\rangle ^{1/2}\), \(x \in {\mathbb {R}}^n\) with matrix \(B = C^T C\) (assuming \(B \succ 0\), so C has full column rank) and the scaling function from the previous example, we have
\[L_2 = L_g,\]
which is much better.
Example 3
Let \(\psi (\cdot )\) be the \(\{0, +\infty \}\)-indicator of the standard simplex
\[\varDelta _n = \Bigl \{ x \in {\mathbb {R}}^n_{+} : \sum _{i=1}^n x^{(i)} = 1 \Bigr \}.\]
Thus, problem (4) is to minimize a smooth convex function over this set:
\[\min _{x \in \varDelta _n} f(x).\]
One of the most suitable choices of the norm for this problem is the \(\ell _1\)-norm [1], defined as \(\Vert x \Vert _1 {\mathop {=}\limits ^{\textrm{def}}}\sum _{i = 1}^n |x^{(i)}|\) for \(x \in {\mathbb {R}}^n\). The Lipschitz constant w.r.t. this norm is smaller than the one measured in the \(\ell _2\)-norm. Let us fix some \(\delta > 0\), and use the following scaling function,
We have, for any \(h \in {\mathbb {R}}^n\) and \(x \in \varDelta _n\):
And, by Cauchy-Schwarz inequality, it holds
Hence,
and conditions (5), (6) are satisfied with \(\sigma = \frac{\delta }{1 + n \delta } = \frac{1}{1/\delta + n}\).
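The constant \(\sigma = \frac{\delta }{1 + n \delta }\) can be checked numerically. The sketch below assumes, purely for illustration, the entropy-type scaling function \(d(x) = \delta \sum _{i} (x^{(i)} + \delta ) \ln (x^{(i)} + \delta )\); this particular form is an assumption of the example (chosen because it reproduces the stated value of \(\sigma \) via the Cauchy-Schwarz argument), and it may differ from the function used in the text. The check samples points of the simplex and directions \(h\), and tests \(\sigma \Vert h \Vert _1^2 \le \langle \nabla ^2 d(x) h, h \rangle \le \Vert h \Vert _1^2\).

```python
import numpy as np

def hess_quadratic_form(x, h, delta):
    """<d''(x) h, h> for the assumed d(x) = delta * sum_i (x_i + delta) log(x_i + delta)."""
    return float(delta * np.sum(h ** 2 / (x + delta)))

rng = np.random.default_rng(0)
n, delta = 5, 0.1
sigma = delta / (1.0 + n * delta)

all_ok = True
for _ in range(1000):
    x = rng.dirichlet(np.ones(n))          # random point of the standard simplex
    h = rng.standard_normal(n)             # random direction
    q = hess_quadratic_form(x, h, delta)
    l1_sq = float(np.sum(np.abs(h)) ** 2)
    lower_ok = sigma * l1_sq <= q * (1.0 + 1e-12)
    upper_ok = q <= l1_sq * (1.0 + 1e-12)
    all_ok = all_ok and lower_ok and upper_ok
```

The trade-off is visible in the formula for \(\sigma \): a larger \(\delta \) improves the strong convexity constant, but it can never exceed \(1/n\).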
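In general, the solution \(T = T_A({\bar{x}})\) to problem (11) is characterized by the following variational principle (see, e.g. [18]):
\[\langle \nabla f({\bar{x}}) + \nabla ^2 f({\bar{x}})(T - {\bar{x}}) + A (\nabla d(T) - \nabla d({\bar{x}})), y - T \rangle + \psi (y) \ge \psi (T), \qquad \forall y \in \textrm{dom}\,\psi . \tag{12}\]
Thus, defining \(\psi '(T) = - \nabla f({\bar{x}}) - \nabla ^2f({\bar{x}})(T-{\bar{x}}) - A (\nabla d(T) - \nabla d({\bar{x}}))\), we see that \(\psi '(T) \in \partial \psi (T)\). Consequently,
\[F'(T) {\mathop {=}\limits ^{\textrm{def}}} \nabla f(T) + \psi '(T) \in \partial F(T). \tag{13}\]
Note that this is a very special way of selecting a subgradient of a possibly nonsmooth function \(F(\cdot )\), which allows \(\Vert F'(T) \Vert _*\) to approach zero.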
Denote \(M_A({\bar{x}}) = M_A({\bar{x}}, T_A({\bar{x}})) \le M_A({\bar{x}}, {\bar{x}}) = F({\bar{x}})\). Let us prove the following important fact, that uses convexity of the original problem (4).
Lemma 1
For all \(y \in \textrm{dom}\,\psi \) and \(T = T_A({\bar{x}})\), we have
Moreover,
where \(F'({\bar{x}}) = \nabla f({\bar{x}}) + \psi '({\bar{x}})\) and \(\psi '({\bar{x}})\) is an arbitrary element of \(\partial \psi ({\bar{x}})\).
Proof
For optimization problem in (11), define the scaling function
Note that the objective function in this problem is strongly convex relative to \(\xi (\cdot )\) with constant one. Therefore, for any \(y \in \textrm{dom}\,\psi \),
In order to prove (15), note that
Since \(M_A({\bar{x}}, {\bar{x}}) = F({\bar{x}})\), we get (15) from (14) with \(y = {\bar{x}}\). \(\square \)
In what follows, the parameter A in the optimization problem (11) is chosen as
where \(H > 0\) is an estimate of the Lipschitz constant \(L_2\) in (8). This choice is explained by the following result.
Corollary 1
For \(A = A_H({\bar{x}})\), we have
Proof
Indeed, this is a simple consequence of inequality (15) and definition (11). \(\square \)
Let us relate the optimal value of the auxiliary problem (11) with the cubic over-approximation (10).
Lemma 2
Let \(A = A_H({\bar{x}})\) and \(T = T_A({\bar{x}})\). Assume that for some \(H > 0\) the following condition is satisfied:
(clearly, it holds for \(H \ge L_2\), where \(L_2\) is the Lipschitz constant of the Hessian). Then
Proof
Indeed,
Thus, \(F(T) \le M_A({\bar{x}})\) and (19) follows from (14) with \(y = {\bar{x}}\). \(\square \)
Finally, we need to estimate the norm of subgradient at the new point.
Lemma 3
Let the Hessian be Lipschitz continuous with constant \(L_2\). Fix arbitrary \(H > 0\). Let \(A = A_H({\bar{x}})\) and \(T = T_A({\bar{x}})\). Then
where
Proof
Indeed,
This is the first inequality in (20). For the second one, we can continue as follows:
\(\square \)
Now we can prove the main theorem of this section.
Theorem 1
Let the Hessian be Lipschitz continuous with constant \(L_2\). Fix arbitrary \(H > 0\). Let \(A = A_H({\bar{x}})\) and \(T = T_A({\bar{x}})\). If for this point relation (18) is valid, then
Proof
We only need to insert in (19) the first inequality of (20) and definition (16). \(\square \)
3 Properties of the minimization process
In this section, we propose an iterative scheme based on the gradient regularization of the Newton steps. Note that the choice of the regularization parameter (16) depends solely on the current gradient norm and it can be easily computed at each iteration. Then, we do one regularized Newton step defined by (11). According to Theorem 1, repeating this process would result in monotone decrease of the objective.
First, we prove global convergence for the function value. In the next section, we also prove the convergence in terms of the gradient norm. Thus, a small gradient norm can serve as a stopping criterion for our scheme.
Let us analyze the following algorithm with a fixed value of parameter H.
\[\text{Choose } x_0 \in \textrm{dom}\,\psi . \quad \text{For } k \ge 0 \text{ iterate:} \quad x_{k+1} = T_{A_H(x_k)}(x_k). \tag{22}\]

Let us introduce the distance to the initial level set:
which we assume to be bounded: \(D < +\infty \). We can prove the following convergence rate for method (22).
Theorem 2
Let the Hessian be Lipschitz continuous with constant \(L_2\). Let \(H \ge L_2\) and \(F(x_k) - F^* \ge \epsilon \) for some \(k \ge 0\). Then,
Proof
Denote \(F_k = F(x_k) - F(x^*)\) and \(g_k = \Vert F'(x_k) \Vert _*\). Thus, \(F_k \le D g_k\). Note that
Since for all \(k \ge 1\), the subgradients of \(\psi (\cdot )\) are defined by the rule (13), we can use the results of Sect. 2. We can continue as follows:
Summing up these bounds and using the inequality of arithmetic and geometric means, we get
Since
we obtain inequality (23). \(\square \)
Corollary 2
The second condition of Theorem 2 can be valid only for
Remark 1
Note that up to the additive logarithmic term, the iteration complexity (25) matches that of the Cubically regularized Newton method as applied to convex functions [19] in the Euclidean case. However, iterations of our method (22) are easier to implement, and it is also possible to use an arbitrary scaling function \(d(\cdot )\).
Remark 2
The right-hand side of inequality (25) can be used for defining the optimal value of parameter H. Indeed, it can be chosen as a minimizer of the following function:
This gives us
In this case,
Let us estimate now the performance of method (22) on uniformly convex functions. Consider the case when function \(F(\cdot )\) is uniformly convex of degree three:
For the composite \(F(\cdot )\), this property can be ensured either by its smooth component \(f(\cdot )\), or by the general component \(\psi (\cdot )\). In the latter case, it is not necessary to coordinate this assumption with the smoothness condition (8).
In our analysis, we need the following straightforward consequence of definition (28):
Theorem 3
Let the Hessian be Lipschitz continuous with constant \(L_2\). Let \(F(\cdot )\) satisfy condition (28). If \(H \ge L_2\), then for all \(k \ge 0\) we have
where \(S = {3 \sqrt{3} \over 4 c^{3/2} } \sqrt{\sigma _3 \over H}\).
Proof
As in the proof of Theorem 2, denote \(F_k = F(x_k) - F^*\) and \(g_k = \Vert F'(x_k) \Vert _*\). Then, we have
where \(S = {3 \over 4c^{3/2}} \sqrt{3 \sigma _3 \over H}\). Denote \(\tau _k = \sqrt{g_{k+1} \over c g_k } {\mathop {\le }\limits ^{(20)}} 1\). Since \(\ln (\cdot )\) is a concave function, we have \(\ln (1+S \tau _k) \ge \tau _k \ln (1+S)\). Hence,
Note that \(\left( {g_k \over g_0}\right) ^{1/(2k)} = \exp \left( - {1 \over 2k} \ln {g_0 \over g_k} \right) \ge 1 + {1 \over 2k} \ln {g_k \over g_0} \ge 1 + {1 \over 2k} \ln {F_k \over g_0 D} = 1 - {1 \over 2k} \xi _k\). Thus,
and this is inequality (30). \(\square \)
Remark 3
In accordance with the estimate (30), the highest rate of convergence corresponds to the maximal value of S. This means that we need to minimize the factor \(c^{3/2} H^{1/2}\) in H. The optimal value is given by \(H_{\#} = {3 \sigma }L_2\). In this case,
Note that this condition number also corresponds to the global convergence of the Cubically regularized Newton method [8].
Finally, let us prove a superlinear rate of local convergence for scheme (22).
Theorem 4
Let the Hessian be Lipschitz continuous with constant \(L_2\). Let function \(f(\cdot )\) be strongly convex on \(\textrm{dom}\,\psi \) with parameter \(\mu > 0\). If \(H \ge L_2\), then for all \(k \ge 0\) we have
Proof
Indeed, for any \(k \ge 0\) we have
Therefore,
\(\square \)
Thus, the region of superlinear convergence of method (22) is as follows:
Note that outside this region, the constant of strong convexity of the objective function in problem (11) with \(A = A_H(x)\) satisfies the following lower bound:
4 Estimating the norm of the gradient
Let us estimate the efficiency of method (22) in decreasing the norm of gradients. For that, we are going to derive an upper bound for the number of steps N of method (22), for which we still have
We will see that the global complexities of our method for minimizing the gradient norm in the convex case are the same as those of the basic Cubic Newton [10].
In this section, we use notation of Sect. 3:
Firstly, consider the case when the smooth component \(f(\cdot )\) in the objective function of problem (4) satisfies condition (8). Then
It is convenient to assume that the number of iterations N of the method is a multiple of three:
Then for the last m iterations of the scheme we have
At the same time, for the first 2m iterations we obtain
Therefore,
Note that the power of \(g_{2m}\) in the last term is equal to that of \(\frac{1}{g_{2m}}\) in (38). This explains our choice of 2m for the length of the first stage.
Hence, using both inequalities (38) and (40), we obtain the following:
Note that \(g_{2m} {\mathop {\le }\limits ^{(20)}} c^{2m} g_0\). Therefore,
and we obtain
Thus, we can prove the following theorem.
Theorem 5
Let the Hessian be Lipschitz continuous with constant \(L_2\). Fix \(H \ge L_2\) and some \(\delta >0\). Then, the number of iterations of method (22) to reach small norm of the gradient \(\Vert F'(x_N) \Vert _{*} \le \delta \) satisfies the following bound:
Proof
Indeed,
and this is inequality (41). \(\square \)
Finally, let us estimate the efficiency of method (22) under additional assumption of uniform convexity (28). From the proof of Theorem 3, we know that
On the other hand,
Thus,
In other words,
Thus, we have proved the following theorem.
Theorem 6
Let the Hessian be Lipschitz continuous with constant \(L_2\). Let \(F(\cdot )\) satisfy condition (28). Fix \(H \ge L_2\) and some \(\delta >0\). Then, the number of iterations of method (22) needed to reach a small norm of the gradient \(\Vert F'(x_N) \Vert _{*} \le \delta \) satisfies the following bound:
5 Adaptive search procedure
The main advantage of the method (22) consists in its easy implementation. Indeed, in the case \(\psi (\cdot ) \equiv 0\) with \(\textrm{dom}\,\psi = {\mathbb {E}}\), the iteration (11) is reduced mainly to matrix inversion, the very standard operation of Linear Algebra, which is available in the majority of software packages. However, for the better performance of this scheme, it is necessary to apply a dynamic strategy for updating the step-size coefficient H. Let us show how this can be done.

For the initialization, we need an initial guess \(H_0\) for the regularization parameter, which can be an arbitrary sufficiently small number.
Note that this scheme does not depend on any particular value of the Lipschitz constant. By definitions of the updates and from inequality (10), we conclude that inequalities \(H_0 \le H_k \le L_2\) and \(2^{i_k} H_k \le 2L_2\) imply \(H_{k+1} \le L_2\). Thus,
Hence, from Theorem 1, we have the following progress established for each iteration \(k \ge 0\):
where
Repeating the reasoning of Theorem 2, we obtain the following complexity result.
Theorem 7
Let the Hessian be Lipschitz continuous with constant \(L_2\). Let \(F(x_k) - F^* \ge \epsilon \) for some iteration \(k \ge 0\) of method (43). Then,
Note that some scaling of the domain or the target objective may affect the fixed choice of regularization parameter in the basic scheme (22). At the same time, we expect the adaptive method (43) to be robust with respect to these changes.
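The adaptive strategy can be sketched as follows for the Euclidean case with \(\psi \equiv 0\). This is an illustration under stated assumptions rather than a transcription of scheme (43): the acceptance test used here is the cubic upper bound \(f(T) \le f(x) + \langle \nabla f(x), T - x \rangle + \frac{1}{2}\langle \nabla ^2 f(x)(T - x), T - x \rangle + \frac{H}{6}\Vert T - x \Vert ^3\), the trial value of H is doubled until the test passes, and the next estimate is half of the accepted value. The test function, the small numerical tolerance, and all names are hypothetical.

```python
import numpy as np

def f(x):
    # Numerically stable sum_i log(cosh(x_i)).
    a = np.abs(x)
    return float(np.sum(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)))

def grad(x):
    return np.tanh(x)

def hess(x):
    return np.diag(1.0 - np.tanh(x) ** 2)

def adaptive_grad_reg_newton(f, grad, hess, x0, H0=1e-3, iters=100, tol=1e-12):
    x = np.asarray(x0, dtype=float).copy()
    H, accepted = H0, []
    for _ in range(iters):
        g, Hx, fx = grad(x), hess(x), f(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        H_trial = H
        while True:
            A = np.sqrt(H_trial * gnorm / 3.0)
            h = -np.linalg.solve(Hx + A * np.eye(x.size), g)
            r = np.linalg.norm(h)
            upper = fx + g @ h + 0.5 * h @ Hx @ h + H_trial / 6.0 * r ** 3
            # Assumed acceptance test; the slack guards against rounding errors.
            if f(x + h) <= upper + 1e-12 * max(1.0, abs(fx)):
                break
            H_trial *= 2.0               # estimate too small: double and retry
        x = x + h
        H = 0.5 * H_trial                # be optimistic at the next iteration
        accepted.append(H_trial)
    return x, accepted

x_adaptive, accepted_H = adaptive_grad_reg_newton(f, grad, hess, [3.0, -2.0, 1.5])
```

Since the test always passes once the trial value reaches the true Lipschitz constant, the accepted estimates stay below twice that constant, and no prior knowledge of \(L_2\) is required.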
6 Acceleration
Let us present a conceptual acceleration of our method, that is based on the contracting proximal iterations [7].
First, we fix an auxiliary prox-function \(\phi (\cdot )\) that we assume to be uniformly convex of degree three with respect to the initial norm:
At each iteration \(k \ge 0\) of the accelerated scheme, we form the following functions:
where \(\{ b_k \}_{k \ge 1}\) is a sequence of positive numbers, \(B_k {\mathop {=}\limits ^{\textrm{def}}}\sum \nolimits _{i = 1}^k b_i\), \(B_0 {\mathop {=}\limits ^{\textrm{def}}}0\), and
are sequences of trial points that belong to \(\textrm{dom}\,\psi \).
Note that the derivatives of \(g_{k + 1}( \cdot )\) and \(f(\cdot )\) are related as follows:
For simplicity of the presentation, we assume that f is three times differentiable on the open set containing \(\textrm{dom}\,\psi \). Let us choose
Then, \(B_k = \frac{1}{9 L_2(f)} \sum \limits _{i = 1}^k i^2 \ge \frac{k^3}{27 L_2(f)}\). Therefore, for any \(h \in {\mathbb {E}}\):
thus \(L_2(g_{k + 1}) = 1\), and we can minimize objective \(h_{k + 1}\) very efficiently by using our method (22). Namely, in order to find a point v with a small norm of a subgradient:
the method needs to do no more than
steps, where \({\tilde{O}}(\cdot )\) hides absolute constants and logarithmic factors that depend on the initial residual and subgradient norm.
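The lower bound on \(B_k\) used above follows from the closed form of the sum of squares. A short derivation, assuming (as the stated expression for \(B_k\) suggests) that the weights are \(b_i = i^2 / (9 L_2(f))\):

```latex
\sum_{i=1}^{k} i^2 \;=\; \frac{k(k+1)(2k+1)}{6}
\;\ge\; \frac{k \cdot k \cdot 2k}{6} \;=\; \frac{k^3}{3}
\quad\Longrightarrow\quad
B_k \;=\; \frac{1}{9 L_2(f)} \sum_{i=1}^{k} i^2 \;\ge\; \frac{k^3}{27 L_2(f)} .
```

The cubic growth of \(B_k\) is what translates into the \(O(k^{-3})\) rate of the accelerated scheme.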
Let us write down the accelerated algorithm.

Applying directly Theorem 3.2 and the corresponding Corollary 3.3 from [7], we get the following complexity bound.
Theorem 8
Let the Hessian be Lipschitz continuous with constant \(L_2(f)\). Let us set \(\delta = \frac{1}{2 \cdot 3^{7/3}} \cdot \bigl (\frac{\epsilon }{L_2(f)}\bigr )^{2/3}\) in method (46), and let
Then, \(F(x_k) - F^{*} \le \epsilon . \) \(\square \)
7 Experiments
In this section, let us present computational results for solving the unconstrained minimization problem,
with objective that is a smooth convex approximation of pointwise maximum:
The problems of this type are important in applications with minimax strategies for matrix games and \(\ell _{\infty }\)-regression [17].
The vectors \(\{ a_i \in {\mathbb {R}}^n \}_{i=1}^m\) and numbers \(\{ b_i \in {\mathbb {R}}\}_{i = 1}^m\) are given data, while \(\mu > 0\) is a fixed parameter of smoothing.
Let us fix matrix \(B:= \sum _{i = 1}^m a_i a_i^T\), which we assume to be positive definite (otherwise, it is possible to reduce the dimensionality of the initial problem), and we use the following Euclidean norms:
respectively for the variables and for the gradients. We also know the corresponding Lipschitz constant for the Hessian, that is (see, e.g. Example 1.3.6 in [6])
To generate the data, we sample random elements \(\{\bar{a}_i \in {\mathbb {R}}^n, b_i \in {\mathbb {R}}\}_{i = 1}^m\) from the uniform distribution on \([-1, 1]\), and form an auxiliary function
Then, we set
Thus we ensure to have the optimum at the origin, since \(\nabla f(0) = 0\). We start the methods from \(x_0:= (1, 1, \ldots , 1)\).
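The data generation can be sketched in a few lines. The code below assumes the usual log-sum-exp smoothing of the pointwise maximum, \(f(x) = \mu \ln \bigl ( \sum _{i=1}^m \exp ( (\langle a_i, x \rangle - b_i)/\mu ) \bigr )\), and the shift \(a_i := {\bar{a}}_i - \nabla {\bar{f}}(0)\), which makes the gradient vanish at the origin; both formulas, as well as the function names and problem sizes, are assumptions of this illustration.

```python
import numpy as np

def smoothed_max(A, b, mu, x):
    """Value, gradient and Hessian of mu * log(sum_i exp((<a_i, x> - b_i) / mu))."""
    z = (A @ x - b) / mu
    zmax = z.max()
    w = np.exp(z - zmax)                 # shifted exponents for numerical stability
    s = w.sum()
    pi = w / s                           # softmax weights, sum to one
    value = mu * (zmax + np.log(s))
    g = A.T @ pi
    H = (A.T @ (pi[:, None] * A) - np.outer(g, g)) / mu
    return value, g, H

def make_problem(m, n, mu, seed=0):
    """Random data with the rows shifted so that the gradient at the origin is zero."""
    rng = np.random.default_rng(seed)
    A_bar = rng.uniform(-1.0, 1.0, size=(m, n))
    b = rng.uniform(-1.0, 1.0, size=m)
    _, g0, _ = smoothed_max(A_bar, b, mu, np.zeros(n))
    A = A_bar - g0                       # subtract grad f_bar(0) from every row a_i
    return A, b

A, b = make_problem(m=6, n=3, mu=0.5)
```

The shift works because the softmax weights at the origin depend only on \(b\) and \(\mu \), so they are the same for the auxiliary and the shifted data.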
We study the performance of the Newton method with Gradient regularization and with Cubic regularization [19] on this problem. Also, we compare our accelerated scheme (46) with the basic methods.
We use the following scaling function for this problem, as in Example 1:
The subproblem in our methods is solved exactly by using the standard matrix inversion. For the Cubic Newton, one needs to find a root of a one-dimensional nonlinear equation at each step (see Section 5 in [19]). To solve it, we apply the classical univariate Newton method and use the value \(\epsilon = 10^{-8}\) as a target tolerance in terms of the function value.
Regularization parameter is fixed according to the theory (47). The results are shown in Fig. . We see that both algorithms show reasonably good performance, which is better than the theoretical prediction of the global behaviour. The Newton method with Gradient regularization possesses the best convergence rate. Accelerated scheme has an improvement in the rate in the beginning, but the basic methods are better for the higher level of the accuracy due to their superlinear local convergence.
In the following experiment, we compare the use of the fixed Lipschitz constant with the adaptive search procedure for our method. The results are shown in Fig. . We see that the adaptive methods show the best performance. At the same time, iterations of the Gradient regularization are much cheaper, which results in better computational time.
Finally, we compare our approach with iterations of the damped Newton method with line search. For this problem, the Hessian is often degenerate, thus we use a small perturbation to correct the matrix. Namely, we consider the following iterations:
where \(\tau \) is a fixed small parameter (we set \(\tau = 10^{-6}\) which was tuned to have the best performance), and \(\alpha _k\) is chosen by the standard backtracking line search to satisfy the following condition:
The results are presented in Fig. . We see that the damped Newton method is sensitive to the choice of perturbation parameter \(\tau \), while the method with Gradient regularization shows the most robust and efficient performance for all problem instances.
8 Discussion
In this paper, we have analyzed the global behaviour of the Newton method with a general Bregman regularizer, whose regularization parameter is chosen to be proportional to the square root of the current gradient norm.
We demonstrated that our scheme works with the composite form of the convex optimization problem. For the Euclidean norms, this approach can be seen as a relaxation of the Cubically regularized Newton method, achieving the same global convergence rates.
A significant advantage of the gradient regularization scheme is the simpler structure of the subproblem, which does not need the auxiliary one-dimensional minimizations that are required in the cubic regularization. As a consequence, the subproblem becomes suitable for large-scale problems, where the Conjugate Gradient methods can be employed.
It is a favorable feature of our methods that regularization parameter always depends on the current iterate only. Therefore, it seems to be convenient for the use in stochastic optimization. We believe that this property could fit well with the broad family of stochastic second-order methods based on the Cubic regularization (see [4, 12, 14]). We keep the development of such schemes for further investigation.
Another important direction is an extension of our results to nonconvex optimization problems. It seems to be a challenging question since our current analysis heavily relies on positive semidefiniteness of the Hessian. It is needed to ensure a bound for the step length (see Lemma 1). Therefore, to tackle nonconvex problems, some modifications of our analysis have to be made.
Notes
When this paper was already finished, we discovered that this idea was recently proposed by K. Mishchenko [16] for solving unconstrained minimization problem with smooth objective. As compared to his work, our main advances consist in the usage of Bregman distances, composite form of optimization problem, linear rate of convergence for uniformly convex functions, and developments of accelerated variant of the method.
References
[1] Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization (2020). SIAM, Philadelphia (2021)
[2] Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016)
[3] Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
[4] Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018)
[5] Dennis, J.E., Jr., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia (1996)
[6] Doikov, N.: New second-order and tensor methods in convex optimization. Ph.D. thesis, Université catholique de Louvain (2021)
[7] Doikov, N., Nesterov, Y.: Contracting proximal methods for smooth convex optimization. SIAM J. Optim. 30(4), 3146–3169 (2020)
[8] Doikov, N., Nesterov, Y.: Minimizing uniformly convex functions by cubic regularization of Newton method. J. Optim. Theory Appl. 1–23 (2021)
[9] Fan, J.Y., Ai, W.B., Zhang, Q.Y.: A line search and trust region algorithm with trust region radius converging to zero. J. Comput. Math. 865–872 (2004)
[10] Grapiglia, G.N., Nesterov, Y.: Tensor methods for finding approximate stationary points of convex functions. Optim. Methods Softw. 1–34 (2020)
[11] Gratton, S., Toint, P.L.: Adaptive regularization minimization algorithms with non-smooth norms and Euclidean curvature. arXiv preprint arXiv:2105.07765 (2021)
[12] Hanzely, F., Doikov, N., Richtárik, P., Nesterov, Y.: Stochastic subspace cubic Newton method. In: International Conference on Machine Learning, pp. 4027–4038. PMLR (2020)
[13] Khanh, P.D., Mordukhovich, B., Phat, V.T., Tran, B.D.: Globally convergent coderivative-based generalized Newton methods in nonsmooth optimization. arXiv preprint arXiv:2109.02093 (2021)
[14] Kovalev, D., Mishchenko, K., Richtárik, P.: Stochastic Newton and cubic Newton methods with simple local linear-quadratic rates. arXiv preprint arXiv:1912.01597 (2019)
[15] Martínez, J.M., Raydan, M.: Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Glob. Optim. 68(2), 367–385 (2017)
[16] Mishchenko, K.: Regularized Newton method with global O(1/k\(^{2}\)) convergence. In: Proceedings of the Beyond First-Order Methods in ML Systems Workshop at the 38th International Conference on Machine Learning, vol. 139. PMLR (2021)
[17] Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
[18] Nesterov, Y.: Lectures on Convex Optimization, vol. 137. Springer, Berlin (2018)
[19] Nesterov, Y., Polyak, B.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006)
[20] Polyak, R.: Regularized Newton method for unconstrained convex optimization. Math. Program. 120(1), 125–145 (2009)
Acknowledgements
We are very thankful to Associate Editor and two anonymous referees for valuable comments that significantly improved the initial version of this paper.
This paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 788368). It was also supported by Multidisciplinary Institute in Artificial intelligence MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Doikov, N., Nesterov, Y. Gradient regularization of Newton method with Bregman distances. Math. Program. (2023). https://doi.org/10.1007/s10107-023-01943-7
DOI: https://doi.org/10.1007/s10107-023-01943-7
Keywords
- Newton method
- Regularization
- Convex optimization
- Global complexity bounds
- Large-scale optimization
Mathematics Subject Classification
- 49M15
- 49M37
- 58C15
- 90C25
- 90C30