Abstract
This paper studies the numerical solution of strictly convex unconstrained optimization problems by linesearch Newton-CG methods. We focus on methods employing inexact evaluations of the objective function and inexact and possibly random gradient and Hessian estimates. The derivative estimates are not required to satisfy suitable accuracy requirements at each iteration but with sufficiently high probability. Concerning the evaluation of the objective function we first assume that the noise in the objective function evaluations is bounded in absolute value. Then, we analyze the case where the error satisfies prescribed dynamic accuracy requirements. We provide for both cases a complexity analysis and derive expected iteration complexity bounds. We finally focus on the specific case of finite-sum minimization which is typical of machine learning applications.
1 Introduction
In this paper we consider globally convergent Inexact Newton methods for solving the strictly convex unconstrained optimization problem
\[ \min _{x\in {\mathbb {R}}^n} f(x). \qquad (1) \]
We focus on the Newton method where the linear systems are solved by the Conjugate Gradient (CG) method [25], usually denoted as Newton-CG method, and on the enhancement of its convergence properties by means of Armijo-type conditions.
The literature on globally convergent Newton-CG methods is well established as long as the gradient and the Hessian matrix are computed exactly or approximated sufficiently accurately in a deterministic way, see e.g., [15, 20, 31]. On the other hand, the research is currently very active for problems with inexact information on f and its derivatives and possibly such that the accuracy cannot be controlled in a deterministic way [1,2,3,4,5,6,7,8,9,10,11, 13, 14, 17, 19, 22, 23, 26, 32, 33, 35].
This work belongs to this recent stream of research and addresses the solution of (1) when the objective function f is computed with noise and the gradient and Hessian estimates are random. Importantly, the derivative estimates are required to satisfy suitable accuracy requirements only with sufficiently high probability, rather than at every iteration. Concerning the evaluation of f we cover two cases: estimates of f subject to noise that is bounded in absolute value, and estimates of f subject to a controllable error, i.e., computable with a prescribed dynamic accuracy. Such a class of problems has been considered in [4, 10] and our contribution consists in their solution by linesearch Newton-CG methods; to our knowledge, this case has not been addressed in the literature. We provide two linesearch Newton-CG methods suitable for the class of problems specified above and provide bounds on the expected number of iterations required to reach a desired level of accuracy in the optimality gap.
The paper is organized as follows. In Sect. 2 we give preliminaries on Newton-CG and on the problems considered. In Sect. 3 we present and study a linesearch Newton-CG algorithm where function estimates are subject to a prefixed deterministic noise. In Sect. 4 we propose and study a linesearch Newton-CG algorithm where function estimates have controllable accuracy. In Sect. 5 we consider the specific case where f is a finite-sum which is typical of machine learning applications and compare our approach with the Inexact Newton methods specially designed for this class of problems given in [7, 12, 33].
In the rest of the paper \(\Vert \cdot \Vert \) denotes the 2-norm. Given symmetric matrices A and B, \(A\preceq B\) means that \(B-A\) is positive semidefinite.
2 Our setting
In this section we provide preliminaries on the solution of problem (1) and the assumptions made. Our methods belong to the class of the Inexact Newton methods [18] combined with a linesearch strategy for enhancing convergence properties. A key feature is that function, gradient and Hessian evaluations are approximated and the errors in such approximations are either deterministic or stochastic, as specified below.
The Inexact Newton methods considered here are iterative processes where, given the current iterate \(x_k\), a random approximation \(g_k\) to \(\nabla f(x_k)\) and a random approximation \(H_k\) to \(\nabla ^2 f(x_k)\), the trial step \(s_k\) satisfies
\[ \Vert H_k s_k + g_k \Vert \le \eta _k \Vert g_k \Vert \qquad (2) \]
for some \(\eta _k\in (0,{\bar{\eta }}), 0<{\bar{\eta }}<1\), named forcing term.
With \(s_k\) and a trial steplength \(t_k\) at hand, some suitable sufficient decrease Armijo condition is tested on \(x_k+t_k s_k\). Standard linesearch strategies are applied using the true function f. On the other hand, here we assume that the evaluation of f is subject to an error.
If \(H_k\) is positive definite we can solve inexactly the linear systems \( H_k s=- g_k\) using the Conjugate Gradient (CG) method [25]. The resulting method is denoted as Newton-CG. If the initial guess for CG is the null vector, the following properties hold.
Lemma 2.1
Suppose that \(H_k\) is symmetric positive definite and \( s_k \) is the vector in (2) obtained by applying the CG method with null initial guess to the linear system \( H_k s=- g_k\). Let \(0<\lambda _1\le \lambda _n\) be such that
Then, there exist constants \(\kappa _1, \kappa _2, \beta >0\), such that:
which satisfy
As a consequence
Proof
Lemma 7 in [21] guarantees that any step \({\hat{s}}\) returned by the CG method applied to \(H_ks=-g_k\) with null initial guess satisfies \({\hat{s}}^TH_k{\hat{s}}=-{\hat{s}}^Tg_k\). Then, it holds
and by (3) we have
which provides the lower bound on \(\beta \) in (5).
By (11) it also follows that
which provides the upper bound on \(\kappa _2\) in (5).
Moreover, using (2)
and consequently
which provides the lower bound on \(\kappa _1\) in (5).
Inequalities (6)–(9) are direct consequences of (4) and (5). \(\square \)
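The CG property used in the proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: it runs plain CG on \(H_k s=-g_k\) from the null initial guess and stops as soon as the inexact Newton residual test \(\Vert H_k s + g_k\Vert \le \eta _k \Vert g_k\Vert \) holds. The helper names (`matvec`, `dot`, `cg_step`) and the 3-by-3 test data are hypothetical and chosen only for illustration.

```python
import math

def matvec(H, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_step(H, g, eta, max_iter=None):
    """Approximately solve H s = -g by CG started from s = 0.

    Iterations stop once ||H s + g|| <= eta * ||g|| (tested on the
    recursively updated residual) or after max_iter iterations.
    """
    n = len(g)
    if max_iter is None:
        max_iter = 10 * n
    s = [0.0] * n
    r = [-gi for gi in g]            # residual -g - H s at s = 0
    p = r[:]                         # first search direction
    rr = dot(r, r)
    tol2 = (eta ** 2) * dot(g, g)
    for _ in range(max_iter):
        if rr <= tol2:
            break
        Hp = matvec(H, p)
        alpha = rr / dot(p, Hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

# Hypothetical symmetric positive definite test data.
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
g = [1.0, -2.0, 3.0]
s = cg_step(H, g, eta=0.5)
curvature = dot(s, matvec(H, s))     # s^T H s
slope = dot(s, g)                    # s^T g
```

On this data `curvature + slope` vanishes up to rounding, and `slope` is negative, so the computed step is a descent direction for the model, in line with the identity \({\hat{s}}^TH_k{\hat{s}}=-{\hat{s}}^Tg_k\) quoted from [21].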
2.1 Assumptions
We introduce the assumptions on the problem (1) and on the approximate evaluations of functions, gradients and Hessians.
Assumption 2.2
(smoothness and strong convexity of f) The function f is twice continuously differentiable and there exist some \(\lambda _n\ge \lambda _1 > 0 \) such that the Hessian matrix \(\nabla ^2 f(x)\) satisfies
As a consequence, f is strongly convex with constant \(\lambda _1\), i.e.,
and the gradient of f is Lipschitz-continuous with constant \(\lambda _n\), i.e.,
Further, letting \( x^* \) be the unique minimizer of the function f, for all \( x \in {\mathbb {R}}^n \) we have
and
see [29, Theorem 2.10].
As for approximated evaluations of the objective function, we consider two possible cases. The first one is such that the value f(x) is approximated with a value \({{\tilde{f}}}(x)\) and the corresponding error is not controllable but its upper bound \(\varepsilon _f\) is known.
Assumption 2.3
(boundedness of noise in f) There exists a positive scalar \(\varepsilon _f\) such that
The second case concerns a controllable error between f(x) and \({{\tilde{f}}}(x)\) at the current iteration \(x_k\) and at the trial iteration \(x_k+t_ks_k\).
Assumption 2.4
(controllable noise in f) For all \(k>0\) and some given positive \(\theta \)
The methods we are dealing with are globalized Inexact Newton methods employing random estimates \(g_k\) and \(H_k\) of the gradient and the Hessian and noisy values of the objective function. Hence, they generate a stochastic process. We denote the random variables of our process as follows: the gradient estimator \( G_k\), the Hessian estimator \({\mathcal {H}}_k\), the step size parameter \( \mathcal T_k\), the search direction \({{{\mathcal {S}}}}_k\), the iterate \(X_k\). Their realizations are denoted as \(g_k = G_k(\omega _k), H_k = \mathcal H_k(\omega _k), t_k = {\mathcal {T}}_k(\omega _k)\), \(s_k = {{{\mathcal {S}}}}_k(\omega _k)\) and \(x_k=X_k(\omega _k)\), respectively, with \(\omega _k\) taken from a proper probability space. For brevity we will omit \(\omega _k\) in the following. We let \({\mathcal {E}}_{k-1}\) denote all noise history up to iteration \(k - 1\) and we include \({\mathcal {E}}_{k-1}\) in the algorithmic history for completeness. We use \({\mathcal {F}}_{k-1} = \sigma (G_0,\ldots , G_{k-1}, \mathcal H_0,\ldots ,{\mathcal {H}}_{k-1},{\mathcal {E}}_{k-1})\) to denote the \(\sigma \)-algebra generated by \(G_0, \ldots , G_{k-1}, \mathcal H_0,\ldots ,{\mathcal {H}}_{k-1}\) and \({\mathcal {E}}_{k-1}\), up to the beginning of iteration k.
We assume that the random gradient estimators \(G_k\) are \((1-\delta _g)\)-probabilistically sufficiently accurate.
Assumption 2.5
(gradient estimate) The estimator \(G_k\) is \((1-\delta _g)\)-probabilistically sufficiently accurate in the sense that the indicator variable
satisfies the submartingale condition
Iteration k is called true (iteration) if \(I_k=1\), false otherwise. Trivially, if the k-th iteration is true then the triangle inequality implies
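The notion of a true iteration can be illustrated with a short check. The sketch below assumes that the accuracy requirement behind \(I_k\) is the one invoked in the proof of Lemma 3.1, namely \(\Vert \nabla f(x_k)- g_k\Vert \le t_k \eta _k \Vert g_k\Vert \); the function names and the numerical data are hypothetical. In practice \(\nabla f(x_k)\) is not available, so the test is only a diagnostic for experiments in which the exact gradient is known.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def is_true_iteration(grad, g, t, eta):
    """Return True when ||grad - g|| <= t * eta * ||g||.

    grad is the exact gradient, g its estimate, t the steplength
    and eta the forcing term of the current iteration.
    """
    diff = [a - b for a, b in zip(grad, g)]
    return norm(diff) <= t * eta * norm(g)

# Hypothetical data: a small perturbation of the exact gradient.
grad = [1.0, 2.0]
g = [1.05, 1.95]
t, eta = 1.0, 0.1
accurate = is_true_iteration(grad, g, t, eta)
# Triangle inequality on a true iteration: ||grad|| <= (1 + t*eta) * ||g||.
bound_holds = norm(grad) <= (1.0 + t * eta) * norm(g)
```

Whenever `accurate` is true, `bound_holds` is true as well, since \(\Vert \nabla f(x_k)\Vert \le \Vert g_k\Vert + \Vert \nabla f(x_k)-g_k\Vert \).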
Finally, at each iteration and for any realization, the approximation \(H_k\) is supposed to be positive definite.
Assumption 2.6
For all \(k\ge 0\), \(H_k\) is symmetric positive definite and
with \(\lambda _1, \, {\lambda _n}\) as in (14).
We remark that, for strictly convex f, (14) and (23) hold with suitable choices of the scalars \(\lambda _1, \, \lambda _n\), as long as the sequence \(\{H_k\}\) is symmetric positive definite and has eigenvalues uniformly bounded from below and above.
Some comments on Assumptions 2.3, 2.4, 2.5 and 2.6 are in order. They appear in a series of papers on unconstrained optimization where the evaluations of function, gradient and Hessian are inexact, either with a controllable or a random noise. Controlling the noise in a deterministic way, as in Assumptions 2.3 and 2.4, is a realistic request in applications such as those where the accuracy of f-values can be enforced by the magnitude of some discretization parameters or f is evaluated in variable precision or approximated by using smoothing operators [24, 27, 28, 30]. Probabilistically sufficiently accurate gradients, as in Assumption 2.5, occur when the gradients are estimated by finite differences and some computation fails to complete, in derivative-free optimization, or when the gradients are estimated by sample average approximation methods [1, 6, 16, 17]. Finally, Assumption 2.6 amounts to building a convex random model and is trivially enforced if f is the sum of strongly convex functions and the Hessian is estimated by sample average approximation methods. In the literature, unconstrained optimization with inexact function and derivative evaluations covers many cases: exact function and gradient evaluations and possibly random Hessian [3, 13, 35], exact function and random gradient and Hessian [1, 2, 16], approximated function and random gradient and Hessian [4, 10], random function, gradient and Hessian [5, 8, 11, 17, 32]. In the class of Inexact Newton methods with random models we mention [7, 12, 14, 19, 26].
3 Bounded noise on f
In this section we present and analyze an Inexact Newton method with line-search where the function evaluation is noisy in the sense of Assumption 2.3.
At iteration k, given \(x_k\) and the steplength \(t_k\), a non-monotone Armijo condition given in [10] is used. It employs the known upper bound \(\varepsilon _f\) introduced in Assumption 2.3 and has the form
with \(c\in (0,1)\). If \(x_k+t_ks_k\) satisfies (24) we say that the iteration is successful, we accept the step and increase the step-length \(t_k\) for the next iteration. Otherwise the step is rejected and the step-length \(t_k\) is reduced; at the next iteration \(g_k\) and \(H_k\) are supposed to be computed from scratch. Our procedure belongs to the framework given in Section 4.2 of [10] and it is sketched in Algorithm 3.1.
Similarly to [10, 16], Algorithm 3.1 generates a stochastic process. Given \(x_{k}\) and \(t_{k}\), the iterate \(x_{k+1}\) is fully determined by \(g_{k}\), \(H_k\) and the noise in the function value estimation during iteration k.
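The accept/reject logic of one iteration can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 3.1: the relaxed Armijo test is taken in the form \({{\tilde{f}}}(x_k+t_ks_k)\le {{\tilde{f}}}(x_k)+c\,t_k g_k^Ts_k+2\varepsilon _f\) used in the framework of [10], the steplength is updated as \(t_{k+1}=\min \{t_{\max },\tau ^{-1}t_k\}\) on success and \(t_{k+1}=\tau t_k\) otherwise, and all names, default values and the noisy quadratic test function are hypothetical.

```python
import math

def noisy_linesearch_iteration(f_noisy, x, s, g, t, eps_f,
                               c=0.1, tau=0.5, t_max=1.0):
    """Test the relaxed Armijo condition and update iterate and steplength.

    Returns (next_x, next_t, successful). The relaxation 2*eps_f and the
    cap t_max are assumptions of this sketch.
    """
    gs = sum(gi * si for gi, si in zip(g, s))          # g^T s
    trial = [xi + t * si for xi, si in zip(x, s)]
    successful = f_noisy(trial) <= f_noisy(x) + c * t * gs + 2.0 * eps_f
    if successful:
        return trial, min(t_max, t / tau), True        # accept, enlarge t
    return list(x), tau * t, False                     # reject, shrink t

EPS_F = 0.01

def f_noisy(x):
    """Hypothetical objective 0.5*||x||^2 with noise bounded by EPS_F."""
    exact = 0.5 * sum(xi * xi for xi in x)
    return exact + EPS_F * math.cos(1000.0 * x[0])

x0 = [1.0, -2.0]
g0 = list(x0)                      # exact gradient of 0.5*||x||^2
s0 = [-gi for gi in g0]            # Newton step with identity Hessian
x1, t1, ok1 = noisy_linesearch_iteration(f_noisy, x0, s0, g0, 1.0, EPS_F)

s_bad = [-10.0 * gi for gi in g0]  # overly long step, expected to be rejected
x2, t2, ok2 = noisy_linesearch_iteration(f_noisy, x0, s_bad, g0, 1.0, EPS_F)
```

In the first call the full step reaches the minimizer and is accepted; in the second the step overshoots, the iterate is left unchanged and the steplength is halved.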
Concerning the well-definedness of the linesearch strategy (24), we now prove that if the iteration is true and \(t_k\) is small enough, the linesearch condition is satisfied.
Lemma 3.1
Suppose that Assumptions 2.2, 2.3, 2.5 and 2.6 hold. Suppose that iteration k is true and consider any realization of Algorithm 3.1. Then the iteration is successful whenever \(t_k \le \bar{t}= \frac{2\beta \kappa _1(1-c)}{\kappa _1\lambda _n+2}\).
Proof
Let k be an arbitrary iteration. Inequalities (15) and (18) imply, using the standard arguments for functions with bounded Hessians,
Since iteration k is true by assumption then \(\Vert \nabla f(x_k)- g_k\Vert \le t_k \eta _k \Vert g_k\Vert \) holds, and using Lemma 2.1 we obtain
Then, the linesearch condition (24) is clearly enforced whenever
which gives
Using (4) we have \(-(1-c) g_k^Ts_k \ge (1-c) \beta \Vert s_k \Vert ^2 \). Since \(\eta _k<{\bar{\eta }}<1\), if
then (24) holds and this yields the thesis. \(\square \)
3.1 Complexity analysis of the stochastic process
In this section we carry out the convergence analysis of Algorithm 3.1. To this end we provide a bound on the expected number of iterations that the algorithm takes before it achieves a desired level of accuracy in the optimality gap \(f(x_k)-f^*\) with \(f^*=f(x^*)\) being the minimum value attained by f. This number of iterations is defined formally below.
Definition 3.2
Let \(x^*\) be the global minimizer of f and \(f^* = f(x^*)\). Given some \(\epsilon >0\), \(N_\epsilon \) is the number of iterations required until \(f(x_k) - f^* \le \epsilon \) occurs for the first time.
The number of iterations \(N_\epsilon \) is a random variable and it can be defined as the hitting time for our stochastic process. Indeed it has the property \(\sigma ({\mathbbm {1}} \{N_\epsilon > k\}) \subset {\mathcal {F}}_{k-1}\).
Following the notation introduced in Sect. 2 we let \(X_k\), \(k\ge 0\), be the random variable with realization \(x_k=X_k(\omega _k)\) and consider the following measure of progress towards optimality:
Further, we let
be an upper bound for \(Z_k\) for any \(k<N_\epsilon \). We denote with \(z_k=Z_k(\omega _k)\) a realization of the random quantity \(Z_k\).
A theoretical framework for analyzing a generic line search with noise has been developed in [10]. Under a suitable set of conditions, it provides the expected value for \(N_\epsilon \). We state a result from [10] and will exploit it for our algorithm.
Theorem 3.3
Suppose that Assumptions 2.2, 2.3, 2.5, 2.6 hold. Let \(z_k\) be a realization of \(Z_k\) in (26) and suppose that there exist a constant \({\bar{t}} > 0\), a nondecreasing function \(h(t) : {\mathbb {R}}^+\rightarrow {\mathbb {R}}\), which satisfies \(h(t) > 0\) for any \(t \in (0,t_{\max }]\), and a nondecreasing function \(r(\varepsilon _f ) : {\mathbb {R}}\rightarrow {\mathbb {R}}\), which satisfies \(r(\varepsilon _f ) \ge 0\) for any \(\varepsilon _f \ge 0\), such that for any realization of Algorithm 3.1 the following holds for all \(k < N_\epsilon \):
(i) If iteration k is true and successful, then \(z_{k+1} \ge z_k +h(t_k)- r(\varepsilon _f )\).
(ii) If \(t_k \le {\bar{t}}\) and iteration k is true then iteration k is also successful, which implies \(t_{k+1} = \tau ^{-1}t_k\).
(iii) \(z_{k+1} \ge z_k - r(\varepsilon _f )\) for all successful iterations k and \(z_{k+1} \ge z_k\) for all unsuccessful iterations k.
(iv) The ratio \(r(\varepsilon _f )/h({\bar{t}})\) is bounded from above by some \(\gamma \in (0, 1)\).
Then under the condition that the probability \(\delta _g\) in Assumption 2.5 is such that \(\delta _g < \frac{1}{2} - \frac{\sqrt{\gamma }}{2} \), the stopping time \(N_\epsilon \) is bounded in expectation as follows
Proof
See [10, Assumption 3.3 and Theorem 3.13]. \(\square \)
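The requirement \(\delta _g < \frac{1}{2} - \frac{\sqrt{\gamma }}{2}\) of Theorem 3.3 links the admissible failure probability of the gradient estimator to the noise-to-progress ratio \(\gamma \). A small sketch of this threshold follows; the function name is hypothetical.

```python
import math

def max_delta_g(gamma):
    """Supremum of admissible delta_g, i.e. 1/2 - sqrt(gamma)/2.

    gamma is the upper bound on r(eps_f)/h(t_bar) and must lie in (0, 1).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    return 0.5 - math.sqrt(gamma) / 2.0

threshold = max_delta_g(0.25)
```

For \(\gamma =0.25\) the gradient estimates may fail with probability strictly below 0.25. As \(\gamma \) approaches 1 the threshold shrinks to zero, while for \(\gamma \) close to zero it approaches \(\frac{1}{2}\).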
We show that our algorithm satisfies the assumptions in Theorem 3.3 if the magnitude of \(\epsilon \) fulfills the following condition.
Assumption 3.4
Let \(c\in (0,1)\) as in (24), \(\beta \), \(\kappa _1\) as in Lemma 2.1, \(\lambda _1\) as in Assumption 2.2, \({\bar{t}}\) as in Lemma 3.1, \(t_{\max }\) as in Algorithm 3.1. Assume that \(\epsilon \) in Definition 3.2 is such that
where \(M=\frac{c\beta \kappa _1^2\lambda _1 {\bar{t}}}{(1+t_{\max })^2}\), for some \(\gamma \in (0,1)\) such that \((1-M)^{-\gamma }<2\).
Note that \(M\in (0,1)\) due to the definition of \({\bar{t}}\) in Lemma 3.1 and the smaller M is, the larger is \(\epsilon \) with respect to \(\varepsilon _f\).
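The constants in Assumption 3.4 can be explored numerically. The sketch below evaluates \(M=\frac{c\beta \kappa _1^2\lambda _1 {\bar{t}}}{(1+t_{\max })^2}\) and rewrites the requirement \((1-M)^{-\gamma }<2\) as \(\gamma < \log 2/(-\log (1-M))\), combined with \(\gamma \in (0,1)\). The function names and the parameter values are hypothetical and are not claimed to satisfy the relations (5) of Lemma 2.1.

```python
import math

def progress_constant(c, beta, kappa1, lambda_1, t_bar, t_max):
    """M = c * beta * kappa1^2 * lambda_1 * t_bar / (1 + t_max)^2."""
    return c * beta * kappa1 ** 2 * lambda_1 * t_bar / (1.0 + t_max) ** 2

def gamma_upper_limit(M):
    """Supremum of gamma in (0, 1) satisfying (1 - M)^(-gamma) < 2."""
    if not 0.0 < M < 1.0:
        raise ValueError("M must lie in (0, 1)")
    return min(1.0, math.log(2.0) / -math.log(1.0 - M))

# Hypothetical parameter values, for illustration only.
M = progress_constant(c=0.25, beta=0.5, kappa1=0.5, lambda_1=1.0,
                      t_bar=0.09375, t_max=1.0)
limit = gamma_upper_limit(M)
```

When \(M\le \frac{1}{2}\) the condition \((1-M)^{-\gamma }<2\) holds for every \(\gamma \in (0,1)\), so it restricts \(\gamma \) only for larger values of M.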
First, we provide a relation between \(z_k\) and \(z_{k+1}\) of the form specified in item (i), Theorem 3.3.
Lemma 3.5
Suppose that Assumptions 2.2, 2.3, 2.5, 2.6 and 3.4 hold. Consider any realization of Algorithm 3.1. If the k-th iterate is true and successful, then
whenever \(k<N_{\epsilon }\).
Proof
By (17), \(f(x_k)-f^* \le \frac{1}{2\lambda _1}\Vert \nabla f(x_k) \Vert ^2\). Using (22)
Combining condition (18), (24) and Lemma 2.1 it holds
and thus, using (31),
Then it holds
We define \(\Delta _k^f = f(x_k) - f^*\). Because of \(f(x_k)-f^*>\epsilon \), we have
where the second inequality holds thanks to Assumption 3.4, because \(4\varepsilon _f <\epsilon \) and the last one holds since \(t_k\le t_{\max }\) and \(\eta _k< 1\).
Notice that since \(\left( 1+ \frac{4\varepsilon _f}{\epsilon }\right) >~0\), \(\Delta _k^f > 0\) and \(\Delta _{k+1}^f \ge 0\), it holds \(\left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) \ge ~0\). Now, taking the inverse and then the \(\log \) of both sides, adding \(\log \Delta _0^f\), we have
which completes the proof.
\(\square \)
The next lemma analyzes item (iii) of Theorem 3.3.
Lemma 3.6
Suppose that Assumptions 2.2, 2.3, 2.5 and 2.6 hold. Consider any realization of Algorithm 3.1. For every iteration that is false and successful, we have
Moreover \(z_{k+1}=z_k\) for any unsuccessful iteration.
Proof
For every false and successful iteration, using (18), (24) and (4) we have
thus, because of \(f(x_k)-f^*>\epsilon \),
So it holds \(\Delta _{k+1}^f \le \left( 1+\frac{4\varepsilon _f}{\epsilon } \right) \Delta _k^f\). Now taking the inverse and then the log of both sides, adding \(\log \Delta _0^f\) we have
which completes the first part of the proof. Finally, for any unsuccessful iteration \(z_{k+1}=z_k\) follows by Step 3 of Algorithm 3.1 that provides \(x_{k+1}=x_k\) and hence \(f(x_{k+1})=f(x_k)\). \(\square \)
We can now summarize our results. First, note that \(\left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) \ge 0\) for all \(t_k\in [0,t_{\max }]\), due to (6) and (8). Second, let
It is easy to see that h(t) is monotone and nondecreasing if \(t\in [0,t_{\max }]\), as required in Theorem 3.3.
Combining Lemmas 3.1, 3.5 and 3.6, we have that for any realization of Algorithm 3.1 and \(k< N_\epsilon \) with \(\epsilon \) as in Assumption 3.4:
(i) (Lemma 3.5) If iteration k is true and successful, then \(z_{k+1} \ge z_k +h(t_k)- r(\varepsilon _f )\).
(ii) (Lemma 3.1) If \(t_k \le {\bar{t}}\) and iteration k is true then iteration k is also successful, which implies \(t_{k+1} = \tau ^{-1}t_k\).
(iii) (Lemma 3.6) \(z_{k+1} \ge z_k - r(\varepsilon _f )\) for all successful iterations k and \(z_{k+1} = z_k\) for all unsuccessful iterations k.
(iv) (Assumption 3.4) The ratio \(r(\varepsilon _f )/h({\bar{t}})\) is bounded from above by some \(\gamma \in ~(0, 1)\).
Hence, we can use Theorem 3.3 and get the following bound on \({{\mathbb {E}}}[N_\epsilon ]\),
with M given in Assumption 3.4. This result is valid under Assumption 3.4, namely for sufficiently large values of \(\epsilon \). The fact that \(\epsilon \) cannot be arbitrarily small is consistent with the presence of noise \(\varepsilon _f\) in f-evaluations. Trivially, if \(\varepsilon _f=0\) then the optimality gap \(f(x_k)-f^*\) can be made arbitrarily small.
4 Decreasing noise on f
In this section we present an Inexact Newton algorithm suitable to the case where f-evaluations can be performed with adaptive accuracy. Letting \(c \in (0,\frac{1}{2})\), we use the linesearch condition
where \({{\tilde{f}}}(x_k)\) and \({{\tilde{f}}}(x_k + t_k s_k)\) satisfy Assumption 2.4 with \(\theta < \frac{c}{2}\). In fact, (34) has the form of the classical Armijo condition but the true f is replaced by the approximation \({{\tilde{f}}}\).
The resulting algorithm is given below.
The following lemma shows that a successful iteration is guaranteed whenever it is true and \(t_k\) is sufficiently small.
Lemma 4.1
Suppose that Assumptions 2.2, 2.4 with \(\theta <\frac{c}{ 2}\), 2.5 and 2.6 hold. Suppose that iteration k is true and consider any realization of Algorithm 4.1. Then the iteration is successful whenever \(t_k\le {\bar{t}} = \frac{2\kappa _1\beta }{\kappa _1\lambda _n+2}(1-c-2\theta )\).
Proof
Using the same arguments as in Lemma 3.1, using (19) and (34), rather than (18) and (24) we obtain
The linesearch condition (34) is clearly enforced whenever
which gives
Note that \(1-c-2\theta >0\) by \(c\in (0,1/2)\) and \(\theta \in (0, \frac{c}{2})\). Using (4) we have \(-(1-c-2\theta ) g_k^Ts_k \ge (1-c-2\theta ) \beta \Vert s_k \Vert ^2 \). Then, since \(\eta _k<\bar{\eta }<1\), (34) holds if
and this yields the thesis. \(\square \)
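The steplength thresholds of Lemma 3.1 and Lemma 4.1 share the same structure: setting \(\theta =0\) in \({\bar{t}} = \frac{2\kappa _1\beta }{\kappa _1\lambda _n+2}(1-c-2\theta )\) recovers the threshold of Lemma 3.1. The sketch below evaluates both; the function name and the parameter values are hypothetical and are not claimed to satisfy the relations (5) of Lemma 2.1.

```python
def t_bar(beta, kappa1, lambda_n, c, theta=0.0):
    """Threshold 2*kappa1*beta*(1 - c - 2*theta) / (kappa1*lambda_n + 2).

    theta = 0 gives the bounded-noise threshold of Lemma 3.1; theta > 0
    gives the controllable-noise threshold of Lemma 4.1.
    """
    margin = 1.0 - c - 2.0 * theta
    if margin <= 0.0:
        raise ValueError("need 1 - c - 2*theta > 0")
    return 2.0 * kappa1 * beta * margin / (kappa1 * lambda_n + 2.0)

# Hypothetical constants, for illustration only.
t_bounded = t_bar(beta=0.5, kappa1=0.5, lambda_n=4.0, c=0.25)
t_controllable = t_bar(beta=0.5, kappa1=0.5, lambda_n=4.0, c=0.25, theta=0.1)
```

With these values the thresholds are 0.09375 and 0.06875, respectively: a larger relative error \(\theta \) in the function values reduces the steplength below which true iterations are guaranteed to be successful.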
4.1 Complexity analysis of the stochastic process
The behaviour of the method is studied analyzing the hitting time \(N_\epsilon \) in Definition 3.2. In particular, we first show the following two results on the realization \(z_k\) of the variable \(Z_k\) in (26).
Lemma 4.2
Suppose that Assumptions 2.2, 2.4 with \(\theta <\frac{c}{ 2}\), 2.5 and 2.6 hold. If the k-th iterate of Algorithm 4.1 is true and successful, for any realization of the Algorithm 4.1 we have
whenever \(k<N_{\epsilon }\).
Proof
Using the same arguments as in Lemma 3.5, using (19) and (34), rather than (18) and (24) we obtain
where the second inequality comes from (4). Thus
and using (31) we get
Now proceeding as in Lemma 3.5 we have the thesis. \(\square \)
Lemma 4.3
Suppose that Assumptions 2.2, 2.4 with \(\theta <\frac{c}{ 2}\), 2.5 and 2.6 hold. For any realization of Algorithm 4.1 we have
if the iteration k is false and successful,
if the iteration k is unsuccessful.
Proof
For every false and successful iteration, using (19) and (34), we have
and in case of unsuccessful iteration Step 4 of the algorithm provides \(x_{k+1}=x_k\). Then, due to the definition of \(Z_k\) in (26) the thesis follows.
\(\square \)
Now we can state the main result on the expected value for the hitting time.
Theorem 4.4
Suppose that Assumptions 2.2, 2.4 with \(\theta <\frac{c}{2}\), 2.5 and 2.6 hold and let \(\bar{t}\) be given in Lemma 4.1. Then, under the condition that the probability \(\delta _g\) in Assumption 2.5 is such that \(\delta _g < \frac{1}{2} \), the stopping time \(N_\epsilon \) is bounded in expectation as follows
with \(M=\frac{2(c-2\theta )\beta \kappa _1^2\lambda _1 \bar{t}}{(1+t_{\max })^2}\).
Proof
Let
and note that h(t) is non decreasing for \(t\in [0,t_{\max }]\) and that \(h(t)> 0\) for \(t\in [0,t_{\max }]\). For any realization \(z_k\) of \(Z_k\) in (26) of Algorithm 4.1 the following hold for all \(k < N_\epsilon \):
(i) If iteration k is true and successful, then \(z_{k+1} \ge z_k +h(t_k)\) by Lemma 4.2.
(ii) If \(t_k \le {\bar{t}}\) and iteration k is true then iteration k is also successful, which implies \(t_{k+1} = \tau ^{-1}t_k\) by Lemma 4.1.
(iii) \(z_{k+1} \ge z_k \) for all successful iterations k (\(z_{k+1} = z_k\) for all unsuccessful iterations k), by Lemma 4.3.
Moreover, our stochastic process \(\{{{\mathcal {T}}_k}, Z_k\}\) obeys the expressions below. By Lemma 4.1 and the definition of Algorithm 4.1 the update of the random variable \(\mathcal T_k \) such that \(t_k = {\mathcal {T}}_k(\omega _k)\) is
where the event \(I_k\) is defined in (20). By Lemmas 4.1, 4.2 and 4.3 the random variable \(Z_k \) obeys the expression
Then Lemma 2.2–Lemma 2.7 and Theorem 2.1 in [16] hold, which gives the thesis. \(\square \)
4.2 Local convergence
We conclude our study analyzing the local behaviour of the Newton-CG method employing gradient estimates \((1-\delta _g)\)-probabilistically sufficiently accurate, i.e. satisfying Assumption 2.5 and Hessian estimates satisfying the following assumption.
Assumption 4.5
The Hessian of the objective function f is Lipschitz-continuous with constant \(L_H>0\),
Given a constant \(C>0\), the Hessian estimator is \((1-\delta _H)\)-probabilistically sufficiently accurate in the sense that the indicator variable
satisfies the submartingale condition
We let \(t_{\max }=1\), so that the maximum step-size gives the full CG step \(s_k\).
The following lemma shows that if the full CG step \(s_k\) is accepted then the error linearly decreases with a certain probability. Further, the same occurrence over \(\ell \) successive iterations is analyzed.
Lemma 4.6
Suppose that Assumptions 2.2, 2.4 with \(\theta <\frac{c}{2}\), 2.5, 2.6 and 4.5 hold. Let \(x_{{\bar{k}}}\) be a realization of Algorithm 4.1 with \(t_{{\bar{k}}}=1\). Assume that the iteration is successful and \(\Vert x_{{\bar{k}}}-x^*\Vert \) and \(\eta _{{\bar{k}}}\) are sufficiently small so that \(\frac{1}{\lambda _1}\left[ \frac{L_H}{2} \Vert x_{{\bar{k}}}-x^*\Vert + C \eta _{{\bar{k}}} + \frac{2\lambda _n\eta _{{\bar{k}}}}{1-{\bar{\eta }}} \right]<~{{\tilde{C}}}<~1\). Then, at least with probability \(p=(1-\delta _g)(1-\delta _H)\), it holds
If \(\{ \eta _k \}\) is a non-increasing sequence and the iterations \({\bar{k}},\ldots ,{\bar{k}}+\ell -1\) are successful with \(t_k=1\) for \(k={\bar{k}},\ldots ,{\bar{k}}+\ell -1\), then it holds \(\Vert x_{ k+1} -x^* \Vert < \Vert x_{ k} -x^* \Vert \) for \(k={\bar{k}},\ldots ,{\bar{k}}+\ell -1\), at least with probability \(p^\ell \).
Proof
Thanks to (37) it holds
Let us assume that both the events \(I_k \) and \(J_k\) are true. Then, \(\Vert g_k-\nabla f(x_k)\Vert \le \eta _k \Vert g_k\Vert \),
and by (2)
Moreover,
i.e., \(\Vert g_k \Vert \le \frac{1}{1-\eta _k}\Vert \nabla f(x_k) \Vert \). Then combining with (16) we have
Therefore
Then, since \(P(I_k\cap J_k)\ge p\) it follows
at least with probability p.
Therefore, since at iteration \({\bar{k}}\), \(\frac{1}{\lambda _1}\left[ \frac{L_H}{2} \Vert x_{{\bar{k}}}-x^*\Vert + C \eta _{{\bar{k}}} + \frac{2\lambda _n\eta _{{\bar{k}}}}{1-{\bar{\eta }}} \right]<~{{\tilde{C}}}<~1\) by assumption, it follows \(\Vert x_{\bar{k}+1}-x^* \Vert < {{\tilde{C}}} \Vert x_{{\bar{k}}}-x^* \Vert \).
At iteration \({\bar{k}}+1\), \(t_{{\bar{k}}+1}=1\) and the iteration is successful by hypothesis. Then, we can repeat the previous arguments and the thesis follows. \(\square \)
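The probability attached to Lemma 4.6 decays geometrically with the number of consecutive iterations considered. A minimal numerical sketch, with a hypothetical function name and hypothetical failure probabilities:

```python
def contraction_probability(delta_g, delta_h, ell):
    """Lower bound p**ell with p = (1 - delta_g) * (1 - delta_h)."""
    p = (1.0 - delta_g) * (1.0 - delta_h)
    return p ** ell

one_step = contraction_probability(0.05, 0.05, 1)
five_steps = contraction_probability(0.05, 0.05, 5)
```

With \(\delta _g=\delta _H=0.05\) the bound is \(p=0.9025\) for a single iteration and roughly 0.599 for \(\ell =5\) consecutive iterations, which shows how quickly the guarantee weakens unless the estimators are accurate with probability close to one.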
5 Finite sum case
In this section we consider the finite-sum minimization problem that arises in machine learning and data analysis:
The objective function f is the mean of N component functions \(f_i :{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) and for large values of N, the exact evaluation of the function and derivatives might be computationally expensive. We suppose that each \(f_i\) is strongly convex.
Following [1, 2, 16] f is evaluated exactly while the approximations \(g_k\) and \(H_k\) to the gradient and the Hessian respectively satisfy accuracy requirements in probability.
The evaluations of \(g_k\) and \(H_k\) can be made using subsampling, that means picking randomly and uniformly chosen subsets of indexes \({{{\mathcal {N}}}}_{g,k}\) and \({{{\mathcal {N}}}}_{H,k}\) from \({{{\mathcal {N}}}}=\{ 1,\ldots ,N \}\) and define
If \(g_k\) and \(H_k\) are required to be probabilistically sufficiently accurate as in Assumption 2.5 and in Assumption 4.5 respectively, the sample sizes \(|{{{\mathcal {N}}}}_{g,k} |\) and \(|{{{\mathcal {N}}}}_{H,k} |\) can be determined by using the operator-Bernstein inequality introduced in [34]. As shown in [6], \(g_k\) and \(H_k\) are \((1-\delta _g)\) and \((1-\delta _H)\) -probabilistically sufficiently accurate if
where \(\gamma _{g,k}\) is an approximation of the required gradient accuracy, namely \(\gamma _{g,k}\approx ~ t_k\eta _k \Vert G_k \Vert \) and under the assumption that, for any \(x\in {\mathbb {R}}^n\), there exist non-negative upper bounds \(\kappa _{f,g}\) and \(\kappa _{f,H}\) such that
A practical version of the procedure is shown in Algorithm 5.1. Gradient approximation requires a loop since the accuracy requirement is implicit; such a strategy is Step 2 of the following algorithm.
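The subsampled estimators can be sketched in a few lines. The example assumes the usual sample averages, \(g_k\) being the mean of \(\nabla f_i(x_k)\) over \({{{\mathcal {N}}}}_{g,k}\) and \(H_k\) the mean of \(\nabla ^2 f_i(x_k)\) over \({{{\mathcal {N}}}}_{H,k}\), with index sets drawn uniformly without replacement. The component functions \(f_i(x)=\frac{a_i}{2}\Vert x-b_i\Vert ^2\), the data `A` and `B`, the fixed sample sizes and all function names are hypothetical; the adaptive choice of the sample sizes is not reproduced here.

```python
import random

def subsampled_gradient(grad_i, x, indices):
    """Average of the component gradients grad_i(i, x) over indices."""
    total = [0.0] * len(x)
    for i in indices:
        total = [a + b for a, b in zip(total, grad_i(i, x))]
    return [a / len(indices) for a in total]

def subsampled_hessian(hess_i, x, indices):
    """Average of the component Hessians hess_i(i, x) over indices."""
    n = len(x)
    total = [[0.0] * n for _ in range(n)]
    for i in indices:
        Hi = hess_i(i, x)
        total = [[a + b for a, b in zip(ra, rb)]
                 for ra, rb in zip(total, Hi)]
    return [[a / len(indices) for a in row] for row in total]

# Hypothetical strongly convex components f_i(x) = 0.5 * A[i] * ||x - B[i]||^2.
A = [1.0, 2.0, 3.0, 4.0]
B = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def grad_i(i, x):
    return [A[i] * (xj - bj) for xj, bj in zip(x, B[i])]

def hess_i(i, x):
    n = len(x)
    return [[A[i] if r == c else 0.0 for c in range(n)] for r in range(n)]

rng = random.Random(0)                  # fixed seed for reproducibility
N = len(A)
idx_g = rng.sample(range(N), 2)         # indices for the gradient sample
idx_H = rng.sample(range(N), 2)         # indices for the Hessian sample
x = [1.0, 1.0]
g_k = subsampled_gradient(grad_i, x, idx_g)
H_k = subsampled_hessian(hess_i, x, idx_H)
```

Since every component here is strongly convex, any subsampled `H_k` is symmetric positive definite, which is the situation in which Assumption 2.6 is described as trivially enforced.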
Inexact Newton methods for finite-sum minimization problems are also investigated in [7, 12, 33]. In [7] a linesearch Newton-CG method is analyzed where the objective function and the gradient are approximated by subsampling with increasing sample sizes determined by a prefixed rule. Random estimates of the Hessian with adaptive accuracy requirements as in Assumption 4.5 are employed and local convergence results in the mean square are given. In [12] the local convergence of Inexact Newton methods is studied assuming a prefixed choice of the sample sizes used to estimate by subsampling both the gradient and the Hessian. The paper [33] studies the global as well as local convergence behavior of linesearch Inexact Newton algorithms, where the objective function is exact and the Hessian and/or gradient are sub-sampled. A high probability analysis of the local convergence of the method is given, whereas we prove complexity results in expectation with noise in the objective function. Moreover, the estimators \(g_k\) and \(H_k\) are supposed to be \((1-\delta _g)\) and \((1-\delta _H)\) -probabilistically sufficiently accurate as in our approach but with different accuracy requirements. Predetermined and increasing accuracy requirements are used in [33] rather than the adaptive accuracy requirements in Assumption 2.5 and in Assumption 4.5.
6 Conclusion
In this paper we presented three Inexact Newton-CG methods with linesearch suitable for strongly convex functions with deterministic noise. Two types of noise on the objective function, bounded noise and controllable noise, were considered. Regarding gradients, random approximations were allowed and their accuracy was supposed to be sufficiently high with a certain probability. The Hessians were possibly approximated by means of positive definite matrices.
We presented algorithms for the above two cases of noise on the objective function and analyzed the iteration complexity of the stochastic processes generated. In particular, we established a bound on the expected number of iterations that the algorithms take until the optimality gap reaches a desired accuracy for the first time. Subsequently, we studied the local behavior of the algorithm with controllable noise on the objective function and random approximations of the Hessian sufficiently accurate with a certain probability. Finally, the discussion was specialized to the case where f is a finite sum of strongly convex functions.
References
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24(3), 1238–1264 (2014)
Bellavia, S., Gurioli, G.: Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization 71, 227–261 (2022)
Bellavia, S., Gurioli, G., Morini, B.: Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA Journal of Numerical Analysis 41(1), 764–799 (2021)
Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives. Journal of Complexity 68, 101591 (2022)
Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Trust-region algorithms: probabilistic complexity and intrinsic noise with applications to subsampling techniques, arXiv preprint arXiv:2112.06176, (2021)
Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Adaptive Regularization Algorithms with Inexact Evaluations for Nonconvex Optimization. SIAM Journal on Optimization 29(4), 2281–2915 (2019)
Bellavia, S., Krejic, N., Krklec Jerinkic, N.: Subsampled Inexact Newton methods for minimizing large sums of convex functions. IMA Journal of Numerical Analysis 40, 2309–2341 (2020)
Bellavia, S., Krejić, N., Morini, B.: Inexact restoration with subsampled trust-region methods for finite-sum minimization. Computational Optimization and Applications 76, 701–736 (2020)
Berahas, A.S., Bollapragada, R., Nocedal, J.: An Investigation of Newton-Sketch and Subsampled Newton Methods. Optimization Methods and Software 35, 661–680 (2020)
Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM Journal on Optimization (2019)
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence Rate Analysis of a Stochastic Trust Region Method via Submartingales. INFORMS Journal on Optimization 1, 92–119 (2019)
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and Inexact Subsampled Newton Methods for Optimization. IMA Journal of Numerical Analysis (2018)
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Mathematical Programming 134(1), 127–155 (2012)
Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.: On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM Journal on Optimization 21(3), 977–995 (2011)
Carter, R.G.: On the global convergence of trust-region algorithms using inexact gradient information. SIAM Journal on Numerical Analysis 28, 251–265 (1991)
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming 169(2), 337–375 (2017)
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Mathematical Programming 169(2), 447–487 (2018)
Dembo, R.S., Eisenstat, S.C., Steinhaug, T.: Inexact Newton methods. SIAM Journal on Numerical Analysis 19(2), 400–409 (1982)
di Serafino, D., Krejić, N., Krklec Jerinkić, N., Viola, M.: LSOS: Line-search Second-Order Stochastic optimization methods for nonconvex finite sums. arXiv:2007.15966v2 (2021)
Eisenstat, S.C., Walker, H.F.: Choosing the Forcing Terms in an Inexact Newton Method. SIAM Journal on Scientific Computing 17(1), 16–32 (1996)
Fountoulakis, K., Gondzio, J.: A second order method for strongly convex \( \ell_1 \)-regularization problems. Mathematical Programming 156, 189–219 (2016)
Franchini, G., Ruggiero, V., Zanni, L.: Ritz-like values in steplength selections for stochastic gradient methods. Soft Computing 24, 17573–17588 (2020)
Franchini, G., Porta, F., Ruggiero, V., Trombini, I.: A line search based proximal stochastic gradient algorithm with dynamical variance reduction. Optimization Online, http://www.optimization-online.org/DB_HTML/2022/02/8810.html (2022)
Gratton, S., Toint, Ph.L.: A note on solving nonlinear optimization problems in variable precision. Computational Optimization and Applications 76, 917–933 (2020)
Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49(6), 409–436 (1952)
Liu, Y., Roosta, F.: Convergence of Newton-MR under inexact Hessian information. SIAM Journal on Optimization 31(1), 59–90 (2021)
Maggiar, A., Wächter, A., Dolinskaya, I.S., Staum, J.: A derivative-free trust-region algorithm for the optimization of functions smoothed via Gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization 28, 1478–1507 (2018)
Moré, J.J., Wild, S.M.: Estimating computational noise. SIAM Journal on Scientific Computing 33, 1292–1314 (2011)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science & Business Media (2013)
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 527–566 (2017)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research, Springer (1999)
Paquette, C., Scheinberg, K.: A Stochastic Line Search Method with Expected Complexity Analysis. SIAM Journal on Optimization 30, 349–376 (2020)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-Sampled Newton Methods. Mathematical Programming 174, 293–326 (2019)
Tropp, J.A.: An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning 8(1–2), 1–230 (2015)
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information. Mathematical Programming 184, 35–70 (2020)
Acknowledgements
INdAM-GNCS partially supported the first and third authors under Progetti di Ricerca 2021.
Funding
Open access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
S. Bellavia, E. Fabrizi, B. Morini: Member of the INdAM Research Group GNCS.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bellavia, S., Fabrizi, E. & Morini, B. Linesearch Newton-CG methods for convex optimization with noise. Ann Univ Ferrara 68, 483–504 (2022). https://doi.org/10.1007/s11565-022-00435-4
DOI: https://doi.org/10.1007/s11565-022-00435-4
Keywords
- Newton-CG
- Evaluation complexity
- Inexact function and derivatives
- Probabilistic analysis
- Finite-sum optimization
- Subsampling