Linesearch Newton-CG methods for convex optimization with noise

Bellavia, S.; Fabrizi, E.; Morini, B.

doi:10.1007/s11565-022-00435-4

Open access
Published: 17 August 2022

Volume 68, pages 483–504, (2022)

ANNALI DELL'UNIVERSITA' DI FERRARA

Abstract

This paper studies the numerical solution of strictly convex unconstrained optimization problems by linesearch Newton-CG methods. We focus on methods employing inexact evaluations of the objective function and inexact and possibly random gradient and Hessian estimates. The derivative estimates are required to satisfy suitable accuracy requirements not at every iteration but only with sufficiently high probability. Concerning the evaluation of the objective function, we first assume that the noise in the objective function evaluations is bounded in absolute value. Then, we analyze the case where the error satisfies prescribed dynamic accuracy requirements. For both cases we provide a complexity analysis and derive expected iteration complexity bounds. We finally focus on the specific case of finite-sum minimization, which is typical of machine learning applications.

1 Introduction

In this paper we consider globally convergent Inexact Newton methods for solving the strictly convex unconstrained optimization problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x). \end{aligned}$$

(1)

We focus on the Newton method where the linear systems are solved by the Conjugate Gradient (CG) method [25], usually denoted as Newton-CG method, and on the enhancement of its convergence properties by means of Armijo-type conditions.

The literature on globally convergent Newton-CG methods is well established as long as the gradient and the Hessian matrix are computed exactly or approximated sufficiently accurately in a deterministic way, see e.g., [15, 20, 31]. On the other hand, the research is currently very active for problems with inexact information on f and its derivatives and possibly such that the accuracy cannot be controlled in a deterministic way [1,2,3,4,5,6,7,8,9,10,11, 13, 14, 17, 19, 22, 23, 26, 32, 33, 35].

This work belongs to this recent stream and addresses the solution of (1) when the objective function f is computed with noise and the gradient and Hessian estimates are random. Importantly, the derivative estimates are required to satisfy suitable accuracy requirements not at every iteration but only with sufficiently high probability. Concerning the evaluation of f we cover two cases: estimates of f subject to noise that is bounded in absolute value; estimates of f subject to a controllable error, i.e., computable with a prescribed dynamic accuracy. Such a class of problems has been considered in [4, 10], and our contribution consists in their solution with linesearch Newton-CG methods; to our knowledge, this case has not been addressed in the literature. We provide two linesearch Newton-CG methods suitable for the class of problems specified above and provide bounds on the expected number of iterations required to reach a desired level of accuracy in the optimality gap.

The paper is organized as follows. In Sect. 2 we give preliminaries on Newton-CG and on the problems considered. In Sect. 3 we present and study a linesearch Newton-CG algorithm where function estimates are subject to a prefixed deterministic noise. In Sect. 4 we propose and study a linesearch Newton-CG algorithm where function estimates have controllable accuracy. In Sect. 5 we consider the specific case where f is a finite-sum which is typical of machine learning applications and compare our approach with the Inexact Newton methods specially designed for this class of problems given in [7, 12, 33].

In the rest of the paper $\Vert \cdot \Vert $ denotes the 2-norm. Given symmetric matrices A and B, $A\preceq B$ means that $B-A$ is positive semidefinite.

2 Our setting

In this section we provide preliminaries on the solution of problem (1) and the assumptions made. Our methods belong to the class of the Inexact Newton methods [18] combined with a linesearch strategy for enhancing convergence properties. A key feature is that function, gradient and Hessian evaluations are approximated and the errors in such approximations are either deterministic or stochastic, as specified below.

The Inexact Newton methods considered here are iterative processes where, given the current iterate $x_k$, a random approximation $g_k$ to $\nabla f(x_k)$ and a random approximation $H_k$ to $\nabla ^2 f(x_k)$, the trial step $s_k$ satisfies

$$\begin{aligned} H_k s_k = - g_k+ r_k, \quad \Vert r_k\Vert \le \eta _k \Vert g_k\Vert , \end{aligned}$$

(2)

for some $\eta _k\in (0,{\bar{\eta }}), 0<{\bar{\eta }}<1$, named forcing term.

With $s_k$ and a trial steplength $t_k$ at hand, some suitable sufficient decrease Armijo condition is tested on $x_k+t_k s_k$. Standard linesearch strategies are applied using the true function f. On the other hand, here we assume that the evaluation of f is subject to an error.

If $H_k$ is positive definite we can solve inexactly the linear systems $ H_k s=- g_k$ using the Conjugate Gradient (CG) method [25]. The resulting method is denoted as Newton-CG. If the initial guess for CG is the null vector, the following properties hold.

Lemma 2.1

Suppose that $H_k$ is symmetric positive definite and $ s_k $ is the vector in (2) obtained by applying the CG method with null initial guess to the linear system $ H_k s=- g_k$. Let $0<\lambda _1\le \lambda _n$ be such that

$$\begin{aligned} {\lambda _1 I \preceq H_k \preceq \lambda _n I. } \end{aligned}$$

(3)

Then, there exist constants $\kappa _1, \kappa _2, \beta >0$, such that:

$$\begin{aligned} \kappa _1\Vert g_k\Vert \le \Vert s_k \Vert \le \kappa _2 \Vert g_k\Vert , \quad -g_k^Ts_k \ge \beta \Vert s_k \Vert ^2, \qquad \forall k>0, \end{aligned}$$

(4)

which satisfy

$$\begin{aligned} \frac{1-{\bar{\eta }}}{\lambda _n} \le \kappa _1 \le \kappa _2\le \frac{1}{\lambda _1}, \quad \beta \ge \lambda _1. \end{aligned}$$

(5)

As a consequence

$$\begin{aligned}&\beta \kappa _1 \le \frac{-g_k^Ts_k}{\Vert s_k \Vert ^2} \frac{\Vert s_k \Vert }{\Vert g_k \Vert } = \frac{| g_k^Ts_k |}{\Vert g_k\Vert \Vert s_k \Vert }\le 1, \end{aligned}$$

(6)

$$\begin{aligned}&\lambda _n\kappa _1 \ge 1-{\bar{\eta }}, \end{aligned}$$

(7)

$$\begin{aligned}&\kappa _1 \lambda _1 \le \kappa _2 \lambda _1 \le 1, \end{aligned}$$

(8)

$$\begin{aligned}&-g_k^Ts_k \ge \beta \kappa _1^2 \Vert g_k\Vert ^2. \end{aligned}$$

(9)

Proof

Lemma 7 in [21] guarantees that any step ${\hat{s}}$ returned by the CG method applied to $H_ks=-g_k$ with null initial guess satisfies ${\hat{s}}^TH_k{\hat{s}}=-{\hat{s}}^Tg_k$. Then, it holds

$$\begin{aligned} s_k^TH_ks_k = -s_k^Tg_k, \end{aligned}$$

(10)

and by (3) we have

$$\begin{aligned} - g_k^T s_k= & {} s_k^T H_k s_k \ge \lambda _1 \Vert s_k\Vert ^2 \end{aligned}$$

(11)

which provides the lower bound on $\beta $ in (5).

By (11) it also follows that

$$\begin{aligned} \Vert s_k \Vert \le \lambda _1^{-1}\Vert g_k \Vert , \end{aligned}$$

(12)

which provides the upper bound on $\kappa _2$ in (5).

Moreover, using (2)

$$\begin{aligned} \Vert g_k - r_k\Vert \ge \Vert g_k\Vert - \Vert r_k\Vert \ge (1-\eta _k) \Vert g_k\Vert \ge (1-{\bar{\eta }}) \Vert g_k\Vert \end{aligned}$$

while (2) and (3) give

$$\begin{aligned} \Vert g_k - r_k\Vert = \Vert H_k s_k\Vert \le \Vert H_k \Vert \Vert s_k\Vert \le \lambda _n \Vert s_k\Vert \end{aligned}$$

and consequently

$$\begin{aligned} \Vert s_k\Vert \ge \frac{\Vert g_k - r_k\Vert }{\lambda _n} \ge \frac{1-{\bar{\eta }}}{\lambda _n} \Vert g_k \Vert , \end{aligned}$$

(13)

which provides the lower bound on $\kappa _1$ in (5).

Inequalities (6)–(9) are direct consequences of (4) and (5). $\square $
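The role of the null initial guess can be illustrated with a minimal sketch of CG stopped by the forcing-term test in (2). The helper names (`dot`, `matvec`, `cg_step`), the dense list-of-rows storage and the `max_iter` safeguard are assumptions made for illustration and do not come from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(H, v):
    return [dot(row, v) for row in H]

def cg_step(H, g, eta, max_iter=None):
    """Approximately solve H s = -g by CG started from s = 0.

    Stops as soon as ||H s + g|| <= eta * ||g||, i.e. the test in (2).
    H is a symmetric positive definite matrix stored as a list of rows.
    """
    n = len(g)
    if max_iter is None:
        max_iter = 10 * n          # illustrative safeguard, not from the paper
    s = [0.0] * n                  # null initial guess
    r = [-gi for gi in g]          # r = -g - H s, which equals -r_k in (2)
    p = list(r)
    rr = dot(r, r)
    tol = eta * math.sqrt(dot(g, g))
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol:
            break
        Hp = matvec(H, p)
        alpha = rr / dot(p, Hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

# Small example: with eta = 0.5 the loop stops after one CG iteration.
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
s = cg_step(H, g, eta=0.5)
curvature = dot(s, matvec(H, s))   # s^T H s
descent = -dot(g, s)               # -g^T s
```

In this example `curvature` and `descent` coincide, as stated in (10), even though the linear system is far from being solved exactly.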

2.1 Assumptions

We introduce the assumptions on the problem (1) and on the approximate evaluations of functions, gradients and Hessians.

Assumption 2.2

(smoothness and strong convexity of f) The function f is twice continuously differentiable and there exist some $\lambda _n\ge \lambda _1 > 0 $ such that the Hessian matrix $\nabla ^2 f(x)$ satisfies

$$\begin{aligned} \lambda _1 I \preceq \nabla ^2f(x) \preceq \lambda _n I,\;\;\; \forall x \in {{\mathbb {R}}}^n. \end{aligned}$$

(14)

As a consequence, f is strongly convex with constant $\lambda _1$, i.e.,

$$\begin{aligned} {f(x)\ge f(y) + \nabla f(y)^T(x-y) +\frac{\lambda _1}{2}\Vert x-y \Vert ^2 \text { for all } x,y\in {\mathbb {R}}^n} \end{aligned}$$

and the gradient of f is Lipschitz-continuous with constant $\lambda _n$, i.e.,

$$\begin{aligned} \Vert \nabla f(x) -\nabla f(y)\Vert \le \lambda _n\Vert x-y \Vert \text { for all } x,y\in {\mathbb {R}}^n. \end{aligned}$$

(15)

Further, letting $ x^* $ be the unique minimizer of the function f, for all $ x \in {\mathbb {R}}^n $ we have

$$\begin{aligned} \lambda _{1} \Vert x-x^*\Vert \le \Vert \nabla f(x) \Vert \le \lambda _{n} \Vert x-x^*\Vert , \end{aligned}$$

(16)

and

$$\begin{aligned} \frac{\lambda _1}{2 } \Vert x-x^{*}\Vert ^{2}\le f(x)-f(x^{*})\le \frac{1}{2\lambda _1} \Vert \nabla f(x)\Vert ^{2}, \end{aligned}$$

(17)

see [29, Theorem 2.10].

As for approximated evaluations of the objective function, we consider two possible cases. The first one is such that the value f(x) is approximated with a value ${{\tilde{f}}}(x)$ and the corresponding error is not controllable but its upper bound $\varepsilon _f$ is known.

Assumption 2.3

(boundedness of noise in f) There exists a positive scalar $\varepsilon _f$ such that

$$\begin{aligned} | f(x) - {{\tilde{f}}}(x) |\le \varepsilon _f, \quad \forall x\in {\mathbb {R}}^n. \end{aligned}$$

(18)

The second case concerns a controllable error between f(x) and ${{\tilde{f}}}(x)$ at the current iteration $x_k$ and at the trial iteration $x_k+t_ks_k$.

Assumption 2.4

(controllable noise in f) For all $k>0$ and some given positive $\theta $

$$\begin{aligned} | f(x_k) - {{\tilde{f}}}(x_k) |\le & {} -\theta t_k s_k^Tg_k, \nonumber \\ | f(x_{k}+t_ks_k) - {{\tilde{f}}}(x_{k}+t_ks_k) |\le & {} -\theta t_k s_k^Tg_k. \end{aligned}$$

(19)

The methods we are dealing with are globalized Inexact Newton methods employing random estimates $g_k$ and $H_k$ of the gradient and the Hessian and noisy values of the objective function. Then, they generate a stochastic process. We denote the random variables of our process as follows: the gradient estimator $ G_k$, the Hessian estimator ${\mathcal {H}}_k$, the step size parameter $ \mathcal T_k$, the search direction ${{{\mathcal {S}}}}_k$, the iterate $X_k$. Their realizations are denoted as $g_k = G_k(\omega _k), H_k = \mathcal H_k(\omega _k), t_k = {\mathcal {T}}_k(\omega _k)$, $s_k = {{{\mathcal {S}}}}_k(\omega _k)$ and $x_k=X_k(\omega _k)$, respectively, with $\omega _k$ taken from a proper probability space. For brevity we will omit $\omega _k$ in the following. We let ${\mathcal {E}}_{k-1}$ denote all noise history up to iteration $k - 1$ and we include ${\mathcal {E}}_{k-1}$ in the algorithmic history for completeness. We use ${\mathcal {F}}_{k-1} = \sigma (G_0,\ldots , G_{k-1}, \mathcal H_0,\ldots ,{\mathcal {H}}_{k-1},{\mathcal {E}}_{k-1})$ to denote the $\sigma $-algebra generated by $G_0, \ldots , G_{k-1}, \mathcal H_0,\ldots ,{\mathcal {H}}_{k-1}$ and ${\mathcal {E}}_{k-1}$, up to the beginning of iteration k.

We assume that the random gradient estimators $G_k$ are $(1-\delta _g)$-probabilistically sufficiently accurate.

Assumption 2.5

(gradient estimate) The estimator $G_k$ is $(1-\delta _g)$-probabilistically sufficiently accurate in the sense that the indicator variable

$$\begin{aligned} I_k = {\mathbbm {1}} \{ \Vert G_k - \nabla f(X_k) \Vert \le {t_k\eta _k} \Vert G_k \Vert \} \end{aligned}$$

(20)

satisfies the submartingale condition

$$\begin{aligned} P\left( I_k=1 | {\mathcal {F}}_{k-1} \right) \ge 1 - \delta _g, \quad \delta _g\in (0,1). \end{aligned}$$

(21)

Iteration k is called true (iteration) if $I_k=1$, false otherwise. Trivially, if the k-th iteration is true then the triangle inequality implies

$$\begin{aligned} \Vert \nabla f(x_k) \Vert \le \left( {1+t_k\eta _k} \right) \Vert g_k \Vert . \end{aligned}$$

(22)
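As a small illustration of the accuracy test in (20), the sketch below checks whether a given gradient estimate makes an iteration true. It is meant for analysis or synthetic experiments only, since the exact gradient is normally unavailable; the function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def is_true_iteration(g_est, grad_true, t, eta):
    """Return True when ||g_est - grad_true|| <= t * eta * ||g_est||, as in (20)."""
    err = norm([a - b for a, b in zip(g_est, grad_true)])
    return err <= t * eta * norm(g_est)

g_est = [1.0, 0.0]
grad_true = [1.05, 0.0]
true_at_full_step = is_true_iteration(g_est, grad_true, t=1.0, eta=0.1)
true_at_small_step = is_true_iteration(g_est, grad_true, t=0.25, eta=0.1)
```

The same estimate is accurate enough for $t_k=1$ but not for $t_k=0.25$: the requirement tightens as the steplength shrinks.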

Finally, at each iteration and for any realization, the approximation $H_k$ is supposed to be positive definite.

Assumption 2.6

For all $k\ge 0$, $H_k$ is symmetric positive definite and

$$\begin{aligned} \lambda _1 I \preceq H_k \preceq \lambda _n I, \end{aligned}$$

(23)

with $\lambda _1, \, {\lambda _n}$ as in (14).

We remark that assuming f strictly convex, (14) and (23) hold, with suitable choices of the scalars $\lambda _1, \, \lambda _n$, as long as the sequence $\{H_k\}$ is symmetric positive definite and has eigenvalues uniformly bounded from below and above.

Some comments on Assumptions 2.3, 2.4, 2.5 and 2.6 are in order. They appear in a series of papers on unconstrained optimization where the evaluations of function, gradient and Hessian are inexact, either with a controllable or a random noise. Controlling the noise in a deterministic way, as in Assumptions 2.3 and 2.4, is a realistic request in applications such as those where the accuracy of f-values can be enforced by the magnitude of some discretization parameters or f is evaluated in variable precision or approximated by using smoothing operators [24, 27, 28, 30]. Probabilistically sufficiently accurate gradients, as in Assumption 2.5, occur when the gradients are estimated by finite differences and some computation fails to complete, in derivative-free optimization, or when the gradients are estimated by sample average approximation methods [1, 6, 16, 17]. Finally, Assumption 2.6 amounts to building a convex random model and is trivially enforced if f is the sum of strongly convex functions and the Hessian is estimated by sample average approximation methods. In the literature, unconstrained optimization with inexact function and derivative evaluations covers many cases: exact function and gradient evaluations and possibly random Hessian [3, 13, 35], exact function and random gradient and Hessian [1, 2, 16], approximated function and random gradient and Hessian [4, 10], random function, gradient and Hessian [5, 8, 11, 17, 32]. In the class of Inexact Newton methods with random models we mention [7, 12, 14, 19, 26].

3 Bounded noise on f

In this section we present and analyze an Inexact Newton method with line-search where the function evaluation is noisy in the sense of Assumption 2.3.

At iteration k, given $x_k$ and the steplength $t_k$, a non-monotone Armijo condition given in [10] is used. It employs the known upper bound $\varepsilon _f$ introduced in Assumption 2.3 and has the form

$$\begin{aligned} {{\tilde{f}}}(x_k + t_k s_k) \le {{\tilde{f}}}(x_k) + c\, t_k s_k^T g_k + 2\varepsilon _f, \end{aligned}$$

(24)

with $c\in (0,1)$. If $x_k+t_ks_k$ satisfies (24) we say that the iteration is successful, we accept the step and increase the step-length $t_k$ for the next iteration. Otherwise the step is rejected and the step-length $t_k$ is reduced; at the next iteration $g_k$ and $H_k$ are supposed to be computed from scratch. Our procedure belongs to the framework given in Section 4.2 of [10] and it is sketched in Algorithm 3.1.

Similarly to [10, 16], Algorithm 3.1 generates a stochastic process. Given $x_{k}$ and $t_{k}$, the iterate $x_{k+1}$ is fully determined by $g_{k}$, $H_k$ and the noise in the function value estimation during iteration k.
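A minimal sketch of a loop of this kind is given below. It is not the authors' Algorithm 3.1: it only assumes what the text states (acceptance by (24), a larger steplength after a success, a smaller one after a rejection, estimates recomputed at every iteration). The update rules `t / tau` capped at `t_max` and `tau * t`, the callback names and the test problem are illustrative assumptions.

```python
import math

def linesearch_newton(x0, f_noisy, grad_est, newton_step, eps_f,
                      c=0.1, tau=0.5, t0=1.0, t_max=1.0, iters=50):
    """Inexact Newton iteration with the relaxed Armijo test (24)."""
    x, t = list(x0), t0
    for _ in range(iters):
        g = grad_est(x)              # recomputed at every iteration
        s = newton_step(x, g)        # e.g. a truncated CG step
        g_dot_s = sum(gi * si for gi, si in zip(g, s))
        trial = [xi + t * si for xi, si in zip(x, s)]
        if f_noisy(trial) <= f_noisy(x) + c * t * g_dot_s + 2.0 * eps_f:
            x = trial                # successful iteration: accept the step
            t = min(t_max, t / tau)  # assumed increase rule
        else:
            t = tau * t              # assumed decrease rule, x is unchanged
    return x, t

# Hypothetical test problem: f(x) = 0.5 * (2 x_1^2 + 5 x_2^2), bounded noise.
D = [2.0, 5.0]       # exact Hessian diagonal
H_EST = [3.0, 4.0]   # inexact Hessian estimate with eigenvalues in [2, 5]
EPS_F = 1e-6

def f_true(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def f_noisy(x):
    return f_true(x) + EPS_F * math.sin(1000.0 * sum(x))  # |noise| <= EPS_F

def grad_est(x):
    return [d * xi for d, xi in zip(D, x)]  # exact gradient, for simplicity

def newton_step(x, g):
    return [-gi / h for gi, h in zip(g, H_EST)]  # exact solve with H_EST

x_final, t_final = linesearch_newton([1.0, -2.0], f_noisy, grad_est,
                                     newton_step, EPS_F)
```

Here the gradient is exact and the linear system is solved exactly, so every iteration is true; a stochastic estimator or a truncated CG solver could be passed through the same callbacks.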

Concerning the well-definedness of the linesearch strategy (24), we now prove that if the iteration is true and $t_k$ is small enough, the linesearch condition is satisfied.

Lemma 3.1

Suppose that Assumptions 2.2, 2.3, 2.5 and 2.6 hold. Suppose that iteration k is true and consider any realization of Algorithm 3.1. Then the iteration is successful whenever $t_k \le \bar{t}= \frac{2\beta \kappa _1(1-c)}{\kappa _1\lambda _n+2}$.

Proof

Let k be an arbitrary iteration. Inequalities (15) and (18) imply, using the standard arguments for functions with bounded Hessians,

$$\begin{aligned}&{{\tilde{f}}}(x_k + t_ks_k) - \varepsilon _f \le f(x_{k}+t_{k} s_{k}) \\&\quad = f(x_{k}) + \int _{0}^{1} [\nabla f(x_{k}+{\zeta } t_{k} s_{k})]^{T} ( t_{k}s_{k}) d{\zeta }\\&\quad = f(x_{k}) + \int _{0}^{1} t_k\left( [\nabla f(x_{k}+{\zeta } t_{k} s_{k})]^{T} s_{k} \pm \nabla f(x_k)^Ts_k\right) d{\zeta }\\&\quad = f(x_{k}) + \int _{0}^{1} t_k[\nabla f(x_{k}+{\zeta } t_{k} s_{k}) - \nabla f(x_k)]^{T} s_{k} d{\zeta } + t_k\nabla f(x_k)^Ts_k\\&\quad \le f(x_{k}) + \int _{0}^{1} t_k\Vert \nabla f(x_{k}+{\zeta } t_{k} s_{k}) - \nabla f(x_k)\Vert \Vert s_{k}\Vert d{\zeta } + t_k\nabla f(x_k)^Ts_k\\&\quad \le f(x_{k}) + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k\nabla f(x_k)^Ts_k\\&\quad \le {{\tilde{f}}}(x_{k}) + \varepsilon _f + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k\nabla f(x_k)^Ts_k. \end{aligned}$$

Since iteration k is true by assumption then $\Vert \nabla f(x_k)- g_k\Vert \le t_k \eta _k \Vert g_k\Vert $ holds, and using Lemma 2.1 we obtain

$$\begin{aligned} {{\tilde{f}}}(x_k+t_ks_k)\le & {} {{\tilde{f}}}(x_{k}) + 2\varepsilon _f + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k \nabla f(x_k)^Ts_k \pm t_k g_k^Ts_k \nonumber \\= & {} {{\tilde{f}}}(x_{k}) + 2\varepsilon _f + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k [\nabla f(x_k)- g_k]^Ts_k + t_k g_k^Ts_k \nonumber \\\le & {} {{\tilde{f}}}(x_{k}) + 2\varepsilon _f + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k^2 \frac{\eta _k}{\kappa _1} \Vert s_k\Vert ^2 + t_k g_k^Ts_k. \end{aligned}$$

(25)

Then, the linesearch condition (24) is clearly enforced whenever

$$\begin{aligned} {{\tilde{f}}}(x_{k}) + 2\varepsilon _f + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k^2 \frac{\eta _k}{\kappa _1} \Vert s_k\Vert ^2 + t_k g_k^Ts_k \le {{\tilde{f}}}(x_k) + c t_k s_k^T g_k + 2\varepsilon _f \end{aligned}$$

which gives

$$\begin{aligned} t_k \Vert s_k\Vert ^2 \left( \frac{\lambda _n}{2} + \frac{\eta _k}{\kappa _1} \right) \le - (1-c) g_k^Ts_k . \end{aligned}$$

Using (4) we have $-(1-c) g_k^Ts_k \ge (1-c) \beta \Vert s_k \Vert ^2 $. Since $\eta _k<{\bar{\eta }}<1$, if

$$\begin{aligned} t_k \Vert s_k\Vert ^2 \left( \frac{\lambda _n}{2} + \frac{1}{\kappa _1} \right) \le (1-c) \beta \Vert s_k \Vert ^2, \end{aligned}$$

then (24) holds and this yields the thesis. $\square $
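To get a feeling for the size of the threshold, the sketch below evaluates $\bar{t}$ for given constants. The function name is hypothetical, and the optional `theta` argument covers the analogous threshold of Lemma 4.1, which carries the factor $(1-c-2\theta )$ instead of $(1-c)$. The example uses $\beta =\lambda _1$ and $\kappa _1=(1-{\bar{\eta }})/\lambda _n$, which are admissible by (11) and (13); the numerical values are arbitrary.

```python
def steplength_threshold(beta, kappa1, lam_n, c, theta=0.0):
    """Evaluate 2*beta*kappa1*(1 - c - 2*theta) / (kappa1*lam_n + 2)."""
    return 2.0 * beta * kappa1 * (1.0 - c - 2.0 * theta) / (kappa1 * lam_n + 2.0)

lam1, lam_n, eta_bar, c = 1.0, 10.0, 0.5, 0.1
beta = lam1                       # admissible by (11)
kappa1 = (1.0 - eta_bar) / lam_n  # admissible by (13)
t_bar = steplength_threshold(beta, kappa1, lam_n, c)
```

With these conservative constants $\bar{t}$ is about 0.036, which indicates how pessimistic the worst-case threshold can be compared with the steplengths accepted in practice.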

3.1 Complexity analysis of the stochastic process

In this section we carry out the convergence analysis of Algorithm 3.1. To this end we provide a bound on the expected number of iterations that the algorithm takes before it achieves a desired level of accuracy in the optimality gap $f(x_k)-f^*$ with $f^*=f(x^*)$ being the minimum value attained by f. This number of iterations is defined formally below.

Definition 3.2

Let $x^*$ be the global minimizer of f and $f^* = f(x^*)$. Given some $\epsilon >0$, $N_\epsilon $ is the number of iterations required until $f(x_k) - f^* \le \epsilon $ occurs for the first time.

The number of iterations $N_\epsilon $ is a random variable and it can be defined as the hitting time for our stochastic process. Indeed it has the property $\sigma ({\mathbbm {1}} \{N_\epsilon > k\}) \subset {\mathcal {F}}_{k-1}$.

Following the notation introduced in Sect. 2 we let $X_k$, $k\ge 0$, be the random variable with realization $x_k=X_k(\omega _k)$ and consider the following measure of progress towards optimality:

$$\begin{aligned} Z_k= \log \left( \frac{f(X_0)-f^*}{f(X_k)-f^*} \right) . \end{aligned}$$

(26)

Further, we let

$$\begin{aligned} Z_\epsilon = \log \left( \frac{f(X_0)-f^*}{\epsilon } \right) \end{aligned}$$

(27)

be an upper bound for $Z_k$ for any $k<N_\epsilon $. We denote with $z_k=Z_k(\omega _k)$ a realization of the random quantity $Z_k$.

A theoretical framework for analyzing a generic line search with noise has been developed in [10]. Under a suitable set of conditions, it provides the expected value for $N_\epsilon $. We state a result from [10] and will exploit it for our algorithm.

Theorem 3.3

Suppose that Assumptions 2.2, 2.3, 2.5, 2.6 hold. Let $z_k$ be a realization of $Z_k$ in (26) and suppose that there exist a constant ${\bar{t}} > 0$, a nondecreasing function $h(t) : {\mathbb {R}}^+\rightarrow {\mathbb {R}}$, which satisfies $h(t) > 0$ for any $t \in (0,t_{\max }]$, and a nondecreasing function $r(\varepsilon _f ) : {\mathbb {R}}\rightarrow {\mathbb {R}}$, which satisfies $r(\varepsilon _f ) \ge 0$ for any $\varepsilon _f \ge 0$, such that for any realization of Algorithm 3.1 the following holds for all $k < N_\epsilon $:

(i)
If iteration k is true and successful, then $z_{k+1} \ge z_k +h(t_k)- r(\varepsilon _f )$.
(ii)
If $t_k \le {\bar{t}}$ and iteration k is true then iteration k is also successful, which implies $t_{k+1} = \tau ^{-1}t_k$.
(iii)
$z_{k+1} \ge z_k - r(\varepsilon _f )$ for all successful iterations k and $z_{k+1} \ge z_k$ for all unsuccessful iterations k.
(iv)
The ratio $r(\varepsilon _f )/h({\bar{t}})$ is bounded from above by some $\gamma \in (0, 1)$.

Then under the condition that the probability $\delta _g$ in Assumption 2.5 is such that $\delta _g < \frac{1}{2} - \frac{\sqrt{\gamma }}{2} $, the stopping time $N_\epsilon $ is bounded in expectation as follows

$$\begin{aligned} {{\mathbb {E}}}[N_{\epsilon }] \le \frac{2(1-\delta _g)}{(1-2\delta _g)^2-\gamma }\left[ \frac{2Z_\epsilon }{h({\bar{t}})} + (1-\gamma )\log _{\tau }\frac{\bar{t}}{t_0} \right] . \end{aligned}$$

(28)

Proof

See [10, Assumption 3.3 and Theorem 3.13]. $\square $

We show that our algorithm satisfies the assumptions in Theorem 3.3 if the magnitude of $\epsilon $ fulfills the following condition.

Assumption 3.4

Let $c\in (0,1)$ as in (24), $\beta $, $\kappa _1$ as in Lemma 2.1, $\lambda _1$ as in Assumption 2.2, ${\bar{t}}$ as in Lemma 3.1, $t_{\max }$ as in Algorithm 3.1. Assume that $\epsilon $ in Definition 3.2 is such that

$$\begin{aligned} \epsilon > \frac{4\varepsilon _f}{\left( 1- M\right) ^{-\gamma }-1} \end{aligned}$$

(29)

where $M=\frac{c\beta \kappa _1^2\lambda _1 {\bar{t}}}{(1+t_{\max })^2}$, for some $\gamma \in (0,1)$ such that $(1-M)^{-\gamma }<2$.

Note that $M\in (0,1)$ due to the definition of ${\bar{t}}$ in Lemma 3.1 and the smaller M is, the larger is $\epsilon $ with respect to $\varepsilon _f$.
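The lower bound (29) can be evaluated numerically, as in the following sketch. The function name and the numerical values are illustrative assumptions; `expm1` and `log1p` are used only to avoid cancellation when M is tiny.

```python
import math

def accuracy_floor(c, beta, kappa1, lam1, t_bar, t_max, gamma, eps_f):
    """Return M and the right-hand side of (29)."""
    M = c * beta * kappa1 ** 2 * lam1 * t_bar / (1.0 + t_max) ** 2
    if not 0.0 < M < 1.0:
        raise ValueError("M must lie in (0, 1)")
    # (1 - M)^(-gamma) - 1, computed stably
    growth_minus_1 = math.expm1(-gamma * math.log1p(-M))
    if not growth_minus_1 < 1.0:
        raise ValueError("gamma must satisfy (1 - M)^(-gamma) < 2")
    return M, 4.0 * eps_f / growth_minus_1

M, eps_min = accuracy_floor(c=0.1, beta=1.0, kappa1=0.05, lam1=1.0,
                            t_bar=0.036, t_max=1.0, gamma=0.5, eps_f=1e-10)
```

With these conservative constants M is about $2.25\cdot 10^{-6}$ and the admissible $\epsilon $ is several orders of magnitude larger than $\varepsilon _f$, in line with the remark that a small M forces a large $\epsilon $ relative to $\varepsilon _f$.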

First, we provide a relation between $z_k$ and $z_{k+1}$ of the form specified in item (i), Theorem 3.3.

Lemma 3.5

Suppose that Assumptions 2.2, 2.3, 2.5, 2.6 and 3.4 hold. Consider any realization of Algorithm 3.1. If the k-th iterate is true and successful, then

$$\begin{aligned} z_{k+1} \ge z_k - \log \left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) - \log \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) , \end{aligned}$$

(30)

whenever $k<N_{\epsilon }$.

Proof

By (17), $f(x_k)-f^* \le \frac{1}{2\lambda _1}\Vert \nabla f(x_k) \Vert ^2$. Using (22)

$$\begin{aligned} \Vert g_k \Vert ^2 \ge \left( \frac{1}{1+t_k\eta _k} \right) ^2 \Vert \nabla f(x_k) \Vert ^2 \ge 2\lambda _1 \left( \frac{1}{1+t_k\eta _k}\right) ^2 ( f(x_k)-f^*). \end{aligned}$$

(31)

Combining condition (18), (24) and Lemma 2.1 it holds

$$\begin{aligned} f(x_k) - f(x_{k+1}) + 2 \varepsilon _f \ge {{\tilde{f}}}(x_k) - \tilde{f}(x_{k+1})\ge & {} -ct_k g_k^T s_k - 2 \varepsilon _f \nonumber \\\ge & {} ct_k \beta \kappa _1^2 \Vert g _k\Vert ^2 - 2 \varepsilon _f, \end{aligned}$$

(32)

and thus, using (31),

$$\begin{aligned} f(x_k) - f(x_{k+1}) \pm f^*\ge 2 ct_k \beta \kappa _1^2\lambda _1 \left( \frac{1}{1+t_k\eta _k}\right) ^2 ( f(x_k)-f^*) - 4 \varepsilon _f. \end{aligned}$$

Then it holds

$$\begin{aligned} f(x_{k+1}) -f^* \le \left( 1-2 c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_k\eta _k)^2} \right) ( f(x_k)-f^*) + 4 \varepsilon _f. \end{aligned}$$

We define $\Delta _k^f = f(x_k) - f^*$. Because of $f(x_k)-f^*>\epsilon $, we have

$$\begin{aligned} \Delta _{k+1}^f\le & {} \left( 1-2 c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_k\eta _k)^2} + \frac{4\varepsilon _f}{\epsilon }\right) \Delta _k^f\\\le & {} \left( 1- c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_k\eta _k)^2} -\frac{4\varepsilon _f}{\epsilon }c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_k\eta _k)^2} + \frac{4\varepsilon _f}{\epsilon }\right) \Delta _k^f\\= & {} \left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_k\eta _k)^2} \right) \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) \Delta _k^f\\\le & {} \left( 1-c \beta \kappa _1^2\lambda _1\frac{t_k}{(1+t_{\max })^2} \right) \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) \Delta _k^f\\ \end{aligned}$$

where the second inequality holds thanks to Assumption 3.4, which gives $4\varepsilon _f <\epsilon $, and the last one holds since $t_k\le t_{\max }$ and $\eta _k< 1$.

Notice that since $\left( 1+ \frac{4\varepsilon _f}{\epsilon }\right) >~0$, $\Delta _k^f > 0$ and $\Delta _{k+1}^f \ge 0$, it holds $\left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) \ge ~0$. Now, taking the inverse and then the $\log $ of both sides, adding $\log \Delta _0^f$, we have

$$\begin{aligned} \log \left( \frac{\Delta _0^f}{\Delta _{k+1}^f} \right) \ge \log \left( \frac{\Delta _0^f}{\Delta _k^f} \right) - \log \left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) - \log \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) , \end{aligned}$$

which completes the proof.

$\square $

The next lemma analyzes item (iii) of Theorem 3.3.

Lemma 3.6

Suppose that Assumptions 2.2, 2.3, 2.5 and 2.6 hold. Consider any realization of Algorithm 3.1. For every iteration that is false and successful, we have

$$\begin{aligned} z_{k+1} \ge z_k - \log \left( 1+\frac{4\varepsilon _f}{\epsilon } \right) . \end{aligned}$$

Moreover $z_{k+1}=z_k$ for any unsuccessful iteration.

Proof

For every false and successful iteration, using (18), (24) and (4) we have

$$\begin{aligned} f(x_{k+1})\le & {} f(x_k) + ct_k s_k^\top g_k + 4\varepsilon _f \\\le & {} f(x_k) + 4\varepsilon _f, \end{aligned}$$

thus, because of $f(x_k)-f^*>\epsilon $,

$$\begin{aligned} f(x_{k+1}) -f^*\le & {} f(x_k) - f^* + 4\varepsilon _f \\\le & {} \left( 1+\frac{4\varepsilon _f}{\epsilon } \right) (f(x_k) - f^*). \end{aligned}$$

So it holds $\Delta _{k+1}^f \le \left( 1+\frac{4\varepsilon _f}{\epsilon } \right) \Delta _k^f$. Now taking the inverse and then the log of both sides, adding $\log \Delta _0^f$ we have

$$\begin{aligned} \log \left( \frac{\Delta _0^f}{\Delta _{k+1}^f} \right) \ge \log \left( \frac{\Delta _0^f}{\Delta _k^f} \right) - \log \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) \end{aligned}$$

which completes the first part of the proof. Finally, for any unsuccessful iteration $z_{k+1}=z_k$ follows by Step 3 of Algorithm 3.1 that provides $x_{k+1}=x_k$ and hence $f(x_{k+1})=f(x_k)$. $\square $

We can now summarize our results. First, note that $\left( 1-c \beta \kappa _1^2\lambda _1 \frac{t_k}{(1+t_{\max })^2}\right) \ge 0$ for all $t_k\in [0,t_{\max }]$, due to (6) and (8). Second, let

$$\begin{aligned} h(t) = - \log \left( 1-c \beta \kappa _1^2\lambda _1\frac{t}{(1+t_{\max })^2}\right) \quad \text {and} \quad r(\varepsilon _f) = \log \left( 1 + \frac{4\varepsilon _f}{\epsilon }\right) . \end{aligned}$$

(33)

It is easy to see that h(t) is monotone nondecreasing for $t\in [0,t_{\max }]$, as required by Theorem 3.3.

Combining Lemmas 3.1, 3.5 and 3.6 , we have that for any realization of Algorithm 3.1 and $k< N_\epsilon $ with $\epsilon $ as in Assumption 3.4:

(i)
(Lemma 3.5) If iteration k is true and successful, then $z_{k+1} \ge z_k +h(t_k)- r(\varepsilon _f )$.
(ii)
(Lemma 3.1) If $t_k \le {\bar{t}}$ and iteration k is true then iteration k is also successful, which implies $t_{k+1} = \tau ^{-1}t_k$.
(iii)
(Lemma 3.6) $z_{k+1} \ge z_k - r(\varepsilon _f )$ for all successful iterations k and $z_{k+1} = z_k$ for all unsuccessful iterations k.
(iv)
(Assumption 3.4) The ratio $r(\varepsilon _f )/h({\bar{t}})$ is bounded from above by some $\gamma \in ~(0, 1)$.

Hence, we can use Theorem 3.3 and get the following bound on ${{\mathbb {E}}}[N_\epsilon ]$,

$$\begin{aligned} {{\mathbb {E}}}[N_\epsilon ] \le \frac{2(1-\delta _g)}{(1-2\delta _g)^2-\gamma }\left[ {2 \log _{1/(1-M)} \left( \frac{f(x_0)-f^*}{\epsilon } \right) } + (1-\gamma )\log _{\tau }\frac{{\bar{t}}}{t_0} \right] \end{aligned}$$

with M given in Assumption 3.4. This result is valid under Assumption 3.4, namely for sufficiently large values of $\epsilon $. The fact that $\epsilon $ cannot be arbitrarily small is consistent with the presence of noise $\varepsilon _f$ in f-evaluations. Trivially, if $\varepsilon _f=0$ then the optimality gap $f(x_k)-f^*$ can be made arbitrarily small.
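For given constants, the right-hand side of the bound on ${{\mathbb {E}}}[N_\epsilon ]$ can be evaluated as in the sketch below. The function name and the sample values are assumptions chosen so that the logarithms are easy to check by hand.

```python
import math

def expected_iteration_bound(delta_g, gamma, M, gap0, eps, tau, t_bar, t0):
    """Right-hand side of the bound on E[N_eps] for the bounded-noise case.

    gap0 stands for f(x_0) - f^*.
    """
    if not delta_g < 0.5 - 0.5 * math.sqrt(gamma):
        raise ValueError("need delta_g < 1/2 - sqrt(gamma)/2")
    # 2 * log_{1/(1-M)}(gap0 / eps)
    z_term = 2.0 * math.log(gap0 / eps) / (-math.log1p(-M))
    # (1 - gamma) * log_tau(t_bar / t0)
    t_term = (1.0 - gamma) * math.log(t_bar / t0) / math.log(tau)
    factor = 2.0 * (1.0 - delta_g) / ((1.0 - 2.0 * delta_g) ** 2 - gamma)
    return factor * (z_term + t_term)

bound = expected_iteration_bound(delta_g=0.1, gamma=0.25, M=0.5,
                                 gap0=16.0, eps=1.0, tau=0.5,
                                 t_bar=0.25, t0=1.0)
```

In this example the bracket equals $2\cdot 4 + 0.75\cdot 2 = 9.5$ and the leading factor is $1.8/0.39$. The factor grows without bound as $\delta _g$ approaches $\frac{1}{2}-\frac{\sqrt{\gamma }}{2}$.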

4 Decreasing noise on f

In this section we present an Inexact Newton algorithm suitable for the case where f-evaluations can be performed with adaptive accuracy. Letting $c \in (0,\frac{1}{2})$, we use the linesearch condition

$$\begin{aligned} {{\tilde{f}}}(x_k + t_k s_k) \le {{\tilde{f}}}(x_k) + c t_k s_k^T g_k , \end{aligned}$$

(34)

where ${{\tilde{f}}}(x_k)$ and ${{\tilde{f}}}(x_k + t_k s_k)$ satisfy Assumption 2.4 with $\theta < \frac{c}{2}$. In fact, (34) has the form of the classical Armijo condition but the true f is replaced by the approximation ${{\tilde{f}}}$.

The resulting algorithm is given below.
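The sketch below illustrates only the acceptance test (34) coupled with the accuracy requirement (19); it is not the authors' algorithm. It assumes a hypothetical oracle `f_with_tol(x, tol)` that returns an approximation of f(x) with absolute error at most `tol`, and all names and values are illustrative.

```python
import math

def accept_step(x, s, g, t, f_with_tol, c=0.2, theta=0.05):
    """Test (34) with both f-values computed to accuracy -theta*t*s^T g, as in (19)."""
    g_dot_s = sum(gi * si for gi, si in zip(g, s))
    tol = -theta * t * g_dot_s      # positive when s is a descent direction
    trial = [xi + t * si for xi, si in zip(x, s)]
    return f_with_tol(trial, tol) <= f_with_tol(x, tol) + c * t * g_dot_s

def f_with_tol(x, tol):
    # Hypothetical oracle: exact value of ||x||^2 plus an error bounded by tol.
    exact = sum(xi * xi for xi in x)
    return exact + tol * math.cos(7.0 * x[0])

ok = accept_step([1.0], [-1.0], [2.0], 1.0, f_with_tol)        # Newton step
too_long = accept_step([1.0], [-2.2], [2.0], 1.0, f_with_tol)  # overlong step
```

The tolerance requested from the oracle scales with $-t_k s_k^Tg_k$, so function values must be computed more accurately as the predicted decrease shrinks.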

The following Lemma shows that a successful iteration is guaranteed whenever it is true and $t_k$ is sufficiently small.

Lemma 4.1

Suppose that Assumptions 2.2, 2.4 with $\theta <\frac{c}{ 2}$, 2.5 and 2.6 hold. Suppose that iteration k is true and consider any realization of Algorithm 4.1. Then the iteration is successful whenever $t_k\le {\bar{t}} = \frac{2\kappa _1\beta }{\kappa _1\lambda _n+2}(1-c-2\theta )$.

Proof

Using the same arguments as in Lemma 3.1, with (19) and (34) in place of (18) and (24), we obtain

$$\begin{aligned} {{\tilde{f}}}(x_k + t_ks_k)\le & {} {{\tilde{f}}}(x_{k}) - 2\theta t_ks_k^Tg_k + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k^2 \frac{\eta _k}{\kappa _1} \Vert s_k\Vert ^2 + t_k g_k^Ts_k. \end{aligned}$$

The linesearch condition (34) is clearly enforced whenever

$$\begin{aligned} {{\tilde{f}}}(x_{k}) - 2\theta t_ks_k^Tg_k + \frac{\lambda _n}{2} t_{k}^2 \Vert s_{k}\Vert ^2 + t_k^2 \frac{\eta _k}{\kappa _1} \Vert s_k\Vert ^2 + t_k g_k^Ts_k \le {{\tilde{f}}}(x_k) + c t_k s_k^T g_k \end{aligned}$$

which gives

$$\begin{aligned} t_k \Vert s_k\Vert ^2 \left( \frac{\lambda _n}{2} + \frac{\eta _k}{\kappa _1} \right) \le - (1-c-2\theta ) g_k^Ts_k. \end{aligned}$$

Note that $1-c-2\theta >0$ by $c\in (0,1/2)$ and $\theta \in (0, \frac{c}{2})$. Using (4) we have $-(1-c-2\theta ) g_k^Ts_k \ge (1-c-2\theta ) \beta \Vert s_k \Vert ^2 $. Then, since $\eta _k<\bar{\eta }<1$, (34) holds if

$$\begin{aligned} t_k \Vert s_k\Vert ^2 \left( \frac{\lambda _n}{2} + \frac{1}{\kappa _1} \right) \le (1-c-2\theta ) \beta \Vert s_k \Vert ^2 , \end{aligned}$$

and this yields the thesis. $\square $

4.1 Complexity analysis of the stochastic process

The behaviour of the method is studied analyzing the hitting time $N_\epsilon $ in Definition 3.2. In particular, we first show the following two results on the realization $z_k$ of the variable $Z_k$ in (26).

Lemma 4.2

Suppose that Assumptions 2.2, 2.4 with $\theta <\frac{c}{ 2}$, 2.5 and 2.6 hold. If the k-th iteration of Algorithm 4.1 is true and successful, then for any realization of Algorithm 4.1 we have

$$\begin{aligned} z_{k+1} \ge z_k - \log \left( 1- 2 \beta \kappa _1^2 \lambda _1 \left( c-2\theta \right) \frac{t_k}{(1+t_{\max })^2} \right) , \end{aligned}$$

(35)

whenever $k<N_{\epsilon }$.

Proof

Using the same arguments as in Lemma 3.5, with (19) and (34) in place of (18) and (24), we obtain

$$\begin{aligned} f(x_{k}) -f(x_{k+1})\ge & {} -(c-2\theta )t_k g_k^Ts_k\\\ge & {} t_k \beta \kappa _1^2\left( c-2\theta \right) \Vert g_k \Vert ^2, \end{aligned}$$

where the second inequality comes from (4). Thus

$$\begin{aligned} f(x_{k+1})-f^* \le f(x_k) -f^* -t_k\beta \kappa _1^2(c-2\theta ) \Vert g_k \Vert ^2 \end{aligned}$$

and using (31) we get

$$\begin{aligned} f(x_{k+1})-f^* \le \left( 1- 2 \beta \kappa _1^2 \lambda _1 \left( c-2\theta \right) \frac{t_k}{(1+t_{\max })^2} \right) ( f(x_k)-f^*). \end{aligned}$$

Now proceeding as in Lemma 3.5 we have the thesis. $\square $

Lemma 4.3

Suppose that Assumptions 2.2, 2.4 with $\theta <\frac{c}{ 2}$, 2.5 and 2.6 hold. For any realization of Algorithm 4.1 we have

$$\begin{aligned} z_{k+1} > z_k, \end{aligned}$$

if the iteration k is false and successful,

$$\begin{aligned} z_{k+1}=z_k, \end{aligned}$$

if the iteration k is unsuccessful.

Proof

For every false and successful iteration, using (19) and (34), we have

$$\begin{aligned} f(x_{k+1})\le & {} f(x_k) + ct_k s_k^T g_k - 2\theta t_kg_k^Ts_k \\= & {} f(x_k) + (c-2\theta ) t_k s_k^Tg_k < f(x_k), \end{aligned}$$

and in case of unsuccessful iteration Step 4 of the algorithm provides $x_{k+1}=x_k$. Then, due to the definition of $Z_k$ in (26) the thesis follows.

$\square $

Now we can state the main result on the expected value for the hitting time.

Theorem 4.4

Suppose that Assumptions 2.2, 2.4 with $\theta <\frac{c}{2}$, 2.5 and 2.6 hold and let $\bar{t}$ be given as in Lemma 4.1. Then, provided that the probability $\delta _g$ in Assumption 2.5 satisfies $\delta _g < \frac{1}{2} $, the stopping time $N_\epsilon $ is bounded in expectation as follows

$$\begin{aligned} {{\mathbb {E}}}[N_\epsilon ] \le \frac{2(1-\delta _g)}{(1-2\delta _g)^2}\left[ {2 \log _{1/(1-M)} \left( \frac{f(x_0)-f^*}{\epsilon } \right) } + \log _{\tau }\frac{{\bar{t}}}{t_0} \right] \end{aligned}$$

with $M=\frac{2(c-2\theta )\beta \kappa _1^2\lambda _1 \bar{t}}{(1+t_{\max })^2}$.

Proof

Let

$$\begin{aligned} h(t) = - \log \left( 1- 2 \beta \kappa _1^2 \lambda _1 \left( c-2\theta \right) \frac{t}{(1+t_{\max })^2} \right) , \end{aligned}$$

(36)

and note that $h(t)$ is non-decreasing for $t\in [0,t_{\max }]$ and that $h(t)> 0$ for $t\in (0,t_{\max }]$. For any realization $z_k$ of $Z_k$ in (26) of Algorithm 4.1 the following hold for all $k < N_\epsilon $:

(i) If iteration k is true and successful, then $z_{k+1} \ge z_k +h(t_k)$ by Lemma 4.2.
(ii) If $t_k \le {\bar{t}}$ and iteration k is true then iteration k is also successful, which implies $t_{k+1} = \tau ^{-1}t_k$ by Lemma 4.1.
(iii) $z_{k+1} \ge z_k $ for all successful iterations k ($z_{k+1} = z_k$ for all unsuccessful iterations k), by Lemma 4.3.

Moreover, our stochastic process $\{{{\mathcal {T}}_k}, Z_k\}$ obeys the expressions below. By Lemma 4.1 and the definition of Algorithm 4.1 the update of the random variable $\mathcal T_k $ such that $t_k = {\mathcal {T}}_k(\omega _k)$ is

$$\begin{aligned} {\mathcal {T}}_{k+1}= \left\{ \begin{array}{ll} \tau ^{-1} {\mathcal {T}}_k &{} \text{ if } I_k=1 , \, {\mathcal {T}}_k \le {\bar{t}} \ \text{(i.e., } \text{ successful } \text{ iteration) }\\ \tau ^{-1} {\mathcal {T}}_k &{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ successful }, \, I_k=0 , \, {\mathcal {T}}_k \le {\bar{t}} \\ \tau \, {\mathcal {T}}_k &{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ unsuccessful }, \, I_k=0 , \, {\mathcal {T}}_k \le {\bar{t}} \\ \tau ^{-1} {\mathcal {T}}_k &{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ successful },{\mathcal {T}}_k> {\bar{t}} \\ \tau \, {\mathcal {T}}_k &{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ unsuccessful }, {\mathcal {T}}_k > {\bar{t}}, \\ \end{array} \right. \end{aligned}$$

where the event $I_k$ is defined in (20). By Lemmas 4.1, 4.2 and 4.3 the random variable $Z_k $ obeys the expression

$$\begin{aligned} Z_{k+1}\ge \left\{ \begin{array}{ll} Z_k +h({\mathcal {T}}_k)&{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ successful } \text{ and } \, I_k=1 \\ Z_k &{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ successful } \text{ and } I_k=0 \\ Z_k&{} \text{ if } \text{ the } \text{ iteration } \text{ is } \text{ unsuccessful }\\ \end{array} \right. \end{aligned}$$

Then Lemmas 2.2–2.7 and Theorem 2.1 in [16] hold, which gives the thesis. $\square $
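To get a feel for how the bound of Theorem 4.4 depends on its constants, the following sketch evaluates $M$ and the right-hand side of the bound. The function names are hypothetical, and the numbers in the usage note are arbitrary illustrative inputs rather than values from the paper.

```python
import math


def decrease_constant(c, theta, beta, kappa1, lambda_1, t_bar, t_max):
    """M = 2*(c - 2*theta)*beta*kappa1^2*lambda_1*t_bar / (1 + t_max)^2."""
    return (2.0 * (c - 2.0 * theta) * beta * kappa1 ** 2 * lambda_1 * t_bar
            / (1.0 + t_max) ** 2)


def expected_hitting_time_bound(delta_g, M, gap0, eps, t_bar, t0, tau):
    """Right-hand side of the bound on E[N_eps] in Theorem 4.4.

    gap0 stands for f(x_0) - f^*, which is assumed to be known here.
    """
    if not 0.0 <= delta_g < 0.5:
        raise ValueError("the bound requires delta_g < 1/2")
    if not 0.0 < M < 1.0:
        raise ValueError("need 0 < M < 1 so that log base 1/(1-M) is defined")
    if not 0.0 < tau < 1.0:
        raise ValueError("need tau in (0, 1)")
    if not gap0 > eps > 0.0:
        raise ValueError("need gap0 > eps > 0")
    prefactor = 2.0 * (1.0 - delta_g) / (1.0 - 2.0 * delta_g) ** 2
    log_gap = math.log(gap0 / eps) / math.log(1.0 / (1.0 - M))  # log_{1/(1-M)}
    log_step = math.log(t_bar / t0) / math.log(tau)             # log_tau
    return prefactor * (2.0 * log_gap + log_step)
```

With `delta_g=0.25, M=0.5, gap0=4.0, eps=1.0, t_bar=0.25, t0=1.0, tau=0.5` both logarithms equal 2 and the prefactor equals 6, so the bound evaluates to 36. The prefactor grows without bound as `delta_g` approaches 1/2, which reflects the requirement $\delta _g<\frac{1}{2}$.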

4.2 Local convergence

We conclude our study analyzing the local behaviour of the Newton-CG method employing gradient estimates $(1-\delta _g)$-probabilistically sufficiently accurate, i.e. satisfying Assumption 2.5 and Hessian estimates satisfying the following assumption.

Assumption 4.5

The Hessian of the objective function f is Lipschitz-continuous with constant $L_H>0$,

$$\begin{aligned} \Vert \nabla ^2 f(x)-\nabla ^2 f(y) \Vert \le L_H \Vert x-y \Vert , \quad \forall x, y \in {\mathbb {R}}^n. \end{aligned}$$

(37)

Given a constant $C>0$, the Hessian estimator is $(1-\delta _H)$-probabilistically sufficiently accurate in the sense that the indicator variable

$$\begin{aligned} J_k={\mathbbm {1}}\{ \Vert {\mathcal {H}}_k - \nabla ^2 f(X_k)\Vert \le C\eta _k\} \end{aligned}$$

satisfies the submartingale condition

$$\begin{aligned} P( J_k=1 | {\mathcal {F}}_{k-1} ) \ge 1-\delta _H, \quad \delta _H\in (0,1). \end{aligned}$$

(38)

We let $t_{\max }=1$, so that the maximum step-size gives the full CG step $s_k$.

The following lemma shows that if the full CG step $s_k$ is accepted then the error linearly decreases with a certain probability. Further, the same occurrence over $\ell $ successive iterations is analyzed.

Lemma 4.6

Suppose that Assumptions 2.2, 2.4 with $\theta <\frac{c}{2}$, 2.5, 2.6 and 4.5 hold. Let $x_{{\bar{k}}}$ be a realization of Algorithm 4.1 with $t_{{\bar{k}}}=1$. Assume that the iteration is successful and $\Vert x_{{\bar{k}}}-x^*\Vert $ and $\eta _{{\bar{k}}}$ are sufficiently small so that $\frac{1}{\lambda _1}\left[ \frac{L_H}{2} \Vert x_{{\bar{k}}}-x^*\Vert + C \eta _{{\bar{k}}} + \frac{2\lambda _n\eta _{{\bar{k}}}}{1-{\bar{\eta }}} \right]<~{{\tilde{C}}}<~1$. Then, at least with probability $p=(1-\delta _g)(1-\delta _H)$, it holds

$$\begin{aligned} \Vert x_{{\bar{k}}+1}-x^* \Vert < {{\tilde{C}}} \Vert x_{{\bar{k}}}-x^* \Vert . \end{aligned}$$

If $\{ \eta _k \}$ is a non-increasing sequence and the iterations ${\bar{k}},\ldots ,{\bar{k}}+\ell -1$ are successful with $t_k=1$ for $k={\bar{k}},\ldots ,{\bar{k}}+\ell -1$, then it holds $\Vert x_{ k+1} -x^* \Vert < \Vert x_{ k} -x^* \Vert $ for $k={\bar{k}},\ldots ,{\bar{k}}+\ell -1$, at least with probability $p^\ell $.

Proof

$$\begin{aligned} \Vert x_{k+1}-x^* \Vert= & {} \Vert x_k + s_k-x^* \Vert \\= & {} \Vert x_k - H_k^{-1}g_k + H_k^{-1}r_k -x^* \Vert \\= & {} \Vert H_k^{-1} [\nabla ^2 f(x_k)(x_k-x^*) - \nabla ^2 f(x_k)(x_k-x^*)\\&+ H_k(x_k-x^*) - g_k \pm \nabla f(x_k)+ r_k ]\Vert \\\le & {} \Vert H_k^{-1}\Vert ( \Vert \nabla ^2 f(x_k)(x_k-x^*) -\nabla f(x_k)\Vert \\&+ \Vert (\nabla ^2 f(x_k) - H_k) (x_k-x^*)\Vert + \Vert g_k -\nabla f(x_k)\Vert + \Vert r_k\Vert ) \\\le & {} \frac{1}{\lambda _1} \Big ( \Vert \nabla ^2 f(x_k)(x_k-x^*) -\nabla f(x_k)\Vert \\&+ \Vert (\nabla ^2 f(x_k) - H_k) (x_k-x^*)\Vert + \Vert g_k -\nabla f(x_k)\Vert + \Vert r_k\Vert \Big ). \end{aligned}$$

Thanks to (37) it holds

$$\begin{aligned}&\Vert \nabla ^2 f(x_k)(x_k-x^*) -\nabla f(x_k)\Vert \\&\quad =\left\| \int _{0}^{1} [\nabla ^2f(x_k) - \nabla ^2f(x^*+\zeta (x_k-x^*))](x_k-x^*) d\zeta \right\| \\&\quad \le \int _{0}^{1} \left\| \nabla ^2f(x_k) - \nabla ^2f(x_k-(1-\zeta )(x_k-x^*)) \right\| d\zeta \, \left\| x_k-x^* \right\| \\&\quad \le \int _{0}^{1} L_H\, (1-\zeta ) d\zeta \, \left\| x_k-x^* \right\| ^2= \frac{L_H}{2} \Vert x_k-x^*\Vert ^2. \end{aligned}$$

Let us assume that both the events $I_k $ and $J_k$ are true. Then, $\Vert g_k-\nabla f(x_k)\Vert \le \eta _k \Vert g_k\Vert $,

$$\begin{aligned} \Vert (\nabla ^2 f(x_k) - H_k) (x_k-x^*)\Vert\le & {} \Vert \nabla ^2 f(x_k) - H_k\Vert \Vert x_k-x^*\Vert \\\le & {} C \eta _k \Vert x_k-x^*\Vert , \end{aligned}$$

and by (2)

$$\begin{aligned} \Vert g_k-\nabla f(x_k)\Vert + \Vert r_k \Vert \le 2\eta _k \Vert g_k \Vert . \end{aligned}$$

Moreover,

$$\begin{aligned} \Vert g_k \Vert \le \Vert g_k-\nabla f(x_k) \Vert + \Vert \nabla f(x_k) \Vert \le \eta _k \Vert g_k \Vert + \Vert \nabla f(x_k) \Vert \end{aligned}$$

i.e., $\Vert g_k \Vert \le \frac{1}{1-\eta _k}\Vert \nabla f(x_k) \Vert $. Then combining with (16) we have

$$\begin{aligned} \Vert g_k \Vert \le \frac{1}{1-\eta _k}\Vert \nabla f(x_k) \Vert \le \lambda _n\frac{1}{1-\eta _k}\Vert x_k -x^* \Vert \le \lambda _n\frac{1}{1-{\bar{\eta }}}\Vert x_k -x^* \Vert . \end{aligned}$$

Therefore

$$\begin{aligned} \Vert g_k-\nabla f(x_k)\Vert + \Vert r_k \Vert \le \frac{2\eta _k\lambda _n}{1-{\bar{\eta }}}\Vert x_k -x^* \Vert . \end{aligned}$$

Then, since $P(I_k\cap J_k)\ge p$ it follows

$$\begin{aligned} \Vert x_{k+1}-x^* \Vert\le & {} \frac{1}{\lambda _1} \left[ \frac{L_H}{2} \Vert x_k-x^*\Vert ^2 + C \eta _k \Vert x_k-x^*\Vert + \frac{2\lambda _n\eta _k}{1-{\bar{\eta }}} \Vert x_k-x^*\Vert \right] \\ \end{aligned}$$

at least with probability p.

Therefore, since at iteration ${\bar{k}}$, $\frac{1}{\lambda _1}\left[ \frac{L_H}{2} \Vert x_{{\bar{k}}}-x^*\Vert + C \eta _{{\bar{k}}} + \frac{2\lambda _n\eta _{{\bar{k}}}}{1-{\bar{\eta }}} \right]<~{{\tilde{C}}}<~1$ by assumption, it follows $\Vert x_{\bar{k}+1}-x^* \Vert < {{\tilde{C}}} \Vert x_{{\bar{k}}}-x^* \Vert $.

At iteration ${\bar{k}}+1$, $t_{{\bar{k}}+1}=1$ and the iteration is successful by hypothesis. Then, we can repeat the previous arguments and the thesis follows. $\square $
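The quantities in Lemma 4.6 are easy to evaluate for given constants. The sketch below computes the bracketed factor that must stay below ${{\tilde{C}}}<1$ and the probability $p^\ell $ attached to $\ell $ consecutive successful full steps. Function names and the sample values are hypothetical.

```python
def contraction_factor(dist, eta, L_H, C, lambda_1, lambda_n, eta_bar):
    """(1/lambda_1) * [L_H/2 * dist + C*eta + 2*lambda_n*eta/(1 - eta_bar)],
    where dist stands for ||x_k - x^*||."""
    return (0.5 * L_H * dist + C * eta
            + 2.0 * lambda_n * eta / (1.0 - eta_bar)) / lambda_1


def probability_of_ell_steps(delta_g, delta_H, ell):
    """p^ell with p = (1 - delta_g) * (1 - delta_H)."""
    p = (1.0 - delta_g) * (1.0 - delta_H)
    return p ** ell
```

For example, `contraction_factor(0.1, 0.01, 2.0, 1.0, 1.0, 4.0, 0.5)` is about 0.27, so the hypothesis of the lemma holds for any ${{\tilde{C}}}$ between that value and 1, and `probability_of_ell_steps(0.1, 0.1, 2)` is about 0.656.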

5 Finite sum case

In this section we consider the finite-sum minimization problem that arises in machine learning and data analysis:

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x) = \frac{1}{N} \sum _{i=1}^N f_i(x). \end{aligned}$$

(39)

The objective function f is the mean of N component functions $f_i :{\mathbb {R}}^n \rightarrow {\mathbb {R}}$ and for large values of N, the exact evaluation of the function and derivatives might be computationally expensive. We suppose that each $f_i$ is strongly convex.

Following [1, 2, 16], $f$ is evaluated exactly, while the approximations $g_k$ and $H_k$ to the gradient and the Hessian, respectively, satisfy accuracy requirements in probability.

The estimates $g_k$ and $H_k$ can be computed by subsampling, that is, by choosing subsets of indices ${{{\mathcal {N}}}}_{g,k}$ and ${{{\mathcal {N}}}}_{H,k}$ uniformly at random from ${{{\mathcal {N}}}}=\{ 1,\ldots ,N \}$ and defining

$$\begin{aligned} g_k =\frac{1}{| {{{\mathcal {N}}}}_{g,k} |} \sum _{i\in {{{\mathcal {N}}}}_{g,k}} \nabla f_i (x_k), \quad \text { and } \quad H_k =\frac{1}{| {{{\mathcal {N}}}}_{H,k} |} \sum _{i\in {{{\mathcal {N}}}}_{H,k}} \nabla ^2 f_i (x_k). \end{aligned}$$

(40)
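The subsampled estimators in (40) can be sketched in a few lines of plain Python. The component functions used here, $f_i(x)=\frac{1}{2}(a_i^Tx-b_i)^2+\frac{\mu }{2}\Vert x\Vert ^2$, and the toy data `A`, `b`, `mu` are assumptions made only for this illustration. Indices run from 0 to N-1 in the code rather than from 1 to N.

```python
import random


def subsample(N, size, rng):
    """Draw min(size, N) distinct indices uniformly at random from {0,...,N-1}."""
    return rng.sample(range(N), min(size, N))


def subsampled_gradient(grad_i, x, indices):
    """Average of the component gradients over the sampled indices, as in (40)."""
    g = [0.0] * len(x)
    for i in indices:
        gi = grad_i(i, x)
        for j in range(len(x)):
            g[j] += gi[j]
    return [v / len(indices) for v in g]


def subsampled_hessian(hess_i, x, indices):
    """Average of the component Hessians over the sampled indices, as in (40)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in indices:
        Hi = hess_i(i, x)
        for r in range(n):
            for s in range(n):
                H[r][s] += Hi[r][s]
    return [[v / len(indices) for v in row] for row in H]


# Hypothetical strongly convex components (regularized least squares).
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, -1.0]]
b = [1.0, 0.0, 2.0, -1.0]
mu = 0.1


def grad_i(i, x):
    residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
    return [residual * a + mu * xj for a, xj in zip(A[i], x)]


def hess_i(i, x):
    n = len(x)
    return [[A[i][r] * A[i][s] + (mu if r == s else 0.0) for s in range(n)]
            for r in range(n)]
```

When the sample contains all N indices, both estimators coincide with the exact gradient and Hessian of $f$, which is a convenient sanity check.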

If $g_k$ and $H_k$ are required to be probabilistically sufficiently accurate as in Assumption 2.5 and in Assumption 4.5 respectively, the sample sizes $|{{{\mathcal {N}}}}_{g,k} |$ and $|{{{\mathcal {N}}}}_{H,k} |$ can be determined by using the operator-Bernstein inequality introduced in [34]. As shown in [6], $g_k$ and $H_k$ are $(1-\delta _g)$- and $(1-\delta _H)$-probabilistically sufficiently accurate if

$$\begin{aligned} |{{{\mathcal {N}}}}_{g,k} |\ge & {} \min \left\{ N, \frac{4 \kappa _{f,g}(x_k)}{\gamma _{g,k}} \left( \frac{\kappa _{f,g}(x_k)}{\gamma _{g,k}} +\frac{1}{3} \right) \log \left( \frac{n+1}{\delta _g} \right) \right\} , \end{aligned}$$

(41)

$$\begin{aligned} |{{{\mathcal {N}}}}_{H,k} |\ge & {} \min \left\{ N, \frac{4 \kappa _{f,H}(x_k)}{ C\eta _k } \left( \frac{\kappa _{f,H}(x_k)}{ C\eta _k } +\frac{1}{3} \right) \log \left( \frac{2n}{\delta _H} \right) \right\} , \end{aligned}$$

(42)

where $\gamma _{g,k}$ is an approximation of the required gradient accuracy, namely $\gamma _{g,k}\approx ~ t_k\eta _k \Vert G_k \Vert $ and under the assumption that, for any $x\in {\mathbb {R}}^n$, there exist non-negative upper bounds $\kappa _{f,g}$ and $\kappa _{f,H}$ such that

$$\begin{aligned} \max _{i\in \{ 1,\ldots ,N \}}\Vert \nabla f_i(x) \Vert&\le \kappa _{f,g}(x), \\ \max _{i\in \{ 1,\ldots ,N \}}\Vert \nabla ^2 f_i(x) \Vert&\le \kappa _{f,H}(x). \end{aligned}$$
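The sample-size rules (41) and (42) translate directly into code. In the sketch below the function names are hypothetical, and the bounds $\kappa _{f,g}(x_k)$, $\kappa _{f,H}(x_k)$ are assumed to be supplied by the user. Rounding up to an integer is a practical choice made here, since a sample size must be a whole number.

```python
import math


def gradient_sample_size(N, kappa_g, gamma, delta_g, n):
    """Smallest integer sample size satisfying (41)."""
    ratio = kappa_g / gamma
    bound = 4.0 * ratio * (ratio + 1.0 / 3.0) * math.log((n + 1) / delta_g)
    return min(N, math.ceil(bound))


def hessian_sample_size(N, kappa_H, C, eta, delta_H, n):
    """Smallest integer sample size satisfying (42)."""
    ratio = kappa_H / (C * eta)
    bound = 4.0 * ratio * (ratio + 1.0 / 3.0) * math.log(2 * n / delta_H)
    return min(N, math.ceil(bound))
```

Both sizes grow roughly like the inverse square of the requested accuracy, so halving `gamma` or `eta` approximately quadruples the sample until the cap `N` is reached.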

A practical version of the procedure is shown in Algorithm 5.1. Gradient approximation requires a loop since the accuracy requirement is implicit; such a strategy is Step 2 of the following algorithm.
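Because the target accuracy $t_k\eta _k \Vert g_k \Vert $ depends on the gradient estimate itself, one natural way to handle the implicit requirement is to guess an accuracy $\gamma $, sample accordingly, and shrink $\gamma $ until it is consistent with the computed estimate. The sketch below is only one plausible form of such a loop under that assumption; Algorithm 5.1 is not reproduced here and may differ. All names (`adaptive_gradient`, `sample_gradient`, `sample_size_for`, `shrink`, `max_tries`) are hypothetical.

```python
import math


def adaptive_gradient(sample_gradient, sample_size_for, t, eta, gamma0, N,
                      shrink=0.5, max_tries=50):
    """Shrink the accuracy guess gamma until gamma <= t*eta*||g|| or the
    full sample is used.

    sample_gradient(m): returns a gradient estimate built from m samples.
    sample_size_for(gamma): returns the sample size for accuracy gamma,
        for instance via a rule such as (41).
    """
    gamma = gamma0
    g, m = None, 0
    for _ in range(max_tries):
        m = min(N, sample_size_for(gamma))
        g = sample_gradient(m)
        norm_g = math.sqrt(sum(v * v for v in g))
        # Accept when the guess is consistent with the estimate, or when
        # the whole data set is already in use.
        if gamma <= t * eta * norm_g or m >= N:
            break
        gamma *= shrink
    return g, m, gamma
```

The loop terminates either because the accuracy guess has become small enough relative to the estimated gradient norm, or because the sample has reached the full data set; `max_tries` is a safeguard added for this illustration.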

Inexact Newton methods for finite-sum minimization problems are also investigated in [7, 12, 33]. In [7], a linesearch Newton-CG method is analyzed in which the objective function and the gradient are approximated by subsampling, with increasing sample sizes determined by a prefixed rule. Random estimates of the Hessian with adaptive accuracy requirements as in Assumption 4.5 are employed, and local convergence results in the mean square are given. In [12], the local convergence of the Inexact Newton method is studied under a prefixed choice of the sample sizes used to estimate both the gradient and the Hessian by subsampling. The paper [33] studies the global as well as the local convergence behavior of linesearch Inexact Newton algorithms, where the objective function is exact and the Hessian and/or gradient are subsampled. A high-probability analysis of the local convergence of the method is given there, whereas we prove complexity results in expectation with noise in the objective function. Moreover, the estimators $g_k$ and $H_k$ are supposed to be $(1-\delta _g)$- and $(1-\delta _H)$-probabilistically sufficiently accurate as in our approach, but with different accuracy requirements: predetermined and increasing accuracy requirements are used in [33], rather than the adaptive accuracy requirements in Assumption 2.5 and in Assumption 4.5.

6 Conclusion

In this paper we presented two Inexact Newton-CG methods with linesearch suitable for strongly convex functions with deterministic noise. Two types of noise on the objective function, bounded noise and controllable noise, were considered. Regarding gradients, random approximations were allowed and their accuracy was supposed to be sufficiently high with a certain probability. The Hessians were possibly approximated by means of positive definite matrices.

We presented algorithms for the above two cases of noise on the objective function and analyzed the iteration complexity of the stochastic processes generated. In particular, we established a bound on the expected number of iterations that the algorithms take until the optimality gap reaches a desired accuracy for the first time. Subsequently, we studied the local behavior of the algorithm with controllable noise on the objective function and random approximations of the Hessian that are sufficiently accurate with a certain probability. Finally, the discussion was specialized to the case where f is a finite sum of strongly convex functions.

References

[1] Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24(3), 1238–1264 (2014)
[2] Bellavia, S., Gurioli, G.: Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization 71, 227–261 (2022)
[3] Bellavia, S., Gurioli, G., Morini, B.: Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA Journal of Numerical Analysis 41(1), 764–799 (2021)
[4] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives. Journal of Complexity 68, 101591 (2022)
[5] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Trust-region algorithms: probabilistic complexity and intrinsic noise with applications to subsampling techniques. arXiv preprint arXiv:2112.06176 (2021)
[6] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Adaptive Regularization Algorithms with Inexact Evaluations for Nonconvex Optimization. SIAM Journal on Optimization 29(4), 2881–2915 (2019)
[7] Bellavia, S., Krejić, N., Krklec Jerinkić, N.: Subsampled Inexact Newton methods for minimizing large sums of convex functions. IMA Journal of Numerical Analysis 40, 2309–2341 (2020)
[8] Bellavia, S., Krejić, N., Morini, B.: Inexact restoration with subsampled trust-region methods for finite-sum minimization. Computational Optimization and Applications 76, 701–736 (2020)
[9] Berahas, A.S., Bollapragada, R., Nocedal, J.: An Investigation of Newton-Sketch and Subsampled Newton Methods. Optimization Methods and Software 35, 661–680 (2020)
[10] Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM Journal on Optimization (2019)
[11] Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence Rate Analysis of a Stochastic Trust Region Method via Submartingales. INFORMS Journal on Optimization 1, 92–119 (2019)
[12] Bollapragada, R., Byrd, R., Nocedal, J.: Exact and Inexact Subsampled Newton Methods for Optimization. IMA Journal of Numerical Analysis (2018)
[13] Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Mathematical Programming 134(1), 127–155 (2012)
[14] Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.: On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM Journal on Optimization 21(3), 977–995 (2011)
[15] Carter, R.G.: On the global convergence of trust-region algorithms using inexact gradient information. SIAM Journal on Numerical Analysis 28, 251–265 (1991)
[16] Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic model. Mathematical Programming 169(2), 337–375 (2017)
[17] Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Mathematical Programming 169(2), 447–487 (2018)
[18] Dembo, R.S., Eisenstat, S.C., Steihaug, T.: Inexact Newton methods. SIAM Journal on Numerical Analysis 19(2), 400–409 (1982)
[19] di Serafino, D., Krejić, N., Krklec Jerinkić, N., Viola, M.: LSOS: Line-search Second-Order Stochastic optimization methods for nonconvex finite sums. arXiv:2007.15966v2 (2021)
[20] Eisenstat, S.C., Walker, H.F.: Choosing the Forcing Terms in an Inexact Newton Method. SIAM Journal on Scientific Computing 17(1), 16–32 (1996)
[21] Fountoulakis, K., Gondzio, J.: A second order method for strongly convex $\ell _1$-regularization problems. Mathematical Programming 156, 189–219 (2016)
[22] Franchini, G., Ruggiero, V., Zanni, L.: Ritz-like values in steplength selections for stochastic gradient methods. Soft Computing 24, 17573–17588 (2020)
[23] Franchini, G., Porta, F., Ruggiero, V., Trombini, I.: A line search based proximal stochastic gradient algorithm with dynamical variance reduction. Optimization Online, http://www.optimization-online.org/DB_HTML/2022/02/8810.html (2022)
[24] Gratton, S., Toint, Ph.L.: A note on solving nonlinear optimization problems in variable precision. Computational Optimization and Applications 76, 917–933 (2020)
[25] Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49(6), 409–436 (1952)
[26] Liu, Y., Roosta, F.: Convergence of Newton-MR under inexact Hessian information. SIAM Journal on Optimization 31(1), 59–90 (2021)
[27] Maggiar, A., Wächter, A., Dolinskaya, I.S., Staum, J.: A derivative-free trust-region algorithm for the optimization of functions smoothed via Gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization 28, 1478–1507 (2018)
[28] Moré, J.J., Wild, S.M.: Estimating computational noise. SIAM Journal on Scientific Computing 33, 1292–1314 (2011)
[29] Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science & Business Media (2013)
[30] Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 527–566 (2017)
[31] Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research, Springer (1999)
[32] Paquette, C., Scheinberg, K.: A Stochastic Line Search Method with Expected Complexity Analysis. SIAM Journal on Optimization 30, 349–376 (2020)
[33] Roosta-Khorasani, F., Mahoney, M.W.: Sub-Sampled Newton Methods. Mathematical Programming 174, 293–326 (2019)
[34] Tropp, J.A.: An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning 8(1–2), 1–230 (2015)
[35] Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information. Mathematical Programming 184, 35–70 (2020)

Acknowledgements

INdAM-GNCS partially supported the first and third authors under Progetti di Ricerca 2021.

Funding

Open access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of Industrial Engineering, Università degli Studi di Firenze, Florence, Italy
S. Bellavia & B. Morini
Department of Mathematics and Computer Science “Ulisse Dini”, Università degli Studi di Firenze, Florence, Italy
E. Fabrizi

Corresponding author

Correspondence to S. Bellavia.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

S. Bellavia, E. Fabrizi, B. Morini: Member of the INdAM Research Group GNCS.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bellavia, S., Fabrizi, E. & Morini, B. Linesearch Newton-CG methods for convex optimization with noise. Ann Univ Ferrara 68, 483–504 (2022). https://doi.org/10.1007/s11565-022-00435-4

Received: 30 April 2022
Accepted: 29 July 2022
Published: 17 August 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11565-022-00435-4
