1 Introduction

In this paper we consider the finite-sum minimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}f_N(x)= \frac{1}{N}\sum _{i=1}^N \phi _i(x), \end{aligned}$$
(1)

where N is very large and finite and \(\phi _i: \mathbb {R}^n\rightarrow \mathbb {R}\), \(1\le i\le N\), are continuously differentiable. A number of important problems can be stated in this form, e.g., classification problems in machine learning, data fitting problems, sample average approximations of an objective function given in the form of mathematical expectation. In recent years the need for efficient methods for solving (1) resulted in a large body of literature and a number of methods have been proposed and analyzed, see e.g., the reviews [1,2,3].

It is common to employ subsampled approximations of the objective function and its derivatives with the aim of reducing the computational cost. Focusing on first-order methods, the stochastic gradient [4] and more contemporary variants like SVRG [5, 6], SAG [7], ADAM [8] and SARAH [9] are widely used for their simplicity and low per-iteration cost. They do not call for function evaluations but require tuning the learning rate and possibly further hyper-parameters such as the mini-batch size. Since the tuning effort may be very computationally demanding [10], more sophisticated approaches use stochastic linesearch or trust-region strategies to adaptively choose the learning rate, see [1, 10,11,12,13,14,15]. In this context, function and gradient approximations have to satisfy sufficient accuracy requirements with some probability. When the approximations are built via sampling, this in turn requires adaptive choices of the sample sizes used.

In a further stream of works, problem (1) is reformulated as a constrained optimization problem and the sample size is computed deterministically using the Inexact Restoration (IR) approach. The IR approach has been successfully combined with either the linesearch strategy [16] or the trust-region strategy [17,18,19]; in these papers, function and gradient estimates are built with gradually increasing accuracy and averaging on the same sample.

We propose a novel trust-region method with random models based on the IR methodology. In our proposed method, feasibility and optimality are improved in a modular way, and the resulting procedure differs from the existing stochastic trust-region schemes [13, 14, 20,21,22] in the acceptance rule for the step. We provide a theoretical analysis and give a bound on the expected iteration complexity to satisfy an approximate first-order optimality condition; this calls for accuracy conditions on random gradients that are assumed to hold with some sufficiently large but fixed probability and are, in general, less stringent than the corresponding ones in [13, 14, 20,21,22]. Our theoretical analysis improves over the one for the stochastic trust-region method with inexact restoration given in [19], since we no longer rely on standard theory for deterministic unconstrained optimization invoked eventually when functions and gradients are computed exactly.

The paper is organized as follows. In Sect. 2 we give an overview of random models employed in the trust-region framework and introduce the main features of our contribution. The new algorithm is proposed in Sect. 3 and studied theoretically with respect to the iteration complexity analysis. Extensive numerical results are presented in Sect. 4.

2 Trust-region method with random models

Variants of the standard trust-region method based on the use of random models have been presented, to our knowledge, in [13, 14, 19,20,21,22,23]. They consist in the adaptation of the trust-region framework to the case where random estimates of the derivatives are introduced and function values are either computed exactly [20] or replaced by stochastic estimates [13, 14, 19, 21,22,23].

The computation and acceptance of the iterates parallel the standard trust-region mechanism, and the success of the procedure relies on function values and models being sufficiently accurate with fixed and large enough probability. The accuracy requests in the mentioned works show many similarities; here we illustrate some issues related to the works [13, 14, 22], which are closer to our approach.

Let \(\Vert \cdot \Vert \) denote the 2-norm throughout the paper. At iteration k of a first-order stochastic trust-region method, given \(x_k\), the positive trust-region radius \(\delta _k\) and a random approximation \(g_k\) of \(\nabla f_N(x_k)\), consider the model

$$\begin{aligned} \varsigma _k(x_k+{p})=f_N(x_k)+ g_k^T { p} \end{aligned}$$

for \(f_N\) on \(B(x_k, \delta _k)=\{x\in {\mathbb {R}}^n: \Vert x-x_k\Vert \le \delta _k\}\) and the trust-region problem \(\min _{\Vert { p}\Vert \le \delta _k} \varsigma _k(x_k+ {p})\). Thus, the trust region step takes the form \( { p}_k=-\delta _k g_k/\Vert g_k\Vert \).

Two estimates \(f^{k,0}\) and \(f^{k, {p}}\) of \(f_N\) at \(x_k\) and \(x_k+ {p}_k\), respectively, are employed to either accept or reject the trial point \(x_k+ { p}_k\). The classical ratio between the actual and predicted reduction is replaced by

$$\begin{aligned} \rho _k=\displaystyle \frac{f^{k,0}-f^{k, { p}}}{\varsigma _k(x_k)-\varsigma _k(x_k+ { p}_k)}, \end{aligned}$$
(2)

and a successful iteration is declared when \(\rho _k \ge \eta _1\) and \( \Vert g_k\Vert \ge \eta _2 \delta _k\) for some constants \( \eta _1 \in (0,1) \) and positive and possibly large \( \eta _2 \). Note that neither the step \( { p}_k\) nor the denominator in (2) depends on \(f_N(x_k)\). Furthermore, note that a successful iteration might not yield an actual reduction in \(f_N\) because the quantities involved in \(\rho _k\) are random approximations to the true values of the objective function.
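As an illustration, the step and the acceptance test just described can be sketched as follows. This is a minimal sketch, not the implementation used in the cited works: the function names `tr_step` and `ratio_test` are hypothetical, vectors are plain Python lists, and \(g_k\) is assumed to be nonzero. The denominator of (2) is evaluated as \(\delta _k\Vert g_k\Vert \), which is the decrease of the linear model along \(p_k\).

```python
import math


def tr_step(g, delta):
    """Minimizer of the linear model over the ball of radius delta.

    Returns the step p = -delta * g / ||g|| together with ||g||.
    """
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    if norm_g == 0.0:
        raise ValueError("g must be nonzero for the step to be defined")
    return [-delta * gi / norm_g for gi in g], norm_g


def ratio_test(f0_est, fp_est, g, delta, eta1, eta2):
    """Return (rho, success) for the acceptance rule built on the ratio (2).

    f0_est and fp_est are the (possibly random) estimates of the objective
    at the current point and at the trial point.
    """
    _, norm_g = tr_step(g, delta)
    predicted = delta * norm_g  # model decrease: -g^T p = delta * ||g||
    rho = (f0_est - fp_est) / predicted
    success = rho >= eta1 and norm_g >= eta2 * delta
    return rho, success
```

Note that the same estimated decrease may be accepted or rejected depending on \(\eta _2\): a large \(\eta _2\) rejects steps whose radius is large compared with \(\Vert g_k\Vert \).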

The condition \( \Vert g_k\Vert \ge \eta _2 \delta _k\) is not typical of deterministic trust-region methods for smooth optimization and depends on the fact that \(\delta _k\) controls the accuracy of function and gradients. Specifically, the models used are required to be sufficiently accurate with some probability. The model \(\varsigma _k\) is supposed to be, \( {\pi }_M\)-probabilistically, a \(\kappa _*\)-fully linear model of \(f_N\) on the ball \(B(x_k, \delta _k)\), i.e., the requirement

$$\begin{aligned} |f_N(y)-\varsigma _k(y)|\le \kappa _* \delta _k^2, \quad \Vert \nabla f_N(y)-g_k\Vert \le \kappa _* \delta _k, \quad y \in B(x_k,\delta _k) \end{aligned}$$
(3)

with \(\kappa _*>0\), has to be fulfilled at least with probability \( {\pi }_M \in (0,1)\). Moreover, the estimates \(f^{k,0}\) and \(f^{k, {p}}\) are supposed to be \( {\pi }_f\)-probabilistically \(\epsilon _F\)-accurate estimates of \(f_N(x_k)\) and \(f_N(x_k+ {p}_k)\), i.e., the requirement

$$\begin{aligned} |f^{k,0} -f_N(x_k) |\le \epsilon _F \delta _k^2, \quad |f^{k, { p}}-f_N(x_k +{ p}_k)|\le \epsilon _F \delta _k^2, \end{aligned}$$
(4)

has to be fulfilled at least with probability \( {\pi }_f\in (0,1)\). Clearly, if \(f_N\) is computed exactly then condition (4) is trivially satisfied.

Convergence analysis in [13, 14, 22] shows that for \( {\pi }_M \) and \( {\pi }_f \) sufficiently large it holds \(\lim _{k\rightarrow \infty } \delta _k=0\) almost surely. Moreover, if \(f_N\) is bounded from below and \(\nabla f_N\) is Lipschitz continuous, then \(\lim _{k\rightarrow \infty } \Vert \nabla f_N(x_k)\Vert =0\) almost surely. Interestingly, the accuracy in (3) and (4) increases as the trust region radius gets smaller but the probabilities \( {\pi }_M\) and \( {\pi }_f\) are fixed.

For problem (1) it is straightforward to build approximations of \(f_N\) and \(\nabla f_N\) by sample average approximations

$$\begin{aligned} f_M(x) = \frac{1}{M}\sum _{i\in I_M} \phi _i(x),\qquad \nabla f_S(x) = \frac{1}{S}\sum _{i\in I_S} \nabla \phi _i(x), \end{aligned}$$
(5)

where \(I_M\) and \(I_S\) are subsets of \(\{1, \ldots , N\}\) of cardinality \(| I_M|=M\) and \(|I_S|=S\), respectively. The choice of sample size such that (3) and (4) hold in probability is discussed in [14, §5] as follows. Let \({\mathbb {E}}_{\omega }[|\phi _{\omega }(x )-f_N(x)|^2]\le V_f\), \({\mathbb {E}}_{\omega }[\Vert \nabla \phi _{\omega }(x)-\nabla f_N(x)\Vert ^2]\le V_g\), \( \forall x\in \mathbb {R}^n\), with \({\mathbb {E}}_{\omega }\) being the expected value with respect to the random index \(\omega \) employed for sampling, and assume

$$\begin{aligned}&M\ge \frac{V_f}{\epsilon _F^2(1- {\pi }_f)\delta _k^4},\;\;\;\;\; S \ge \frac{V_g}{\kappa _*^2(1- {\pi }_g)\delta _k^2} \nonumber \\&\quad \text{ and } \max \{M,S\}\le N. \end{aligned}$$
(6)

Then \(f^{k,0}\) and \(f^{k, {p}}\) built as in (5) with sample size M satisfy (4) with probability \( {\pi }_f\), while \(g_k\) built as in (5) with sample size S satisfies \(\Vert \nabla f_N(x_k)-g_k\Vert \le \kappa _* \delta _k\) with probability \( {\pi }_g\). Furthermore, using Taylor expansion and Lipschitz continuity of \(\nabla f_N\), it can be proved that (3) is met with probability \( {\pi }_M= {\pi }_f {\pi }_g\); consequently, a \(\kappa _*\)-fully linear model of \(f_N\) in \(B(x_k, \delta _k)\) is obtained.
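To make the dependence of (6) on the radius concrete, the following minimal sketch evaluates the two lower bounds for given constants. The function name `sample_size_bounds` and the numerical values in the usage note are illustrative assumptions, not taken from [14].

```python
import math


def sample_size_bounds(delta, V_f, V_g, eps_F, kappa, pi_f, pi_g, N):
    """Smallest integers M, S satisfying the two lower bounds in (6).

    The third returned value tells whether max{M, S} <= N, i.e. whether
    the bounds can be met by subsampling the N available terms.
    """
    M_min = math.ceil(V_f / (eps_F ** 2 * (1.0 - pi_f) * delta ** 4))
    S_min = math.ceil(V_g / (kappa ** 2 * (1.0 - pi_g) * delta ** 2))
    feasible = max(M_min, S_min) <= N
    return M_min, S_min, feasible
```

For instance, with all constants equal to 1, both probabilities equal to 0.5 and N = 100, halving the radius from 1 to 0.5 raises the bound on M from 2 to 32 and the bound on S from 2 to 8; at radius 0.25 the bound on M already exceeds N. The function sample size grows like \(\delta _k^{-4}\), the gradient sample size like \(\delta _k^{-2}\).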

In principle, conditions (3), (4) and \(\lim _{k\rightarrow \infty } \delta _k=0\) imply that \(f^{k,0}\), \(f^{k, {p}}\) and \(g_k\) will be computed at full precision for k sufficiently large. On the other hand, in applications such as machine learning, reaching full precision is unlikely since N is very large and termination is based on the maximum allowed computational effort or on the validation error.

2.1 Our contribution

We propose a trust-region procedure with random models based on (5) and combine it with the inexact restoration (IR) method for constrained optimization [24]. To this end, we make a simple transformation of (1) into a constrained problem. Specifically, letting \(I_M\) be an arbitrary nonempty subset of \(\{1, \ldots , N\}\) of cardinality \(|I_M|\) equal to M, we reformulate problem (1) as

$$\begin{aligned} \begin{aligned}&\min _{x\in \mathbb {R}^n} f_M(x) = \frac{1}{M}\sum _{i\in I_M} \phi _i(x),\\&\text{ s.t. } M=N. \end{aligned} \end{aligned}$$
(7)

The IR strategy allows us to improve feasibility and optimality in a modular way and gives rise to a procedure that differs from the existing trust-region schemes in the following respects. First, at each iteration a reference sample size is fixed and used as a guess for the approximation of function values. Second, the acceptance rule for the step is based on the condition \(\Vert g_k\Vert \ge \eta _2 \delta _k\), for some \(\eta _2>0\), and a sufficient decrease condition on a merit function that measures both the reduction of the objective function and the improvement in feasibility. Finally, the expected iteration complexity to satisfy an approximate first-order optimality condition is given, provided that, at each iteration k, the gradient estimates satisfy accuracy requirements of order \({{{\mathcal {O}}}}\left( \delta _k\right) \); such accuracy requirements implicitly govern function approximations and are, in general, less stringent than the corresponding ones in [13, 14, 20,21,22], as carefully detailed in Sect. 3.

Our theoretical analysis improves over the analysis carried out in [19] for a similar stochastic trust-region method coupled with inexact restoration, since here we neither rely on full precision, \(M=N\) in (7), being reached eventually, nor apply standard theory for unconstrained optimization. In fact, the expected number of iterations until a prescribed accuracy is reached is provided without invoking full precision.

3 The algorithm

In this section we introduce our new algorithm referred to as SIRTR (Stochastic Inexact Restoration Trust Region).

First, we introduce some basic ingredients of IR methods. The level of infeasibility with respect to the constraint \(M=N\) in (7) is measured by the following function h.

Assumption 1

Let \(h:\{1,2,\ldots ,N\}\rightarrow \mathbb {R}\) be a monotonically decreasing function such that \(h(1)>0\), \(h(N)=0\).

This assumption implies that there exist some positive \({\underline{h}}\) and \({\overline{h}}\) such that

$$\begin{aligned} {\underline{h}}\le h(M) \ \ \text{ if } \ \ 1\le M<N, \quad \text{ and } \quad h(M) \le {\overline{h}} \ \ \text{ if } \ \ 1\le M\le N. \end{aligned}$$
(8)

One possible choice is \(h(M)=(N-M)/N, \ 1\le M\le N\).
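For this particular choice of h, the constants in (8) can be written down explicitly. The sketch below is only an illustration; the function names `h` and `h_bounds` are hypothetical and \(N\ge 2\) is assumed.

```python
def h(M, N):
    """Infeasibility measure h(M) = (N - M) / N for 1 <= M <= N."""
    if not 1 <= M <= N:
        raise ValueError("M must satisfy 1 <= M <= N")
    return (N - M) / N


def h_bounds(N):
    """Constants of (8) for this h, assuming N >= 2.

    Since h is decreasing, its smallest value on 1 <= M < N is h(N - 1)
    and its largest value on 1 <= M <= N is h(1).
    """
    if N < 2:
        raise ValueError("N must be at least 2")
    return h(N - 1, N), h(1, N)
```

With N = 4, for example, this gives \({\underline{h}}=1/4\) and \({\overline{h}}=3/4\).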

The IR methods improve feasibility and optimality in a modular way using a merit function to balance the progress. Since the reductions in the objective function and infeasibility might be achieved to a different degree, the IR method employs the merit function

$$\begin{aligned} \Psi (x,M,\theta )=\theta f_M(x)+(1-\theta )h(M), \end{aligned}$$
(9)

with \(\theta \in (0,1). \)
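The merit function (9) is a convex combination of two scalars, which a few lines of code make explicit. In this sketch the function name `merit` is hypothetical, and the values \(f_M(x)\) and \(h(M)\) are assumed to have been computed beforehand.

```python
def merit(f_M_value, h_M_value, theta):
    """Merit function (9): theta * f_M(x) + (1 - theta) * h(M)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in (0, 1)")
    return theta * f_M_value + (1.0 - theta) * h_M_value
```

For example, `merit(2.0, 0.5, 0.25)` evaluates to 0.875. Decreasing \(\theta \) shifts weight from the objective value towards the infeasibility measure.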

Our SIRTR algorithm is a trust-region method that employs first-order random models. At a generic iteration k, we fix a trial sample size \(N_{k+1}^t\) and build a linear model \(m_k(p)\) around \(x_k\) of the form

$$\begin{aligned} m_k(p)=f_{N_{k+1}^t}(x_k)+g_k^Tp, \end{aligned}$$
(10)

where \(g_k\) is a random estimator to \(\nabla f_N(x_k)\). Then, we consider the trust-region problem

$$\begin{aligned} \min _{ \Vert p\Vert \le \delta _k} m_k(p), \end{aligned}$$
(11)

whose solution is

$$\begin{aligned} p_k = - \delta _k \frac{g_k}{\Vert g_k\Vert }. \end{aligned}$$
(12)

As in standard trust-region methods, we distinguish between successful and unsuccessful iterations. However, we do not employ here the classical acceptance condition, but a more elaborate one that involves the merit function (9).

The proposed method is sketched in Algorithm 3.1 and its steps are now discussed. At a generic iteration k, we have at hand the outcome of the previous iteration: the iterate \(x_k\), the sample sizes \( N_k\) and \({\widetilde{N}}_{k}\), the penalty parameter \( \theta _k \), and the flag iflag. If iflag=succ, the previous iteration was successful, i.e., \(x_{k}=x_{k-1}+p_{k-1}\); if iflag=unsucc, the previous iteration was unsuccessful, i.e., \(x_k=x_{k-1}\).

The scheduling procedure for generating the trial sample size \(N_{k+1}^t\) consists of Steps 1 and 2 of SIRTR. At Step 1, we determine a reference sample size \({\widetilde{N}}_{k+1}\le N\). If iflag=succ, then the infeasibility measure h is sufficiently decreased as stated in (20). If iflag=unsucc, \({\widetilde{N}}_{k+1}\) is left unchanged from the previous iteration, i.e., \({\widetilde{N}}_{k+1}={\widetilde{N}}_k\). We remark that (20) trivially implies \({\widetilde{N}}_{k+1}=N\) if \(N_k=N\) and that it holds at each iteration, even when it is not explicitly enforced at Step 1 (see forthcoming Lemma 1). In principle \({\widetilde{N}}_{k+1}\) could be the trial sample size but we aim at giving more freedom to the sample size selection process. Thus, at Step 2, we choose a trial sample size \( N_{k+1}^t\) complying with condition (21). On the one hand, such a condition allows the choice \( N_{k+1}^t< {\widetilde{N}}_{k+1}\) in order to reduce the computational effort; on the other hand, the choice \( N_{k+1}^t\ge {\widetilde{N}}_{k+1}\) is also possible in order to satisfy specific accuracy requirements that will be specified later. When \( N_{k+1}^t< {\widetilde{N}}_{k+1}\), condition (21) rules the largest possible distance between \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) in terms of \( \delta _k \); in case \( N_{k+1}^t\ge {\widetilde{N}}_{k+1}\), (21) is trivially satisfied.

At Step 3 we form the linear random model (10) and compute its minimizer within the trust-region. Specifically, we fix the cardinality \(N_{k+1,g}\) and choose the set of indices \(I_{N_{k+1,g}}\subseteq \{1, \ldots , N\}\) of cardinality \(N_{k+1,g}\). Then, we compute the estimator \(g_k\) of \(\nabla f_{N}(x_k)\) as

$$\begin{aligned} g_k= \frac{1}{N_{k+1,g}}\sum _{i\in I_{N_{k+1,g}}} \nabla \phi _i(x_k) \end{aligned}$$
(13)

and the solution \(p_k\) in (12) of the trust-region subproblem (11). Further, we compute \(m_k(p_k)\) where \(m_k\) is defined in (10) and

$$\begin{aligned} f_{N_{k+1}^t}(x_k)=\frac{1}{N_{k+1}^t}\sum _{i\in I_{N_{k+1}^t}} \phi _i(x_k), \end{aligned}$$
(14)

with \(I_{N_{k+1}^t}\subseteq \{1, \ldots , N\}\) being a set of cardinality \(N_{k+1}^t\).
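The sample averages (13) and (14) can be illustrated on a toy problem. The sketch below assumes the hypothetical one-dimensional terms \(\phi _i(x)=\tfrac{1}{2}(x-c_i)^2\), uses 0-based indices, and draws index sets without replacement; none of these choices is prescribed by the algorithm, which only fixes the cardinalities.

```python
import random


def sample_indices(N, size, rng):
    """Draw a subset of {0, ..., N-1} of the given cardinality, without replacement."""
    return rng.sample(range(N), size)


def f_estimate(x, c, idx):
    """Sample average of phi_i(x) = 0.5 * (x - c[i])**2 over idx, as in (14)."""
    return sum(0.5 * (x - c[i]) ** 2 for i in idx) / len(idx)


def g_estimate(x, c, idx):
    """Sample average of phi_i'(x) = x - c[i] over idx, as in (13)."""
    return sum(x - c[i] for i in idx) / len(idx)


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed, for reproducibility only
    c = [0.0, 2.0, 4.0, 6.0]
    idx_g = sample_indices(len(c), 2, rng)   # index set for the gradient
    idx_f = sample_indices(len(c), 3, rng)   # index set for the function
    print(g_estimate(1.0, c, idx_g), f_estimate(1.0, c, idx_f))
```

The two index sets are drawn independently here, reflecting the fact that the cardinalities \(N_{k+1,g}\) and \(N_{k+1}^t\) need not coincide.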

At Step 4 we compute the new penalty term \( \theta _{k+1}. \) The computation relies on the predicted reduction defined as

$$\begin{aligned} \mathrm{{Pred}}_k(\theta )=\theta (f_{N_k}(x_k)-m_k(p_k))+(1-\theta )(h(N_k)-h({\widetilde{N}}_{k+1})), \end{aligned}$$
(15)

where \(\theta \in (0,1)\). This predicted reduction is a convex combination of the usual predicted reduction \(f_{N_k}(x_k)-m_k(p_k) \) in trust-region methods, and the predicted reduction \( h(N_k)-h({\widetilde{N}}_{k+1}) \) in infeasibility obtained in Step 1. The new parameter \( \theta _{k+1} \) is computed so that

$$\begin{aligned} \mathrm{{Pred}}_k(\theta ) \ge \eta _1 (h(N_k)-h({\widetilde{N}}_{k+1})). \end{aligned}$$
(16)

If (16) is satisfied at \(\theta =\theta _k\) then \(\theta _{k+1}=\theta _k\), otherwise \( \theta _{k+1} \) is computed as the largest value for which the above inequality holds (see forthcoming Lemma 2).

Step 5 establishes if the iteration is successful or not. To this end, given a point \( {\hat{x}} \) and \(\theta \in (0,1) \), the actual reduction of \(\Psi \) at the point \( {\hat{x}}\) has the form

$$\begin{aligned} \mathrm{{Ared}}_k({\hat{x}},\theta )= & {} \Psi (x_k, N_k, \theta )- \Psi ({\hat{x}}, N_{k+1}^t, \theta ) \nonumber \\= & {} \theta (f_{N_k}{(x_k)} -f_{N_{k+1}^t}({\hat{x}}))+(1-\theta )(h(N_k)-h(N_{k+1}^t)), \qquad \end{aligned}$$
(17)

and the iteration is successful whenever the following two conditions are both satisfied

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k,\theta _{k+1})&\ge \eta _1 \mathrm{{Pred}}_k(\theta _{k+1}) \end{aligned}$$
(18)
$$\begin{aligned} \Vert g_k\Vert&\ge \eta _2 \delta _k. \end{aligned}$$
(19)

Otherwise the iteration is declared unsuccessful. If the iteration is successful, we accept the step and the trial sample size, set iflag=succ and possibly increase the trust-region radius through (23); the upper bound \( \delta _{\max } \) on the trust region size is imposed in (23). In case of unsuccessful iterations, we reject both the step and the trial sample size, set iflag=unsucc and decrease the trust region size.

Concerning conditions (18) and (19), we observe that the former mimics the classical acceptance criterion of standard trust-region methods while the latter drives \(\delta _k\) to zero as \(\Vert g_k\Vert \) tends to zero.
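The acceptance test of Step 5 can be sketched in terms of scalar quantities only. In the following minimal sketch all function and argument names are hypothetical; `h_ref` stands for \(h({\widetilde{N}}_{k+1})\), `h_trial` for \(h(N_{k+1}^t)\), and `f_trial` for \(f_{N_{k+1}^t}(x_k+p_k)\).

```python
def pred(theta, f_Nk, m_pk, h_Nk, h_ref):
    """Predicted reduction (15)."""
    return theta * (f_Nk - m_pk) + (1.0 - theta) * (h_Nk - h_ref)


def ared(theta, f_Nk, f_trial, h_Nk, h_trial):
    """Actual reduction (17) of the merit function at the trial point."""
    return theta * (f_Nk - f_trial) + (1.0 - theta) * (h_Nk - h_trial)


def is_successful(theta, f_Nk, f_trial, m_pk, h_Nk, h_ref, h_trial,
                  norm_g, delta, eta1, eta2):
    """Return True when both (18) and (19) hold."""
    sufficient_decrease = (
        ared(theta, f_Nk, f_trial, h_Nk, h_trial)
        >= eta1 * pred(theta, f_Nk, m_pk, h_Nk, h_ref)
    )
    gradient_test = norm_g >= eta2 * delta
    return sufficient_decrease and gradient_test
```

Note that the predicted reduction uses the reference sample size while the actual reduction uses the trial sample size, so the two infeasibility terms differ whenever \(N_{k+1}^t\ne {\widetilde{N}}_{k+1}\).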

[Figure: Algorithm 3.1 (SIRTR), Steps 1-5 with relations (20)-(24)]

We conclude the description of Algorithm 3.1 showing that condition (20) holds for all iterations, even when it is not explicitly enforced at Step 1.

Lemma 1

Let Assumption 1 hold and \(r\in (0,1)\) be the scalar in Algorithm 3.1. The sample sizes \({\widetilde{N}}_{k+1}\le N\) and \(N_k\le N\) generated by Algorithm 3.1 satisfy

$$\begin{aligned} h({\widetilde{N}}_{k+1})\le r h(N_k), \quad \forall k\ge 0 . \end{aligned}$$
(25)

Proof

We observe that, by Assumption 1, (25) trivially holds whenever \(N_k={\widetilde{N}}_{k+1}=N\).

Otherwise, we proceed by induction. The claim holds for \(k=0\), as we set iflag=succ at the first iteration and enforce (25) at Step 1. Now consider a generic iteration \({\bar{k}}\ge 1\) and suppose that (25) holds for \({\bar{k}}-1\). If iteration \({\bar{k}}-1\) is successful, then condition (25) is enforced for iteration \(\bar{k}\) at Step 1.

If iteration \(\bar{k}-1\) is unsuccessful, then at Step 5 we set \(N_{{\bar{k}}}=N_{{\bar{k}}-1}\). Subsequently, at Step 1 of iteration \(\bar{k}\) we set \({\widetilde{N}}_{{\bar{k}}+1}={\widetilde{N}}_{{\bar{k}}}\). Since (25) holds by induction at iteration \({\bar{k}}-1\), we have \(h({\widetilde{N}}_{{\bar{k}}})\le r h(N_{{\bar{k}}-1})\), which can be rewritten as \(h({\widetilde{N}}_{{\bar{k}}+1})\le r h(N_{{\bar{k}}})\) due to the previous assignments at Step 5 and Step 1. Then condition (25) holds also at iteration \({\bar{k}}\). \(\square \)
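One simple way of producing a reference sample size that satisfies (25) is sketched below, under two assumptions that are not part of Algorithm 3.1: the infeasibility measure is \(h(M)=(N-M)/N\), and the smallest admissible size is selected. Any other size in \(\{N_k,\ldots ,N\}\) satisfying (25) would be equally valid; the function names are hypothetical.

```python
def h(M, N):
    """Infeasibility measure h(M) = (N - M) / N (an assumed choice)."""
    return (N - M) / N


def reference_sample_size(N_k, N, r):
    """Smallest M in {N_k, ..., N} such that h(M) <= r * h(N_k)."""
    target = r * h(N_k, N)
    for M in range(N_k, N + 1):
        if h(M, N) <= target:
            return M
    return N  # not reached: h(N) = 0 always satisfies the inequality
```

With N = 64, \(N_k=16\) and \(r=0.5\), the target is \(0.5\cdot 48/64=0.375\) and the routine returns 40, since \(h(40)=24/64\) while \(h(39)=25/64\) is still too large. When \(N_k=N\) it returns N, in agreement with the remark that (25) then forces \({\widetilde{N}}_{k+1}=N\).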

3.1 On the sequences \(\{\theta _k\}\) and \( \{\delta _k\}\)

In this section, we analyze the properties of Algorithm 3.1. In particular, we prove that the sequence \(\{\theta _k\}\) is non increasing and uniformly bounded from below, and that the trust region radius \(\delta _k\) tends to 0 as \(k\rightarrow \infty \). We make the following assumption.

Assumption 2

Functions \(\phi _i\) are continuously differentiable for \(i=1,\ldots ,N\). There exists \(f_{low}\in \mathbb {R}\) such that

$$\begin{aligned} f_{M}(x)\ge f_{low}, \quad \ 1\le M\le N, \ x\in \mathbb {R}^n. \end{aligned}$$

Furthermore, there exist \(\Omega \subset {\mathbb {R}}^n\) and \(f_{up}\in \mathbb {R}\) such that

$$\begin{aligned} f_{M}(x)\le f_{up}, \quad \ 1\le M\le N, \ x\in \Omega , \end{aligned}$$

and all iterates generated by Algorithm 3.1 belong to \(\Omega \).

In the following, we let

$$\begin{aligned} \kappa _\phi = \max \{ |f_{low}|, |f_{up} |\}. \end{aligned}$$
(26)

Remark 1

In the context of machine learning, the above assumption is verified in several cases, e.g., the mean-squares loss function coupled with either the sigmoid, the softmax or the hyperbolic tangent activation function; the mean-squares loss function coupled with ReLU or ELU activation functions and bound constraints (above and below) on all variables; the logistic loss function coupled again with bound constraints (above and below) on the unknowns [25].

In the analysis that follows we will consider two options for \( {\hat{x}}\) in (17), \({\hat{x}} = x_k + p_k \) for successful iterations and \( {\hat{x}} = x_k \) for unsuccessful iterations.

Our first result characterizes the sequence \(\{\theta _k\}\) of the penalty parameters; the proof follows closely [19, Lemma 2.2].

Lemma 2

Let Assumptions 1 and 2 hold. Then the sequence \( \{\theta _k\} \) is positive, non increasing and bounded from below, \(\theta _{k+1}\ge {\underline{\theta }}>0\) with \({\underline{\theta }}\) independent of k and (16) holds with \(\theta =\theta _{k+1}\).

Proof

We note that \(\theta _0>0\) and proceed by induction assuming that \(\theta _k\) is positive. Due to Lemma 1, for all iterations k we have that \(N_k\le {\widetilde{N}}_{k+1}\) and that \(N_k={\widetilde{N}}_{k+1}\) if and only if \(N_k=N\). First consider the case where \(N_k={\widetilde{N}}_{k+1}\) (or equivalently \(N_k={\widetilde{N}}_{k+1}=N\)); then it holds \(h(N_k)-h({\widetilde{N}}_{k+1})=0\), and \(N_{k+1}^t=N\) by Step 2. Therefore, we have \(\mathrm{{Pred}}_k(\theta )=\theta \delta _k \Vert g_k\Vert >0\) for any positive \(\theta \), and (22) implies \(\theta _{k+1}=\theta _k \). Let us now consider the case \(N_k<{\widetilde{N}}_{k+1}\). If inequality \(\mathrm{{Pred}}_k(\theta _k) \ge \eta _1(h(N_k)-h({\widetilde{N}}_{k+1}))\) holds then (22) gives \(\theta _{k+1}=\theta _k\). Otherwise, we have

$$\begin{aligned} \theta _k \left( f_{N_k}(x_k)-m_k(p_k)-( h(N_k)-h({\widetilde{N}}_{k+1}) )\right) < {(\eta _1-1)\left( h(N_k)-h({\widetilde{N}}_{k+1})\right) } , \end{aligned}$$

and since the right hand-side is negative by assumption, it follows

$$\begin{aligned} f_{N_k}(x_k)-m_k(p_k)-(h(N_k)-h({\widetilde{N}}_{k+1}))<0. \end{aligned}$$

Consequently, \(\mathrm{{Pred}}_k(\theta )\ge \eta _1(h(N_k)-h({\widetilde{N}}_{k+1}))\) is satisfied if

$$\begin{aligned} \theta (f_{N_k}(x_k)-m_k(p_k)-(h(N_k)-h({\widetilde{N}}_{k+1})))\ge (\eta _1-1)(h(N_k)-h({\widetilde{N}}_{k+1})), \end{aligned}$$

i.e., if

$$\begin{aligned} \theta \le \theta _{k+1}{\mathop {=}\limits ^\mathrm{def}}\frac{(1-\eta _1)(h(N_k)-h({\widetilde{N}}_{k+1}))}{m_k(p_k)-f_{N_k}(x_k)+h(N_k)-h({\widetilde{N}}_{k+1})}. \end{aligned}$$

Hence \(\theta _{k+1}\) is the largest value satisfying (16) and \( \theta _{k+1} < \theta _k. \)

Let us now prove that \( \theta _{k+1} \ge {\underline{\theta }}. \) Note that by (25) and (8)

$$\begin{aligned} h(N_k)-h({\widetilde{N}}_{k+1})\ge (1-r)h(N_k)\ge (1-r) {\underline{h}}. \end{aligned}$$
(27)

Using (26)

$$\begin{aligned} m_k(p_k)-f_{N_k}(x_k)+h(N_k)-h({\widetilde{N}}_{k+1})\le & {} m_k(p_k)-f_{N_k}(x_k)+h(N_k)\\\le & {} f_{N_{k+1}^t}(x_k)-\delta _k \Vert g_k\Vert - f_{N_k}(x_k)+ {\overline{h}} \\\le & {} |f_{N_{k+1}^t}(x_k)-f_{N_k}(x_k) |+ {\overline{h}} \le 2 \kappa _{\phi }+ {\overline{h}}, \end{aligned}$$

and \(\theta _{k+1}\) in (22) satisfies

$$\begin{aligned} \theta _{k+1} \ge {\underline{\theta }}=\frac{(1-\eta _1)(1-r) {\underline{h}}}{2 \kappa _{\phi }+ {\overline{h}}}, \end{aligned}$$
(28)

which completes the proof. \(\square \)
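The update of the penalty parameter derived in the proof can be sketched as follows. The function name `update_theta` is hypothetical and `dh` denotes \(h(N_k)-h({\widetilde{N}}_{k+1})\); the sketch follows the formula obtained above and assumes `dh > 0` whenever the second branch is taken, as is the case in the proof.

```python
def update_theta(theta_k, f_Nk, m_pk, dh, eta1):
    """Keep theta_k if (16) holds at theta_k, otherwise return the largest
    theta satisfying (16), as derived in the proof of Lemma 2.
    """
    pred_k = theta_k * (f_Nk - m_pk) + (1.0 - theta_k) * dh
    if pred_k >= eta1 * dh:
        return theta_k
    # Here f_Nk - m_pk - dh < 0, so the denominator below is positive.
    return (1.0 - eta1) * dh / (m_pk - f_Nk + dh)
```

For instance, with \(\theta _k=0.5\), \(f_{N_k}(x_k)=1\), \(m_k(p_k)=3\), `dh = 0.5` and \(\eta _1=0.5\), the predicted reduction at \(\theta _k\) equals \(-0.75\), which violates (16); the update returns \(0.25/2.5=0.1\), and at this value the predicted reduction equals \(\eta _1\) times `dh`, i.e. 0.25, so (16) holds with equality.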

In the following, we derive bounds for the actual reduction \(\mathrm{{Ared}}_k(x_{k+1},\theta _{k+1})\) in case of successful iterations and distinguish the iteration indexes k as below:

$$\begin{aligned} {{{\mathcal {I}}}}_1= & {} \{ k\ge 0 \text{ s.t. } N_k<{\widetilde{N}}_{k+1}\} , \end{aligned}$$
(29)
$$\begin{aligned} {{{\mathcal {I}}}}_2= & {} \{ k\ge 0 \text{ s.t. } N_k={\widetilde{N}}_{k+1}\} . \end{aligned}$$
(30)

Note that \({\mathcal {I}}_1,{\mathcal {I}}_2\) are disjoint and any iteration index k belongs to exactly one of these subsets. Moreover, (25) yields \({\widetilde{N}}_{k+1}=N_k= N_{k+1}^t=N\) for any \(k \in {\mathcal {I}}_2\).

Lemma 3

Let Assumptions 1-2 hold and suppose that iteration k is successful. If \(k \in {{{\mathcal {I}}}}_1\) then

$$\begin{aligned} \mathrm{{Ared}}_k(x_{k+1},\theta _{k+1}) \ge \frac{\eta _1^2(1-r){\underline{h}}}{\delta _{\max }^2}\delta _k^2. \end{aligned}$$
(31)

Otherwise,

$$\begin{aligned} \mathrm{{Ared}}_k(x_{k+1},\theta _{k+1}) \ge \eta _1\eta _2 \underline{\theta }\delta _k^2. \end{aligned}$$
(32)

Proof

Since iteration k is successful, \(x_{k+1}=x_k+p_k\) and (18) hold. Suppose \(k \in {{{\mathcal {I}}}}_1\). By (18) and (16)

$$\begin{aligned} \mathrm{{Ared}}_k( x_k + p_k ,\theta _{k+1}) \ge \eta _1 \mathrm{{Pred}}_k(\theta _{k+1}) \ge \eta _1^2 (h(N_k) - h({\widetilde{N}}_{k+1})). \end{aligned}$$

In virtue of Lemma 1 we have \( h(N_k) - h({\widetilde{N}}_{k+1}) \ge (1-r) h(N_k)\), hence we obtain

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k, \theta _{k+1}) \ge \eta _1^2 (1-r) h(N_k). \end{aligned}$$

Dividing and multiplying the right-hand side above by \(\delta _{k}^2\), applying the inequalities \( {\underline{h}}\le h(N_k)\), \(\delta _k\le \delta _{\max }\), we get (31).

Suppose \(k \in {{{\mathcal {I}}}}_2\). Then \(N_k={\widetilde{N}}_{k+1}\) and by the definition of \(\mathrm{{Pred}}_k(\theta _{k+1})\) and Lemma 2, we have

$$\begin{aligned} \mathrm{{Pred}}_k(\theta _{k+1}) = \theta _{k+1}(f_N(x_k) - m_k(p_k)) = \theta _{k+1} \delta _k \Vert g_k\Vert \ge {\underline{\theta }} \delta _k\Vert g_k\Vert , \end{aligned}$$

and therefore (18), (19) and Lemma 2 yield (32). \(\square \)

Let us now define a Lyapunov type function \( \Phi \) inspired by the paper [14]. Assumption 1 implies that \( h(N_k) \) is bounded from above while Assumption 2 implies that \( f_{N_k}(x) \) is bounded from below if \(x\in \Omega \). Thus, there exists a constant \( \Sigma \) such that

$$\begin{aligned} f_{N_k}(x) - h(N_k) + \Sigma \ge 0 , \quad x \in \Omega , \quad k\ge 0. \end{aligned}$$
(33)

Definition 1

Let \( v \in (0,1) \) be a fixed constant. For all \(k\ge 0\), we define

$$\begin{aligned} \phi _k{\mathop {=}\limits ^\mathrm{def}}\Phi (x_k,N_k,\theta _k, \delta _k) = v\left( \Psi (x_k,N_k,\theta _k) + \theta _k \Sigma \right) + (1-v) \delta _k^2, \end{aligned}$$
(34)

where \(\Psi \) is the merit function given in (9) and \(\Sigma \) is given in (33).
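As a small numerical illustration of (34), the sketch below evaluates \(\Phi \) from scalar inputs. The function name `lyapunov` is hypothetical, and `f_val`, `h_val` stand for \(f_{N_k}(x_k)\) and \(h(N_k)\).

```python
def lyapunov(f_val, h_val, theta, delta, v, Sigma):
    """Value of Phi in (34): v * (Psi + theta * Sigma) + (1 - v) * delta**2."""
    psi = theta * f_val + (1.0 - theta) * h_val  # merit function (9)
    return v * (psi + theta * Sigma) + (1.0 - v) * delta ** 2
```

With \(f_{N_k}(x_k)=1\), \(h(N_k)=0.5\), \(\theta _k=0.5\), \(\delta _k=2\), \(v=0.5\) and \(\Sigma =1\), one gets \(\Psi =0.75\) and \(\phi _k=0.5\cdot 1.25+0.5\cdot 4=2.625\). Since \(\Sigma \ge h(N_k)-f_{N_k}(x_k)\) in this example, the value is at least \(v\,h(N_k)=0.25\).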

The choice of \(v\in (0,1)\) in the above definition will be specified below. First, note that \(\phi _k\) is bounded below for all \(k\ge 0\),

$$\begin{aligned} \phi _k\ge & {} v\left( \Psi (x_k,N_k,\theta _k) + \theta _k \Sigma \right) \nonumber \\\ge & {} v\left( \theta _k f_{N_k}(x_k)+(1-\theta _k)h(N_k)+\theta _k (-f_{N_k}(x_k)+h(N_k))\right) \nonumber \\\ge & {} v h(N_k) \ge 0. \end{aligned}$$
(35)

Second, adding and subtracting suitable terms, by the definition (34) and for all \(k\ge 0\), we have

$$\begin{aligned} \phi _{k+1} - \phi _{k}= & {} v\left( \theta _{k+1}f_{N_{k+1}}(x_{k+1}) + (1-\theta _{k+1})h(N_{k+1})\right) \nonumber \\&-v\left( \theta _{k}f_{N_{k}}(x_{k}) + (1-\theta _{k})h(N_{k})\right) + v(\theta _{k+1} - \theta _k)\Sigma \nonumber \\&+ (1-v)(\delta _{k+1}^2 - \delta _k^2) \nonumber \\= & {} v\left( \theta _{k+1}f_{N_{k+1}}(x_{k+1}) + (1-\theta _{k+1})h(N_{k+1})\right) \pm v \theta _{k+1} f_{N_{k}}(x_{k}) \nonumber \\&\pm v(1-\theta _{k+1}) h(N_k) \nonumber \\&-v\left( \theta _{k}f_{N_{k}}(x_{k}) + (1-\theta _{k})h(N_{k})\right) + v(\theta _{k+1} - \theta _k)\Sigma \nonumber \\&+ (1-v)(\delta _{k+1}^2 - \delta _k^2) \nonumber \\= & {} v\left( {\theta _{k+1}(f_{N_{k+1}}(x_{k+1})-f_{N_k}(x_k))+(1-\theta _{k+1})(h(N_{k+1})-h(N_k))}\right) \nonumber \\&+ v(\theta _{k+1} - \theta _k)(f_{N_k}(x_k) - h(N_k) +\Sigma ) +(1-v)(\delta _{k+1}^2 - \delta _k^2). \end{aligned}$$
(36)

If the iteration k is successful, then using (33), the monotonicity of \(\{\theta _k\}_{k\in {\mathbb {N}}}\) proved in Lemma 2, and the fact that \(N_{k+1}=N_{k+1}^t\), the equality (36) yields

$$\begin{aligned} \phi _{k+1} - \phi _{k}\le -v \mathrm{{Ared}}_k(x_{k+1},\theta _{k+1})+(1-v)(\delta _{k+1}^2 - \delta _k^2). \end{aligned}$$
(37)

Otherwise, if the iteration k is unsuccessful, then \(x_{k+1}=x_k\), \(N_{k+1}=N_k\) and thus the first quantity at the right-hand side of equality (36) is zero. Hence using again (33) and the monotonicity of \(\{\theta _k\}_{k\in {\mathbb {N}}}\), we obtain

$$\begin{aligned} \phi _{k+1} - \phi _{k}\le (1-v)(\delta _{k+1}^2 - \delta _k^2). \end{aligned}$$
(38)

Now we provide bounds for the change of \(\Phi \) along subsequent iterations and again distinguish the two cases \(k\in {{{\mathcal {I}}}}_1, {{{\mathcal {I}}}}_2\) stated in (29)-(30).

Lemma 4

Let Assumptions 1 and 2 hold.

(i) If the iteration k is unsuccessful, then

$$\begin{aligned} {\phi _{k+1} - \phi _k \le \chi _1 \delta _k^2, \qquad \chi _1= (1-v)\frac{1-\gamma ^2}{\gamma ^2}.} \end{aligned}$$
(39)

(ii) If the iteration k is successful and \(k\in {\mathcal {I}}_1\), then

$$\begin{aligned} {\phi _{k+1} - \phi _k \le \chi _2 \delta _k^2, \qquad \chi _2= \left( {-v\left( \frac{\eta _1^2(1-r){\underline{h}}}{\delta _{\max }^2}\right) }+(1-v)(\gamma ^2-1) \right) .} \end{aligned}$$
(40)

If the iteration k is successful and \(k\in {\mathcal {I}}_2\), then

$$\begin{aligned} {\phi _{k+1} - \phi _k \le \chi _3\delta _k^2, \qquad \chi _3= \left( - v \eta _1\eta _2 {{\underline{\theta }}} + (1-v)(\gamma ^2-1) \right) .} \end{aligned}$$
(41)

Proof

(i) If iteration k is unsuccessful, the updating rule (24) for \(\delta _{k+1}\) implies \(\delta _{k+1}= \delta _k/\gamma \). Thus, equation (38) directly yields (39).

(ii) If iteration k is successful, the updating rule (23) for \(\delta _{k+1}\) implies \(\delta _{k+1}\le \gamma \delta _k\). Thus combining (37) with Lemma 3 we obtain (40) and (41). \(\square \)
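The radius updates used in this proof can be sketched as follows. The unsuccessful branch is the rule \(\delta _{k+1}=\delta _k/\gamma \) quoted above. For the successful branch the proof only uses \(\delta _{k+1}\le \gamma \delta _k\) together with the cap \(\delta _{\max }\); the form \(\min \{\gamma \delta _k,\delta _{\max }\}\) used below is an assumption consistent with these two facts, and the function name `update_radius` is hypothetical.

```python
def update_radius(delta, success, gamma, delta_max):
    """Trust-region radius update, assuming gamma > 1."""
    if gamma <= 1.0:
        raise ValueError("gamma must be larger than 1")
    if success:
        # Assumed form of rule (23): enlarge, but never beyond delta_max.
        return min(gamma * delta, delta_max)
    # Rule (24) as used in the proof of Lemma 4: shrink by the factor gamma.
    return delta / gamma
```

In the unsuccessful case with \(\gamma =2\) and \(\delta _k=1\) one obtains \(\delta _{k+1}=0.5\), so that \(\delta _{k+1}^2-\delta _k^2=-0.75=\frac{1-\gamma ^2}{\gamma ^2}\delta _k^2\), which is the factor appearing in \(\chi _1\).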

We are now ready to prove that a sufficient decrease condition holds for \(\Phi \) along subsequent iterations and that \(\delta _k\) tends to zero.

Theorem 1

Let Assumptions 1 and 2 hold. There exists \(\sigma >0\), depending on \(v\in (0,1)\) in (34), such that

$$\begin{aligned} \phi _{k+1}- \phi _k\le -\sigma \delta _k^2, \quad \text{ for } \text{ all } k\ge 0. \end{aligned}$$
(42)

Proof

In case of unsuccessful iterations, (39) provides a sufficient decrease of \(\phi _k\) for any value of \(v\in (0,1)\), since \(\chi _1<0\) for \(\gamma >1\). In case of successful iterations, \(\chi _2\) and \(\chi _3\) in (40) and (41) are both negative if

$$\begin{aligned} \max \left\{ {\frac{(\gamma ^2-1)\delta _{\max }^2}{\eta _1^2(1-r){\underline{h}}+(\gamma ^2-1)\delta _{\max }^2}},\frac{\gamma ^2-1}{\eta _1\eta _2 {{\underline{\theta }}}+\gamma ^2-1}\right\}<v<1. \end{aligned}$$
(43)

Therefore, if v is chosen as above and

$$\begin{aligned} \sigma =\min \{-\chi _1,\, -\chi _2,\, -\chi _3\}, \end{aligned}$$
(44)

then (39)–(41) imply (42) and the proof is completed. \(\square \)
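The threshold on \(v\) in (43) and the resulting constants can be checked numerically. The following Python sketch is only illustrative: the function names and the sample parameter values are assumptions made here and are not part of Algorithm 3.1. The constant \(\sigma \) is computed as the smallest of \(-\chi _1\), \(-\chi _2\), \(-\chi _3\), which is the positive quantity that (42) requires when all three \(\chi _i\) are negative.

```python
def v_lower_bound(gamma, eta1, eta2, r, h_low, theta_low, delta_max):
    """Left-hand side of (43): any v strictly between this value and 1 is admissible."""
    growth = gamma**2 - 1.0
    t1 = growth * delta_max**2 / (eta1**2 * (1.0 - r) * h_low + growth * delta_max**2)
    t2 = growth / (eta1 * eta2 * theta_low + growth)
    return max(t1, t2)


def decrease_constants(v, gamma, eta1, eta2, r, h_low, theta_low, delta_max):
    """Constants chi_1, chi_2, chi_3 from (39), (40) and (41)."""
    growth = gamma**2 - 1.0
    chi1 = (1.0 - v) * (1.0 - gamma**2) / gamma**2
    chi2 = -v * eta1**2 * (1.0 - r) * h_low / delta_max**2 + (1.0 - v) * growth
    chi3 = -v * eta1 * eta2 * theta_low + (1.0 - v) * growth
    return chi1, chi2, chi3


# Hypothetical parameter values, chosen only for illustration.
params = dict(gamma=2.0, eta1=0.1, eta2=0.1, r=0.5,
              h_low=0.1, theta_low=0.5, delta_max=10.0)
v = 0.5 * (v_lower_bound(**params) + 1.0)   # midpoint of the admissible interval
chi = decrease_constants(v, **params)
sigma = min(-c for c in chi)                # positive when every chi_i is negative
```

With these values the admissible interval for \(v\) is very narrow and close to 1, which suggests that \(\sigma \) may be small in practice.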

Theorem 2

Let Assumptions 1 and 2 hold. Then the sequence \(\{\delta _k\}\) in Algorithm 3.1 satisfies

$$\begin{aligned} \lim _{k\rightarrow \infty } \delta _k=0. \end{aligned}$$

Proof

Under the stated conditions Theorem 1 holds and summing up (42) for \(j=0,1,\ldots ,k-1\), we obtain

$$\begin{aligned} \phi _{k}-\phi _0= \sum _{j=0}^{k-1} ( \phi _{j+1}-\phi _j)\le -\sigma \sum _{j=0}^{k-1} \delta _j^2. \end{aligned}$$

Given that, by (35), \( \phi _k \) is bounded from below for all k,  we conclude that \( \sum _{j=0}^{\infty } \delta _j^2<\infty \), and hence \( \lim _{j\rightarrow \infty } \delta _j= 0.\) \(\square \)

3.2 Complexity analysis

Algorithm 3.1 generates a random process since the function estimates in (14) and gradient estimates in (13) are random. All random quantities are denoted by capital letters, while the use of small letters is reserved for their realizations. In particular, the iterates \(X_k\), the trust region radius \(\Delta _k\), the steps \(P_k\), the function estimates \(F_{N_{k+1}^t}(X_k),F_{N_{k+1}^t}(X_k+P_k)\), the gradient estimates \(G_k,\nabla F_{N_{k+1}^t}(X_k)\), and the value \(\Phi _k\) of the function \(\Phi \) in (34) at iteration k are random variables, while \(x_k\), \(\delta _k\), \(p_k\), \(f_{N_{k+1}^t}(x_k),f_{N_{k+1}^t}(x_k+p_k)\), \(g_k,\nabla f_{N_{k+1}^t}(x_k)\), \(\phi _k\) are their realizations.

In this section, our aim is to derive a bound on the expected number of iterations that Algorithm 3.1 performs to reach a desired accuracy. We show that our algorithm fits into the stochastic framework given in [13, Section 2] and consequently we derive an upper bound on the expected value of the hitting time \({{{\mathcal {K}}}}_{\epsilon }\) defined below.

Definition 2

Given \(\epsilon >0\), the hitting time \({{{\mathcal {K}}}}_{\epsilon }\) is the random variable

$$\begin{aligned} {{{\mathcal {K}}}}_{\epsilon }=\min \{k\ge 0: \ \Vert \nabla f_N(X_k)\Vert \le \epsilon \}, \end{aligned}$$

i.e., \({{{\mathcal {K}}}}_{\epsilon }\) is the first iteration such that \(\Vert \nabla f_N(X_k)\Vert \le \epsilon \).
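For a single realization, the hitting time can be read off the sequence of full gradient norms. The sketch below assumes, purely for illustration, that the values \(\Vert \nabla f_N(x_k)\Vert \) are available in a list; the function name is hypothetical.

```python
def hitting_time(grad_norms, eps):
    """Return the first index k with grad_norms[k] <= eps, or None if it is never reached."""
    for k, norm in enumerate(grad_norms):
        if norm <= eps:
            return k
    return None
```

For example, `hitting_time([1.0, 0.5, 0.05, 0.2], 0.1)` returns 2, the first index at which the tolerance is met, even though the norm increases again afterwards.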

Our analysis relies on the assumption that \(g_k\) and \(\nabla f_{N_{k+1}^t}(x_k)\) are probabilistically accurate estimators of the true gradient at \(x_k\), in the sense that the events

$$\begin{aligned} {{{\mathcal {G}}}}_{k,1}= & {} \{ \Vert \nabla f_N(X_k)-G_k\Vert \le \nu \Delta _k \}, \end{aligned}$$
(45)
$$\begin{aligned} {{{\mathcal {G}}}}_{k,2}= & {} \{\Vert \nabla f_N(X_k) - \nabla f_{N_{k+1}^t}(X_k)\Vert \le \nu \Delta _k \}, \end{aligned}$$
(46)

hold with conditional probability at least \(\pi _1\in (0,1) \) and \(\pi _2\in (0,1)\), respectively. Using the same terminology as [26, 27], we say that iteration k is true if both \({\mathcal {G}}_{k,1}\) and \({\mathcal {G}}_{k,2}\) are true. Furthermore, we introduce the two random variables

$$\begin{aligned} I_k=\mathbbm {1}_{{\mathcal {G}}_{k,1}}, \quad J_k=\mathbbm {1}_{{\mathcal {G}}_{k,2}}, \end{aligned}$$
(47)

where \(\mathbbm {1}_A\) denotes the indicator function of an event A.

Finally, we need the following additional assumptions.

Assumption 3

The gradients \( \nabla \phi _i\) are Lipschitz continuous with constant \(L_i\). Let \({L=\frac{1}{2}\max _{1\le i\le N} L_i }\).

Under Assumptions 2 and 3, the norm of the gradient estimates \(\Vert g_k\Vert \) is bounded, as shown below.

Lemma 5

Let Assumptions 2 and 3 hold. Then there exists \(g_{\max }\) such that

$$\begin{aligned} \Vert g_k\Vert \le g_{\max }, \quad k\ge 0, \end{aligned}$$
(48)

where \(g_{max}=\sqrt{8L\kappa _{\phi }}\) and \(\kappa _{\phi }\) is given in (26).

Proof

By Assumption 3, it easily follows that \(\nabla f_{N_{k+1,g}}\) is Lipschitz continuous on \({\mathbb {R}}^n\) with constant 2L. Then Assumption 2 and the descent lemma for continuously differentiable functions with Lipschitz continuous gradient [28, Proposition A.24] ensure that

$$\begin{aligned} f_{low}\le f_{N_{k+1,g}}(y)\le f_{N_{k+1,g}}(x)+\nabla f_{N_{k+1,g}}(x)^T(y-x)+L\Vert y-x\Vert ^2, \quad \forall \ x,y\in \mathbb {R}^n. \end{aligned}$$

Taking the minimum of the right-hand side with respect to y, we can also write

$$\begin{aligned} f_{low}\le \min \limits _{y\in {\mathbb {R}}^n}\zeta (y)\equiv f_{N_{k+1,g}}(x)+\nabla f_{N_{k+1,g}}(x)^T(y-x)+L\Vert y-x\Vert ^2, \quad \forall \ x\in \mathbb {R}^n. \end{aligned}$$

The minimum of \(\zeta (y)\) is attained at the point \(\bar{y}=\displaystyle x-\frac{1}{2L}\nabla f_{N_{k+1,g}}(x)\) and letting \(y=\bar{y}\) in the previous inequality, we get:

$$\begin{aligned} f_{low}\le f_{N_{k+1,g}}(x)-\frac{1}{2L}\Vert \nabla f_{N_{k+1,g}}(x)\Vert ^2+\frac{1}{4L}\Vert \nabla f_{N_{k+1,g}}(x)\Vert ^2, \quad \forall \ x\in \mathbb {R}^n, \end{aligned}$$

and equivalently \(\Vert \nabla f_{N_{k+1,g}}(x)\Vert ^2\le 4L(f_{N_{k+1,g}}(x)-f_{low})\) for all \(x\in \mathbb {R}^n\). Using again Assumption 2, we have \(f_{N_{k+1,g}}(x)-f_{low}\le |f_{N_{k+1,g}}(x)|+|f_{low}|\le 2\kappa _{\phi }\), and consequently

$$\begin{aligned} \Vert \nabla f_{N_{k+1,g}}(x)\Vert ^2\le 8L\kappa _{\phi }, \quad \forall \ x\in \mathbb {R}^n. \end{aligned}$$

Setting \(x=x_k\) and \(g_{max}=\sqrt{8L\kappa _{\phi }}\) in the previous inequality, we get the result. \(\square \)
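As a sanity check of the intermediate inequality (an illustration added here, not part of the proof), consider the one-dimensional quadratic \(\psi (x)=\frac{a}{2}x^2\) with \(a>0\), whose derivative is Lipschitz continuous with constant \(a=2L\) and whose infimum is \(\psi _{low}=0\). Then

$$\begin{aligned} \Vert \nabla \psi (x)\Vert ^2=a^2x^2=4\cdot \frac{a}{2}\cdot \frac{a}{2}x^2=4L(\psi (x)-\psi _{low}), \end{aligned}$$

so the bound \(\Vert \nabla f\Vert ^2\le 4L(f-f_{low})\) holds with equality for this function and cannot be improved in general.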

First, we analyze the occurrence of successful iterations and show that the availability of accurate gradients has an impact on the acceptance of the trial steps. The following lemma establishes that if the iteration k is true and \(\delta _k\) is smaller than a certain threshold, then the iteration is successful. The analysis is presented for a single realization of Algorithm 3.1 and specializes for k in the sets \( {{{\mathcal {I}}}}_1\), \( {{{\mathcal {I}}}}_2\).

Lemma 6

Let Assumptions 1–3 hold and suppose that iteration k is true.

  1. (i)

    If \(k \in {{{\mathcal {I}}}}_1\), then the iteration is successful whenever

    $$\begin{aligned} \delta _k \le \min \left\{ \frac{\Vert g_k\Vert }{\eta _3} , \, \frac{\Vert g_k\Vert }{\eta _2} \right\} , \end{aligned}$$
    (49)

    where \(\eta _3=\frac{\delta _{\max }g_{\max }(\theta _0(2\nu +L)+(1-{\underline{\theta }})\mu )}{\eta _1(1-\eta _1)(1-r){\underline{h}}}\).

  2. (ii)

    If \(k \in {{{\mathcal {I}}}}_2\), then the iteration is successful whenever

    $$\begin{aligned} { \delta _k \le \min \left\{ \frac{ (1-\eta _1)\Vert g_k\Vert }{2\nu +L} ,\, \frac{\Vert g_k\Vert }{\eta _2} \right\} .} \end{aligned}$$
    (50)

Proof

From Assumption 3, it follows that \(\nabla f_{N_{k+1}^t}\) is Lipschitz continuous with constant 2L. Then,

$$\begin{aligned}&|m_k(p_k) - f_{ {N_{k+1}^t}}(x_k + p_k)|= \left|\int _{0}^{1} \left( g_k \pm \nabla f_{ {N_{k+1}^t}}(x_k) -\nabla f_{ {N_{k+1}^t}}(x_k + \tau p_k) \right) ^T p_k d \tau \right|\nonumber \\&\le \int _{0}^{1} \Vert g_k-\nabla f_{{N_{k+1}^t}}(x_k)\Vert \Vert p_k\Vert d \tau + \int _{0}^{1} 2L \tau \Vert p_k\Vert ^2 d \tau \nonumber \\&\le \int _{0}^{1} ({\Vert g_k-\nabla f_{N}(x_k)\Vert +\Vert \nabla f_N(x_k)-\nabla f_{{N_{k+1}^t}}(x_k)\Vert }) \Vert p_k\Vert d \tau \nonumber \\&+ \int _{0}^{1} 2L \tau \Vert p_k\Vert ^2 d \tau \end{aligned}$$
(51)

and, since \({{{\mathcal {G}}}}_{k,1}\) and \({{{\mathcal {G}}}}_{k,2}\) are both true, (45) and (46) yield

$$\begin{aligned} |m_k(p_k) - f_{{N_{k+1}^t}}(x_k + p_k)|\le ({2}\nu +L )\delta _k^2. \end{aligned}$$
(52)

Now, let us analyze condition (18) for successful iterations.

(i) If \(k \in {{{\mathcal {I}}}}_1\), by (15), (17) and (16) we obtain

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k,\theta _{k+1}) - \eta _1 \mathrm{{Pred}}_k(\theta _{k+1})= & {} (1-\eta _1)\mathrm{{Pred}}_k(\theta _{k+1}) \nonumber \\&\quad +\mathrm{{Ared}}_k(\theta _{k+1})-\mathrm{{Pred}}_k(\theta _{k+1}) \nonumber \\&= (1-\eta _1) \mathrm{{Pred}}_k(\theta _{k+1}) \nonumber \\&\quad + \theta _{k+1}(m_k(p_k)-f_{N_{k+1}^t}(x_{k}+p_k)) \nonumber \\&\quad + (1-\theta _{k+1})(h({\widetilde{N}}_{k+1})-h(N_{k+1}^t)) \nonumber \\&\ge \eta _1 (1-\eta _1) (h(N_k)-h({\widetilde{N}}_{k+1}))\nonumber \\&\quad +\theta _{k+1}(m_k(p_k) -f_{N_{k+1}^t}(x_{k}+p_k)) \nonumber \\&\quad +(1-\theta _{k+1})(h({\widetilde{N}}_{k+1})-h(N_{k+1}^t)). \end{aligned}$$
(53)

Using (52), (21) and \({{\underline{\theta }}}\le \theta _{k+1}\le \theta _0\), we also have

$$\begin{aligned} {\theta _{k+1}(f_{N_{k+1}^t}(x_{k}+p_k)-m_k(p_k)) }&\quad { + (1-\theta _{k+1})(h(N_{k+1}^t)-h({\widetilde{N}}_{k+1}))} \nonumber \\&\le (\theta _0(2\nu +L)+(1-{\underline{\theta }})\mu ) \delta _k^2 . \end{aligned}$$
(54)

Note that the combination of (25), (8), (23) and Lemma 5, guarantees that

$$\begin{aligned} h(N_k)-h({\widetilde{N}}_{k+1})\ge (1-r)h(N_k)\ge \frac{(1-r){\underline{h}}\delta _k\Vert g_k\Vert }{\delta _{\max }g_{\max }}. \end{aligned}$$
(55)

Then, from (53), (54), and (55), we have

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k,\theta _{k+1}) - \eta _1 \mathrm{{Pred}}_k(\theta _{k+1})&\ge \frac{\eta _1(1-\eta _1)(1-r){\underline{h}}\delta _k\Vert g_k\Vert }{\delta _{\max }g_{\max }} \\&\quad - { (\theta _0(2\nu +L)+(1-{\underline{\theta }})\mu ) \delta _k^2}. \end{aligned}$$

Combining this result with (19), we conclude that the iteration is successful whenever (49) holds.

(ii) Using (15), (17), \(k \in {{{\mathcal {I}}}}_2\), we have

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k,\theta _{k+1}) - \eta _1 \mathrm{{Pred}}_k(\theta _{k+1})= & {} (1-\eta _1)\mathrm{{Pred}}_k(\theta _{k+1}) \\&\quad +\mathrm{{Ared}}_k(\theta _{k+1})-\mathrm{{Pred}}_k(\theta _{k+1})\\&= (1-\eta _1) \theta _{k+1}\delta _k\Vert g_k\Vert \\&\quad + \theta _{k+1}(m_k(p_k)-{f_N}(x_k+p_k)) \end{aligned}$$

Using (52) we get

$$\begin{aligned} \mathrm{{Ared}}_k(x_k+p_k,\theta _{k+1}) - \eta _1 \mathrm{{Pred}}_k(\theta _{k+1})&\ge (1-\eta _1)\theta _{k+1} \delta _k\Vert g_k\Vert \nonumber \\&\quad -\ \ \theta _{k+1}(2\nu +L)\delta _k^2. \end{aligned}$$
(56)

Combining the above inequality with (19), we have proved that the iteration is successful whenever (50) holds. \(\square \)
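The two thresholds in Lemma 6 are simple closed-form expressions in the algorithmic constants. The following sketch evaluates them; the function names, the optional-argument convention used to switch between \({{{\mathcal {I}}}}_1\) and \({{{\mathcal {I}}}}_2\), and any numerical values passed in are assumptions made for illustration.

```python
def eta3(delta_max, g_max, theta0, theta_low, nu, L, mu, eta1, r, h_low):
    """Constant eta_3 defined after (49)."""
    numerator = delta_max * g_max * (theta0 * (2.0 * nu + L) + (1.0 - theta_low) * mu)
    return numerator / (eta1 * (1.0 - eta1) * (1.0 - r) * h_low)


def success_radius(g_norm, eta1, eta2, nu, L, eta3_value=None):
    """Largest delta_k for which a true iteration is guaranteed to be successful.

    Pass eta3_value for k in I_1 (bound (49)); leave it as None for k in I_2 (bound (50)).
    """
    if eta3_value is not None:
        return min(g_norm / eta3_value, g_norm / eta2)
    return min((1.0 - eta1) * g_norm / (2.0 * nu + L), g_norm / eta2)
```

Both bounds scale linearly with \(\Vert g_k\Vert \), so a larger gradient estimate tolerates a larger trust-region radius.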

We can now guarantee that a successful iteration k occurs whenever k is true, the prefixed accuracy \(\epsilon \) in Definition 2 has not been achieved at k, and \(\delta _k\) is below a certain threshold depending on \(\epsilon \). Again, the result is stated for a single realization of the algorithm.

Lemma 7

Let Assumptions 1–3 hold. Suppose that \(\Vert \nabla f_N(x_k)\Vert > \epsilon \), for some \(\epsilon >0\), the iteration k is true, and

$$\begin{aligned} \delta _k<\delta ^\dagger := \min \left\{ \frac{\epsilon }{2\nu }, \frac{\epsilon }{2\eta _2}, \frac{\epsilon }{2\eta _3} ,\frac{ \epsilon (1-\eta _1)}{2(2\nu +L)} \right\} . \end{aligned}$$
(57)

Then, iteration k is successful.

Proof

By \(\Vert \nabla f_N(x_k)\Vert >\epsilon \), the occurrence of \({{\mathcal {G}}_{k,1}}\) and (57), we have

$$\begin{aligned} \Vert g_{k}-\nabla f_N(x_k)\Vert \le \nu \delta _k <\frac{\epsilon }{2}, \end{aligned}$$

and this yields \(\Vert g_k\Vert \ge \frac{\epsilon }{2}\). Then, Lemma 6 implies that iteration k is successful. \(\square \)
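The threshold \(\delta ^\dagger \) in (57) can be evaluated directly once \(\eta _3\) is known. The sketch below is a minimal illustration; the function name and argument order are hypothetical, and \(\eta _3\) is taken as an input.

```python
def delta_dagger(eps, nu, eta1, eta2, eta3_value, L):
    """Threshold (57): below it, a true iteration with gradient norm > eps is successful."""
    return min(eps / (2.0 * nu),
               eps / (2.0 * eta2),
               eps / (2.0 * eta3_value),
               eps * (1.0 - eta1) / (2.0 * (2.0 * nu + L)))
```

Each term is proportional to \(\epsilon \), so halving the tolerance halves the threshold.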

We now proceed similarly to [13, Section 2] and analyse the random process \(\{(\Phi _k,\Delta _k,W_k)\}_{k\in {\mathbb {N}}}\) generated by Algorithm 3.1, where \(\Phi _k\) is the random variable whose realization is given in (34) and \(W_k\) is the random variable defined as

$$\begin{aligned} {\left\{ \begin{array}{ll} W_0 =1\\ W_{k+1}=2\left( I_kJ_k-\frac{1}{2}\right) , \quad k=0,1,\ldots \end{array}\right. } \end{aligned}$$
(58)

Clearly, \(W_k\) takes values \(\pm 1\). We denote with \({\mathbb {P}}_{k-1}(\cdot )\) and \({\mathbb {E}}_{k-1}(\cdot )\) the probability and expected value conditioned on the \(\sigma \)-algebra generated by \(F_{N_{1}^t}(X_0),\ldots ,F_{N_{k}^t}(X_{k-1})\), \(\nabla F_{N_{1}^t}(X_0),\ldots ,\nabla F_{N_{k}^t}(X_{k-1})\), \(G_0,\ldots ,G_{k-1}\). Then, we can prove the following result.

Lemma 8

Let Assumptions 1–3 hold, and let v be as in (43), \(\delta ^\dagger \) as in (57) and \({{{\mathcal {K}}}}_{\epsilon }\) as in Definition 2. Suppose there exists some \(j_{\max }\ge 0\) such that \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\), and \(\delta _0>\delta ^{\dagger }\). Assume that the estimators \(G_k\) and \(\nabla f_{N_{k+1}^t}(X_k)\) are conditionally independent random variables, and the events \({{{\mathcal {G}}}}_{k,1},{{{\mathcal {G}}}}_{k,2}\) occur with sufficiently high probability, i.e.,

$$\begin{aligned} {\mathbb {P}}_{k-1}({{{\mathcal {G}}}}_{k,1})= \pi _1, \quad {\mathbb {P}}_{k-1}({{{\mathcal {G}}}}_{k,2})= \pi _2, \quad \text {and }{\pi _3}=\pi _1\pi _2>\frac{1}{2}. \end{aligned}$$
(59)

Then,

  1. (i)

    there exists \(\lambda >0\) such that \(\Delta _k\le \delta _0e^{\lambda \cdot j_{\max }}\) for all \(k\ge 0\);

  2. (ii)

    there exists a constant \( \delta _{\epsilon }=\delta _0e^{\lambda \cdot j_{\epsilon }}\) for some \(j_{\epsilon }\le 0\) such that, for all \(k\ge 0\),

    $$\begin{aligned} \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Delta _{k+1}\ge \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\min \{\Delta _ke^{\lambda W_{k+1}}, \delta _{\epsilon } \}, \end{aligned}$$
    (60)

    where \(W_{k+1}\) satisfies

    $$\begin{aligned} {\mathbb {P}}_{k-1}(W_{k+1}=1)={\pi _3}, \quad {\mathbb {P}}_{k-1}(W_{k+1}=-1)=1-{\pi _3}; \end{aligned}$$
    (61)
  3. (iii)

    there exists a nondecreasing function \(\ell :[0,\infty )\rightarrow (0,\infty )\) and a constant \(\Theta >0\) such that, for all \(k\ge 0\),

    $$\begin{aligned} \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}{\mathbb {E}}_{k-1}[\Phi _{k+1}]\le \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Phi _k-\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Theta \ell (\Delta _k). \end{aligned}$$
    (62)

Proof

The proof parallels that of [13, Lemma 7].

(i) Since \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\), we can set \(\lambda =\log (\gamma )>0\), and the claim follows from Step 5 of Algorithm 3.1.

(ii) Let us set

$$\begin{aligned} { \delta _\epsilon =\frac{\epsilon }{\xi }}, \quad \text {where }\xi \ge \max \left\{ 2\nu ,2\eta _2, { 2\eta _3,}\frac{2(2\nu +L)}{1-\eta _1}\right\} , \end{aligned}$$
(63)

and assume that \( \delta _{\epsilon }=\gamma ^{j_{\epsilon }}\delta _0\), for some integer \(j_{\epsilon }\le 0\); notice that we can always choose \(\xi \) sufficiently large so that this is true. As a consequence, \(\Delta _k= \gamma ^{i_k} \delta _\epsilon \) for some integer \(i_k\).

When \(\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}=0\), inequality (60) trivially holds. Otherwise, conditioning on \(\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}=1\), we can prove that

$$\begin{aligned} \Delta _{k+1}\ge \min \{\delta _\epsilon ,\min \{\delta _{\max },\gamma \Delta _k\}I_kJ_k+\gamma ^{-1}\Delta _k(1-I_kJ_k)\}. \end{aligned}$$
(64)

Indeed, for any realization such that \(\delta _k>\delta _{\epsilon }\), we have \(\delta _k\ge \gamma \delta _{\epsilon }\) and because of Step 5, it follows that \(\delta _{k+1}\ge \delta _{\epsilon }\). Now let us consider a realization such that \(\delta _k\le \delta _{\epsilon }\). Since \({{{\mathcal {K}}}}_{\epsilon }>k\) and \(\delta _{\epsilon }\le \delta ^{\dagger }\), if \(I_kJ_k=1\) (i.e., k is true), then we can apply Lemma 7 and conclude that k is successful. Hence, by Step 5, we have \(\delta _{k+1}=\min \{\delta _{\max },\gamma \delta _k\}\). If \(I_kJ_k=0\), then we cannot guarantee that k is successful; however, again using Step 5, we can write \(\delta _{k+1}\ge \gamma ^{-1}\delta _k\). Combining these two cases, we get (64). If we observe that \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\ge { \gamma ^{j_{\epsilon }}\delta _{0}= \delta _{\epsilon }}\), and recall the definition of \(W_k\) in (58), then equation (64) easily yields (60). The probabilistic conditions (61) are a consequence of (59).

(iii) The claim trivially follows from (42) with \(\ell (\Delta )=\Delta ^2\) and \(\Theta =\sigma \). \(\square \)

The previous lemma shows that the random process \(\{(\Phi _k,\Delta _k,W_k)\}_{k\in {\mathbb {N}}}\) complies with Assumption 2.1 of [13].
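The right-hand side of (60) describes a random walk on a logarithmic scale, capped at \(\delta _{\epsilon }\), with an upward drift when \(\pi _3>\frac{1}{2}\). The sketch below simulates the recursion obtained by taking (60) with equality. It is a simplified illustration: the function name, the seed and the parameter values are hypothetical, and the variables \(W_{k+1}\) are drawn independently, which is stronger than the conditional statement (61).

```python
import random


def simulate_lower_bound(delta0, gamma, delta_eps, p, n_steps, seed=0):
    """Simulate Z_{k+1} = min(Z_k * gamma**W_{k+1}, delta_eps), with W = +1 w.p. p, else -1.

    Since lambda = log(gamma), the factor e^{lambda W} in (60) equals gamma**W.
    """
    rng = random.Random(seed)
    z = [delta0]
    for _ in range(n_steps):
        w = 1 if rng.random() < p else -1
        z.append(min(z[-1] * gamma**w, delta_eps))
    return z
```

With `p=1.0` the sequence settles at `delta_eps` after one step, whereas with `p=0.0` it decays geometrically with ratio `1/gamma`; values of `p` above one half spend most of the time near the cap.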

Theorem 3

Under the assumptions of Lemma 8, we have

$$\begin{aligned} { {\mathbb {E}}[{{{\mathcal {K}}}}_{\epsilon }]\le \frac{{\pi _3}}{2{\pi _3}-1}\cdot \frac{\phi _0\xi ^2}{{\sigma } \epsilon ^2}}+1. \end{aligned}$$
(65)

where \(\xi \) is chosen as in (63) and \(\sigma \) is given in (44).

Proof

The claim follows directly by [13, Theorem 2]. \(\square \)
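To get a feeling for the size of the bound (65), one can evaluate its right-hand side for given constants. The helper below is a minimal sketch with a hypothetical name; the numerical values in the usage note are illustrative and do not come from the experiments.

```python
def expected_iterations_bound(pi3, phi0, xi, sigma, eps):
    """Right-hand side of (65); requires pi3 > 1/2 as in (59)."""
    if not 0.5 < pi3 <= 1.0:
        raise ValueError("pi3 must lie in (1/2, 1]")
    return pi3 / (2.0 * pi3 - 1.0) * phi0 * xi**2 / (sigma * eps**2) + 1.0
```

For instance, with `pi3=0.75`, `phi0=1.0`, `xi=2.0`, `sigma=0.5` and `eps=0.1` the bound is approximately 1201. Halving `eps` multiplies the leading term by four, reflecting the \({{{\mathcal {O}}}}(\epsilon ^{-2})\) dependence, while the factor \(\pi _3/(2\pi _3-1)\) grows without bound as \(\pi _3\) approaches \(\frac{1}{2}\).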

Remark 2

The requirement that (45) and (46) hold in probability is less stringent than the overall conditions (3) and (4). Analogously to the discussion in Sect. 2, if \({\mathbb {E}}_{\omega }[|\nabla \phi _{\omega }(x)-\nabla f_N(x)|^2]\le V_g\), then Chebyshev's inequality guarantees that events (45) and (46) hold in probability when

$$\begin{aligned} \frac{V_g}{\nu ^2(1-\pi _1)\delta _k^2} \le N_{k+1,g} \le N , \quad \frac{V_g}{\nu ^2(1-\pi _2)\delta _k^2 }\le N_{k+1}^t \le N. \end{aligned}$$

Clearly, \(\min \{N_{k+1,g}, N_{k+1}^t\}={{{\mathcal {O}}}}(\delta _k^{-2})\) and in general these sample sizes are expected to grow more slowly than in (6).

Finally, the complexity theory presented improves on [19] where the iteration complexity before reaching full precision \(M=N\) in (7) is estimated, and thereafter existing iteration complexity results for trust-region methods applied to (1) are invoked.
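The Chebyshev-based lower bounds on the sample sizes in this remark can be evaluated as follows. This is a sketch under the stated variance assumption; the function name is hypothetical and the variance bound \(V_g\) is assumed to be known, which is rarely the case in practice.

```python
import math


def chebyshev_sample_size(v_g, nu, pi, delta, n_max):
    """Smallest integer sample size meeting the lower bound in Remark 2, capped at n_max."""
    lower = v_g / (nu**2 * (1.0 - pi) * delta**2)
    return min(n_max, math.ceil(lower))
```

Halving `delta` multiplies the required sample size by four until the cap `n_max` (the full sample size N) is reached.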

4 Numerical experience

In this section, we evaluate the numerical performance of SIRTR on some nonconvex optimization problems arising in binary classification and regression.

All the numerical results have been obtained by running MATLAB R2019a on an Intel Core i7-4510U CPU 2.00-2.60 GHz with 8 GB of RAM. For all our tests, we equip SIRTR with \(\delta _0=1\) as the initial trust-region radius, \(\delta _{\max }=100\), \(\gamma =2\), \(\eta _1=10^{-1}\), \(\eta _2=10^{-6}\). Concerning the inexact restoration phase, we borrow the implementation details from [19]. Specifically, the infeasibility measure h and the initial penalty parameter \(\theta _0\) are set as follows:

$$\begin{aligned} h(M)=\frac{N-M}{N}, \quad \theta _0=0.9. \end{aligned}$$

The updating rule for choosing \({\widetilde{N}}_{k+1}\) has the form

$$\begin{aligned} {\widetilde{N}}_{k+1}=\min \{N,\lceil {\widetilde{c}} N_k\rceil \}, \end{aligned}$$
(66)

where \(1<{\widetilde{c}}<2\) is a prefixed constant factor; note that this choice of \({\widetilde{N}}_{k+1}\) satisfies (21) with \(r=(N-({\widetilde{c}}-1))/N\). At Step 2 the function sample size \(N_{k+1}^t\) is computed using the rule

$$\begin{aligned} {N_{k+1}^t}={\left\{ \begin{array}{ll} \lceil {\widetilde{N}}_{k+1}-\mu N \delta _k^2\rceil , \quad &{} \text {if }\lceil {\widetilde{N}}_{k+1}-\mu N\delta _k^2\rceil \in [N_0,0.95N]\\ {\widetilde{N}}_{k+1}, \quad &{} \text {if }\lceil {\widetilde{N}}_{k+1}-\mu N\delta _k^2\rceil < N_0\\ N, \quad &{} \text {if }\lceil {\widetilde{N}}_{k+1}-\mu N\delta _k^2\rceil > 0.95 N. \end{array}\right. } \end{aligned}$$
(67)

Once the set \(I_{N_{k+1}^t}\) is fixed, the search direction \(g_k\in {\mathbb {R}}^n\) is computed via sampling as in (13) and the sample size \(N_{k+1,g}\) is fixed as

$$\begin{aligned} N_{k+1,g}=\lceil c N_{k+1}^t\rceil , \end{aligned}$$
(68)

with \(c\in (0,1]\) and \(I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}\).
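The three update rules (66), (67) and (68) can be summarized in a few lines of code. The experiments were run in MATLAB; the Python function below is only a sketch of the rules as stated above, with a hypothetical name and signature, and it does not include the random selection of the index sets.

```python
import math


def next_sample_sizes(n_k, delta_k, N, N0, c_tilde, mu, c):
    """Return (N_tilde, N_t, N_g) following rules (66), (67) and (68)."""
    n_tilde = min(N, math.ceil(c_tilde * n_k))            # rule (66)
    candidate = math.ceil(n_tilde - mu * N * delta_k**2)  # trial value in rule (67)
    if candidate < N0:
        n_t = n_tilde
    elif candidate > 0.95 * N:
        n_t = N
    else:
        n_t = candidate
    n_g = math.ceil(c * n_t)                              # rule (68)
    return n_tilde, n_t, n_g
```

A large radius `delta_k` pushes the trial value below `N0`, in which case the rule falls back to `n_tilde`; a small radius leaves the trial value close to `n_tilde`.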

4.1 SIRTR performance

In the following, we show the numerical behaviour of SIRTR on nonconvex binary classification problems. Let \(\{(a_i, b_i)\}_{i=1}^N\) denote the pairs forming a training set with \(a_i \in \mathbb {R}^n\) containing the entries of the i-th example, and \(b_i\in \{0, 1\}\) representing the corresponding label. Then, we address the following minimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}f_N(x)=\frac{1}{N}\sum _{i=1}^N \left( b_i-\frac{1}{1+e^{-a_i^Tx}} \right) ^2, \end{aligned}$$
(69)

where the nonconvex objective function \(f_N\) is obtained by composing a least-squares loss with the sigmoid function.
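For concreteness, the objective (69) and its gradient can be evaluated on any subset of the training pairs as sketched below. This is a plain Python illustration with hypothetical function names, not the MATLAB code used for the experiments; the gradient of each term follows from the chain rule, \(\nabla \phi _i(x)=-2(b_i-s_i)s_i(1-s_i)a_i\) with \(s_i=1/(1+e^{-a_i^Tx})\).

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_and_grad(x, samples):
    """Average of (b_i - sigmoid(a_i^T x))^2 over `samples`, and its gradient.

    `samples` is a list of (a_i, b_i) pairs; passing a subset of the training
    set gives a subsampled estimate of f_N and of its gradient.
    """
    n = len(x)
    loss, grad = 0.0, [0.0] * n
    for a, b in samples:
        s = sigmoid(sum(ai * xi for ai, xi in zip(a, x)))
        residual = b - s
        loss += residual**2
        coeff = -2.0 * residual * s * (1.0 - s)   # chain rule through the sigmoid
        for j in range(n):
            grad[j] += coeff * a[j]
    m = len(samples)
    return loss / m, [g / m for g in grad]
```

At \(x=0\) every prediction equals \(\frac{1}{2}\), so the loss is \(\frac{1}{4}\) regardless of the labels.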

In Table 1, we report the information related to the datasets employed, including the number N of training examples, the dimension n of each example and the dimension \(N_T\) of the testing set \(I_{N_T}\).

Table 1 Data sets used

We focus on three aspects: the classification error provided by the final iterate, the computational cost, and the occurrence of termination before full accuracy in function evaluations is reached. The last issue is crucial because it indicates the ability of the inexact restoration approach to solve (69) with random models and to govern sampling and steplength selection.

The average classification error provided by the final iterate, say \(x_{\mathrm{fin}}\), is defined as

$$\begin{aligned} \mathtt{err}=\frac{1}{N_T}\sum \limits _{i\in I_{N_T}}|b_i-b_i^{pred} |, \end{aligned}$$
(70)

where \(b_i\) is the exact label of the \(i\)-th instance of the testing set, and \(b_i^{pred}\) is the corresponding predicted label, given by \(b_i^{pred}={\text {max}}\{{\text {sign}}(a_i^Tx_{\mathrm{fin}}),0\}\).
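The error measure (70) amounts to the fraction of misclassified test examples. A minimal sketch, with a hypothetical function name and plain Python lists standing in for the data, is:

```python
def classification_error(x_fin, test_set):
    """Fraction of misclassified test examples, as in (70).

    The predicted label is max(sign(a^T x), 0): 1 if a^T x > 0, else 0.
    """
    errors = 0
    for a, b in test_set:
        score = sum(ai * xi for ai, xi in zip(a, x_fin))
        predicted = 1 if score > 0 else 0
        errors += abs(b - predicted)
    return errors / len(test_set)
```

Note that a zero score is mapped to the label 0, since \({\text {sign}}(0)=0\).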

The computational cost is measured in terms of full function and gradient evaluations. In our test problems, the main cost in the computation of \(\phi _i\), \(1\le i\le N\), is the scalar product \(a_i^Tx\): once this product is evaluated, it can be reused for computing \(\nabla \phi _i\). Nonetheless, following [32, Section 3.3], we count both function and gradient evaluations as if we were addressing a classification problem based on a neural net. Thus, computing a single function \(\phi _i\) requires \(\frac{1}{N}\) forward propagations, whereas the gradient evaluation corresponds to \(\frac{2}{N}\) propagations (an additional backward propagation is needed). Note that, once \(\phi _i\) is computed, the corresponding gradient \(\nabla \phi _i\) requires only \(\frac{1}{N}\) backward propagations. Hence, as in our implementation \(I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}\), the computational cost of SIRTR at each iteration k is determined by \(\frac{N_{k+1}^t+N_{k+1,g}}{N}\) propagations.

For all experiments in this section, we run SIRTR with \(x_0=(0,0,\ldots ,0)^T\) as initial guess, and stop it when either a maximum of 1000 iterations is reached or a maximum of 500 full function evaluations is performed or the condition

$$\begin{aligned} |f_{N_k}(x_k)-f_{N_{k-1}}(x_{k-1})|\le \epsilon |f_{N_{k-1}}(x_{k-1})|+\epsilon , \end{aligned}$$
(71)

with \(\epsilon =10^{-3}\), holds for a number of consecutive successful iterations such that the computational effort is equal to the effort needed in three iterations with full function and gradient evaluations.

Since the selection of sets \(I_{N_{k+1}^t}\) and \(I_{N_{k+1,g}}\) for computing \(f_{N_{k+1}^t}(x_k)\) and \(g_k\) is random, we perform 50 runs of SIRTR for each test problem. Results are reported in tables where the headings of the columns have the following meaning: cost is the overall number of full function and gradient evaluations averaged over the 50 runs, err is the classification error given in (70) averaged over the 50 runs, and sub is the number of runs where the method is stopped before reaching full accuracy in function evaluations.

In a first set of experiments, we investigate the choice of \(N_{k+1,g}\) by varying the factor \(c\in (0,1]\) in (68). In particular, letting \({\widetilde{c}}=1.2\) in (66), \(\mu =100/N\) in (67) and \(N_0=\lceil 0.1N\rceil \) as in [19], we test the values \(c\in \{0.1,0.2,1\}\). The results obtained are reported in Table 2. We note that the classification error varies only slightly with respect to the choice of \(N_{k+1,g}\), and that selecting \(N_{k+1,g}\) as a small fraction of \(N_{k+1}^t\) is quite convenient from a computational point of view. By contrast, the choice \(N_{k+1,g}= N_{k+1}^t\) leads to the largest computational costs without providing a significant gain in accuracy. Besides the cost per iteration, equal to \(\frac{2N_{k+1}^t}{N}\) in this latter case, we observe that full accuracy in function evaluations is reached very often especially for certain datasets, see e.g., cina0, cod-rna, covertype, ijcnn1, phishing, real-sim. Remarkably, the results in Table 2 highlight that random models compare favourably with respect to cost and classification errors.

Table 2 Results with three different rules for computing the sample size \(N_{k+1,g}\)

Next, we show that the computational cost of SIRTR can be reduced by slowing down the growth rate of \(N_{k+1}^t\). This task can be achieved by controlling the growth of \({\widetilde{N}}_{k+1}\), which affects \(N_{k+1}^t\) by means of (67). Letting \(c=0.1\), \(\mu =100/N\) and \(N_0=\lceil 0.1N\rceil \), we consider the choices \({\widetilde{c}}\in \{1.05,1.1,1.2\}\) in (66). Average results are reported in Table 3. We can observe that the fastest growth rate for \({\widetilde{N}}_{k+1}\) is generally more expensive than the other two choices, while the classification error is similar for all three choices. Moreover, for \({\widetilde{c}}= 1.05\), most runs stopped before reaching full function accuracy.

Table 3 Results with three different rules for computing the sample size \({\widetilde{N}}_{k+1}\)

We now analyze three different values, \(N_0\in \{\lceil 0.001N\rceil , \lceil 0.01N\rceil ,\lceil 0.1N\rceil \}\), for the initial sample size \(N_0\). We apply SIRTR with \(\tilde{c}=1.05\) in (66), \(\mu =100/N\) in (67), and \(c=0.1\) in (68). Results are reported in Table 4. We can see that, by reducing \(N_0\), the number of full function/gradient evaluations can be further reduced on some datasets, and that for \(N_0= \lceil 0.01N\rceil \) the average classification error compares well with the error when \(N_0= \lceil 0.1N\rceil \); in fact, the best results for most datasets are obtained by shrinking \(N_0\) to \(1\%\) of the maximum sample size. We conclude by pointing out that most of the runs are performed without reaching full precision in function evaluation.

As a further confirmation of the efficiency of SIRTR, in Table 5 we report the sample sizes obtained on average at the stopping iteration of SIRTR with parameters setting \(N_0=\lceil 0.01 N\rceil \), \(N_{k+1,g}=\lceil 0.1 { N_{k+1}^t}\rceil \), \({\widetilde{N}}_{k+1}=\min \{N,\lceil 1.05 N_k\rceil \}\), \(\mu =100/N\). More specifically, for each dataset, we show the mean value \({\overline{N}}_{\mathrm{fin}}\) obtained by averaging the sample sizes \(N_{\mathrm{fin},i}\), \( 1\le i\le 50\), used at the final iteration of SIRTR, the relative standard deviation \(s=\frac{1}{{\overline{N}}_{\mathrm{fin}}} \sqrt{\frac{\sum _{i=1}^{50}(N_{\mathrm{fin},i}-{\overline{N}}_{\mathrm{fin}})^2}{50}}\) as a measure of dispersion of the final sample sizes with respect to the mean value, and the minimum and maximum sample sizes \(N_{\mathrm{fin}}^{\min },N_{\mathrm{fin}}^{\max }\) observed at the final iteration out of the 50 runs. From the reported values, we deduce that SIRTR terminates with a final sample size which is much smaller, on average, than the maximum sample size N.
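The summary statistics described above can be computed as in the following sketch, where the function name is hypothetical and the standard deviation uses the population form (division by the number of runs), as in the definition of \(s\).

```python
import math


def final_size_summary(sizes):
    """Mean, relative standard deviation (population form), minimum and maximum."""
    mean = sum(sizes) / len(sizes)
    variance = sum((n - mean) ** 2 for n in sizes) / len(sizes)
    return mean, math.sqrt(variance) / mean, min(sizes), max(sizes)
```

A value of \(s\) close to zero indicates that the final sample size is nearly the same across runs.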

Table 4 Results with three different initial sample sizes \(N_0\)
Table 5 Average sample size \({\overline{N}}_{\mathrm{fin}}\) obtained at the final iteration, relative standard deviation s, minimum and maximum sample sizes \(N_{\mathrm{fin}}^{\min },N_{\mathrm{fin}}^{\max }\) observed at the final iteration. Parameters setting: \(N_0=\lceil 0.01 N\rceil \), \(N_{k+1,g}=\lceil 0.1 { N_{k+1}^t}\rceil \), \({\widetilde{N}}_{k+1}=\min \{N,\lceil 1.05 N_k\rceil \}\), \(\mu =100/N\)

Finally, in Figs. 1 and 2, we report the plots of the sample sizes \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) with respect to the number of iterations, obtained by running SIRTR on the a9a and mnist datasets, respectively. In particular, we let either \(\mu =100/N\) or \(\mu =1\) in the update rule (67), \(\tilde{c}=1.05\) in (66), \(c=0.1\) in (68) and \(N_0=\lceil 0.1 N\rceil \). Note that a larger \(\mu \) allows both \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) to decrease in the first iterations, whereas a linear growth rate is imposed only in later iterations. This behaviour is due to the update condition (67), which naturally forces \(N_{k+1}^t\) to coincide with \({\widetilde{N}}_{k+1}\) when \(\delta _k\) is sufficiently small. For both choices of \(\mu \), we see that \(N_{k+1}^t\) can grow more slowly than \({\widetilde{N}}_{k+1}\) at some iterations, thus reducing the computational cost per iteration of SIRTR.

Fig. 1 Dataset a9a. Sample sizes \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) versus iterations with \(\mu =100/N\) (left) and \(\mu =1\) (right), respectively, obtained with a single run of SIRTR. Classification errors: err = 0.187 with \(\mu =100/N\), err = 0.174 with \(\mu =1\)

Fig. 2 Dataset mnist. Sample sizes \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) versus iterations with \(\mu =100/N\) (left) and \(\mu =1\) (right), respectively, obtained with a single run of SIRTR. Classification errors: err = 0.154 with \(\mu =100/N\), err = 0.167 with \(\mu =1\)

4.2 Comparison with TRish

In this section we compare the performance of SIRTR with the so-called Trust-Region-ish algorithm (TRish) recently proposed in [33]. TRish is a stochastic gradient method based on a trust-region methodology. Normalized steps are used in a dynamic manner whenever the norm of the stochastic gradient is within a prefixed interval. In particular, the \(k\)-th iteration of TRish is given by

$$\begin{aligned} x_{k+1}=x_k-{\left\{ \begin{array}{ll} \gamma _{1,k}\alpha _kg_k, \quad &{} \text {if } \Vert g_k\Vert \in \left[ 0,\frac{1}{\gamma _{1,k}}\right) \\ \alpha _k\frac{g_k}{\Vert g_k\Vert }, \quad &{} \text {if }\Vert g_k\Vert \in \left[ \frac{1}{\gamma _{1,k}},\frac{1}{\gamma _{2,k}} \right] \\ \gamma _{2,k}\alpha _kg_k, \quad &{}\text {if }\Vert g_k\Vert \in \left( \frac{1}{\gamma _{2,k}},\infty \right) \end{array}\right. } \end{aligned}$$

where \(\alpha _k>0\) is the steplength parameter, \(0<\gamma _{2,k}<\gamma _{1,k}\) are positive constants, and \(g_k\in {\mathbb {R}}^{n}\) is a stochastic gradient estimate. This algorithm has proven to be particularly effective on binary classification and neural network training, especially if compared with the standard stochastic gradient algorithm [33, Section 4].
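The three-case update above translates directly into code. The following sketch implements a single TRish step for given \(\alpha _k\), \(\gamma _{1,k}\), \(\gamma _{2,k}\); the function name is hypothetical and the stochastic gradient is assumed to be supplied by the caller.

```python
import math


def trish_step(x, g, alpha, gamma1, gamma2):
    """One TRish update with 0 < gamma2 < gamma1, following the three cases above."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm < 1.0 / gamma1:
        scale = gamma1 * alpha          # small gradient: scaled gradient step
    elif norm <= 1.0 / gamma2:
        scale = alpha / norm            # intermediate gradient: normalized step
    else:
        scale = gamma2 * alpha          # large gradient: scaled gradient step
    return [xi - scale * gi for xi, gi in zip(x, g)]
```

In the intermediate case the step has length exactly `alpha`, and the update is continuous in the gradient norm at the two interval endpoints.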

For our numerical tests, we implement TRish with subsampled gradients \(g_k=\nabla f_{S}(x_k)\) defined in (5). The steplength is constant, \(\alpha _k=\alpha \), \(\forall k\ge 0\), and \(\alpha \) is chosen in the set \( \{10^{-3},10^{-1},\sqrt{10^{-1}},1,\sqrt{10}\}\). Following the procedure in [33, Section 4], we use constant parameters \(\gamma _{1,k}\equiv \gamma _1\), \(\gamma _{2,k}\equiv \gamma _2\) and select \(\gamma _1,\, \gamma _2\) as follows. First, the stochastic gradient algorithm [4] is run with constant steplength equal to 1; second, the average norm G of stochastic gradient estimates throughout the runs is computed; third, \(\gamma _1,\, \gamma _2\) are set as \(\gamma _1=\frac{4}{G}\), \(\gamma _2=\frac{1}{2G}\).
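The last step of this tuning procedure can be sketched as follows, assuming the stochastic gradient norms recorded during the preliminary runs are stored in a list; the function name is hypothetical.

```python
def trish_gammas(sg_gradient_norms):
    """Return (gamma1, gamma2) = (4/G, 1/(2G)), with G the average recorded norm."""
    G = sum(sg_gradient_norms) / len(sg_gradient_norms)
    return 4.0 / G, 1.0 / (2.0 * G)
```

With this rule the ratio \(\gamma _1/\gamma _2\) is always equal to 8, independently of G.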

Fig. 3 From top to bottom row: datasets a9a, htru2, mnist, phishing. From left to right: average classification error, testing loss, and training loss versus epochs

First, we compare TRish with SIRTR on the nonconvex optimization problem (69), using a9a, htru2, mnist, and phishing as datasets (see Table 1). Based on the previous section, we equip SIRTR with \(N_0=\lceil 0.01 N\rceil \), \(N_{k+1,g}=\lceil 0.1 { N_{k+1}^t}\rceil \), \({\widetilde{N}}_{k+1}=\min \{N,\lceil 1.05 N_k\rceil \}\), \(\mu =100/N\). In TRish, the sample size S of the stochastic gradient estimates is \(\lceil 0.01N\rceil \), which corresponds to the first sample size used in SIRTR. We run each algorithm for ten epochs on the datasets a9a and htru2 using the null initial guess. We perform 10 runs to report results on average.

After tuning, the parameter setting for TRish was \( \gamma _1 \approx 34.5805\), \( \gamma _2 \approx 4.3226\) for a9a, \( \gamma _1 \approx 57.9622\), \( \gamma _2 \approx 7.2453\) for htru2, \( \gamma _1 \approx 23.4376\), \( \gamma _2 \approx 2.9297\) for mnist, and \(\gamma _1 \approx 50.6409\), \( \gamma _2 \approx 6.3301\) for phishing. In Fig. 3, we report the decrease of the (average) classification error, training loss \(f_N\) and testing loss, \(f_{N_T}(x)= \frac{1}{N_T}\sum _{i\in I_{N_T}} \phi _i(x)\), over the (average) number of full function and gradient evaluations required by the algorithms. From these plots, we can see that SIRTR performs comparably to the best implementations of TRish on a9a, htru2, mnist, while showing a good, though not optimal, performance on phishing.

In accordance with the experience in [33], the parameters \(\gamma _1\), \(\gamma _2\) and \(\alpha \) are all problem-dependent. For instance, the best performance of TRish is obtained with \(\alpha =10^{-1}\) for a9a and with \(\alpha =10^{-3}\) for htru2; by contrast, SIRTR performs well with a single parameter setting across all datasets, which is the key feature of adaptive stochastic optimization methods.

As a second test, we compare the performance of SIRTR and TRish on a different nonconvex optimization problem arising from nonlinear regression. Letting \(\{(a_i,b_i)\}_{i=1}^N\) denote the training set, where \(a_i\in \mathbb {R}^n\) and \(b_i\in \mathbb {R}\) represent the feature vector and the target variable of the i-th example, respectively, we aim at solving the following problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}f_N(x)=\frac{1}{N}\sum _{i=1}^N \left( b_i-h(a_i;x) \right) ^2, \end{aligned}$$
(72)

where \(h(\cdot ;x):\mathbb {R}^n\rightarrow \mathbb {R}\) is a nonlinear prediction function.

For this second test, we use the air dataset [29], which contains 9358 instances of (hourly averaged) concentrations of polluting gases, as well as temperatures and relative/absolute air humidity levels, recorded at each hour in the period March 2004 - February 2005 from a device located in a polluted area within an Italian city.

As in [34], our goal is to predict the benzene (C6H6) concentration from the knowledge of \(n=7\) features, including carbon monoxide (CO), nitrogen oxides (NO\(_x\)), ozone (O\(_3\)), non-methane hydrocarbons (NMHC), nitrogen dioxide (NO\(_2\)), air temperature, and relative air humidity. First, we preprocess the dataset by removing examples for which the benzene concentration is missing, reducing the number of examples from 9357 to 8991. Then, we employ \(70\%\) of the dataset for training (\(N=6294\)), and the remaining \(30\%\) for testing (\(N_T=2697\)). Since the concentration values have been recorded hourly, this means that we use the data measured in the first 9 months for the training phase, and the data related to the last 3 months for the testing phase. Finally, denoting by \(D=(d_{ij})\in \mathbb {R}^{(N+N_T)\times n}\) the matrix containing all the dataset examples along its rows, and setting

$$\begin{aligned} {\left\{ \begin{array}{ll} m_j = \min \limits _{i=1,\ldots ,N+N_{T}} {d_{ij}}, \\ M_j = \max \limits _{i=1,\ldots ,N+N_{T}} {d_{ij}} \end{array}\right. }, \quad j =1,\ldots ,n, \end{aligned}$$

we scale all data values into the interval [0, 1] as follows

$$\begin{aligned} d_{ij} = \frac{d_{ij}-m_{j}}{M_j-m_j}, \quad i=1,\ldots , N+N_T, \ j=1,\ldots ,n. \end{aligned}$$
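The preprocessing just described (chronological split and column-wise scaling) can be sketched as follows. The function names are hypothetical, and the handling of a constant column (mapped to 0) is an assumption that the text does not address.

```python
def min_max_scale(D):
    """Scale every column of D (a list of rows) into [0, 1]."""
    columns = list(zip(*D))
    m = [min(col) for col in columns]   # m_j: column minima
    M = [max(col) for col in columns]   # M_j: column maxima
    # Assumption: a constant column (M_j == m_j) is mapped to 0.
    return [
        [(d - mj) / (Mj - mj) if Mj > mj else 0.0
         for d, mj, Mj in zip(row, m, M)]
        for row in D
    ]


def chronological_split(rows, train_fraction=0.7):
    """Use the first train_fraction of the time-ordered rows for training."""
    n_train = round(train_fraction * len(rows))
    return rows[:n_train], rows[n_train:]
```

Applied to 8991 time-ordered examples, `chronological_split` returns 6294 training and 2697 testing rows, matching the values of \(N\) and \(N_T\) above. As in the text, the minima and maxima are taken over all \(N+N_T\) rows.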

We apply SIRTR and TRish to problem (72), where the prediction function \(h(\cdot ;x)\) is chosen as a feed-forward neural network based on a \(7\times 5 \times 1\) architecture (see [34] and references therein), with the two hidden layers both equipped with the linear activation function, and the output layer with the sigmoid activation function. We equip the two algorithms with the same parameter values employed in the previous tests, and run them 10 times for 10 epochs, using a random initial guess in the interval \([-\frac{1}{2},\frac{1}{2}]\).
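For illustration, a prediction function of this kind and the loss in (72) can be sketched as below. The layout is an assumption, since the text does not fix it: we read \(7\times 5\times 1\) as 7 inputs, one layer of 5 units with linear (identity) activation, and a single sigmoid output, with bias terms included. Under this reading the parameter vector has 46 entries. The names `h`, `loss`, `unpack` and `random_initial_guess` are hypothetical, and the exact network of [34] may differ.

```python
import math
import random

N_IN, N_HID = 7, 5
N_PARAMS = N_IN * N_HID + N_HID + N_HID + 1  # 46 under the assumed layout


def unpack(x):
    """Split the flat parameter vector into (W1, b1, w2, b2)."""
    W1 = [x[r * N_IN:(r + 1) * N_IN] for r in range(N_HID)]
    i = N_IN * N_HID
    b1 = x[i:i + N_HID]
    w2 = x[i + N_HID:i + 2 * N_HID]
    b2 = x[i + 2 * N_HID]
    return W1, b1, w2, b2


def h(a, x):
    """Prediction for feature vector a: linear layer, then sigmoid output."""
    W1, b1, w2, b2 = unpack(x)
    hidden = [sum(w * ai for w, ai in zip(row, a)) + b
              for row, b in zip(W1, b1)]        # identity activation
    z = sum(w * hj for w, hj in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))           # sigmoid output in (0, 1)


def loss(A, b, x):
    """Mean squared error (1/N) * sum_i (b_i - h(a_i; x))^2, as in (72)."""
    return sum((bi - h(ai, x)) ** 2 for ai, bi in zip(A, b)) / len(b)


def random_initial_guess(seed=0):
    """Parameters drawn uniformly from [-1/2, 1/2]."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(N_PARAMS)]
```

The sigmoid output takes values in (0, 1), which is compatible with targets scaled into [0, 1].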

In Fig. 4, we report the decrease of the (average) training and testing losses provided by SIRTR and by TRish with different choices of the steplength \(\alpha \), whereas in Fig. 5 we show the benzene concentration estimations provided by the algorithms against the true concentration. These results confirm that the performance of SIRTR is comparable with that of TRish equipped with the best choice of the steplength, and show the ability of SIRTR to automatically tune the steplength so as to obtain satisfactory testing and training accuracy.

Fig. 4

Dataset air. Average testing loss (left) and training loss (right) versus epochs

Fig. 5

Dataset air. Estimated concentrations during 10 days (240 hours) compared to the true concentration (black solid line)

5 Conclusions

We proposed a stochastic gradient method coupled with a trust-region strategy and an inexact restoration approach for solving finite-sum minimization problems. Functions and gradients are subsampled, and the batch size is governed by the inexact restoration approach and the trust-region acceptance rule. We established the theoretical properties of the method and gave a worst-case complexity result on the expected number of iterations required to reach an approximate first-order optimality point. Numerical experience showed that the proposed method provides good results while keeping the overall computational cost relatively low.