Abstract
We propose a stochastic first-order trust-region method with inexact function and gradient evaluations for solving finite-sum minimization problems. Using a suitable reformulation of the given problem, our method combines the inexact restoration approach for constrained optimization with the trust-region procedure and random models. Differently from other recent stochastic trust-region schemes, our proposed algorithm improves feasibility and optimality in a modular way. We provide the expected number of iterations for reaching a near-stationary point by imposing some probabilistic accuracy requirements on random functions and gradients which are, in general, less stringent than the corresponding ones in the literature. We validate the proposed algorithm on some nonconvex optimization problems arising in binary classification and regression, showing that it performs well in terms of cost and accuracy and allows one to reduce the burdensome tuning of the hyperparameters involved.
1 Introduction
In this paper we consider the finite-sum minimization problem
where N is very large and finite and \(\phi _i: \mathbb {R}^n\rightarrow \mathbb {R}\), \(1\le i\le N\), are continuously differentiable. A number of important problems can be stated in this form, e.g., classification problems in machine learning, data fitting problems, sample average approximations of an objective function given in the form of mathematical expectation. In recent years the need for efficient methods for solving (1) resulted in a large body of literature and a number of methods have been proposed and analyzed, see e.g., the reviews [1,2,3].
It is common to employ subsampled approximations of the objective function and its derivatives with the aim of reducing the computational cost. Focusing on first-order methods, the stochastic gradient method [4] and more recent variants such as SVRG [5, 6], SAG [7], ADAM [8] and SARAH [9] are widely used for their simplicity and low cost per iteration. They do not call for function evaluations but require tuning the learning rate and possibly further hyperparameters such as the mini-batch size. Since the tuning effort may be very computationally demanding [10], more sophisticated approaches use stochastic line-search or trust-region strategies to choose the learning rate adaptively, see [1, 10,11,12,13,14,15]. In this context, function and gradient approximations have to satisfy sufficient accuracy requirements with some probability. This, in turn, in the case of approximations via sampling, requires adaptive choices of the sample sizes used.
In a further stream of works, problem (1) is reformulated as a constrained optimization problem and the sample size is computed deterministically using the Inexact Restoration (IR) approach. The IR approach has been successfully combined with either the line-search strategy [16] or the trust-region strategy [17,18,19]; in these papers, function and gradient estimates are built with gradually increasing accuracy and averaging on the same sample.
We propose a novel trust-region method with random models based on the IR methodology. In our proposed method, feasibility and optimality are improved in a modular way, and the resulting procedure differs from the existing stochastic trust-region schemes [13, 14, 20,21,22] in the acceptance rule for the step. We provide a theoretical analysis and give a bound on the expected iteration complexity to satisfy an approximate first-order optimality condition; this calls for accuracy conditions on random gradients that are assumed to hold with some sufficiently large but fixed probability and are, in general, less stringent than the corresponding ones in [13, 14, 20,21,22]. Our theoretical analysis improves over the one for the stochastic trust-region method with inexact restoration given in [19], since we no longer rely on standard theory for deterministic unconstrained optimization, invoked eventually when functions and gradients are computed exactly.
The paper is organized as follows. In Sect. 2 we give an overview of random models employed in the trust-region framework and introduce the main features of our contribution. The new algorithm is proposed in Sect. 3 and analyzed with respect to its iteration complexity. Extensive numerical results are presented in Sect. 4.
2 Trust-region method with random models
Variants of the standard trust-region method based on the use of random models have been presented, to our knowledge, in [13, 14, 19,20,21,22,23]. They consist in the adaptation of the trust-region framework to the case where random estimates of the derivatives are introduced and function values are either computed exactly [20] or replaced by stochastic estimates [13, 14, 19, 21,22,23].
The computation and acceptance of the iterates parallel the standard trust-region mechanism, and the success of the procedure relies on function values and models being sufficiently accurate with fixed and large enough probability. The accuracy requirements in the mentioned works show many similarities; here we illustrate some issues related to the works [13, 14, 22], which are closest to our approach.
Let \(\Vert \cdot \Vert \) denote the 2-norm throughout the paper. At iteration k of a first-order stochastic trust-region method, given \(x_k\), the positive trust-region radius \(\delta _k\) and a random approximation \(g_k\) of \(\nabla f_N(x_k)\), let us consider the model
for \(f_N\) on \(B(x_k, \delta _k)=\{x\in {\mathbb {R}}^n: \Vert x-x_k\Vert \le \delta _k\}\) and the trust-region problem \(\min _{\Vert { p}\Vert \le \delta _k} \varsigma _k(x_k+ {p})\). Thus, the trust-region step takes the form \( { p}_k=-\delta _k g_k/\Vert g_k\Vert \).
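Since the model is linear, the trust-region subproblem has the closed-form solution above. A minimal Python sketch, with no external dependencies (the function name is ours):

```python
import math

def tr_step(g, delta):
    """Minimizer of a linear model over the ball ||p|| <= delta:
    a step of full length delta along the negative (estimated) gradient."""
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    if norm_g == 0.0:
        return [0.0] * len(g)          # x_k is stationary for the model
    return [-delta * gi / norm_g for gi in g]

p = tr_step([3.0, 4.0], delta=0.5)     # ||g|| = 5
assert all(abs(pi - ei) < 1e-12 for pi, ei in zip(p, [-0.3, -0.4]))
```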
Two estimates \(f^{k,0}\) and \(f^{k, {p}}\) of \(f_N\) at \(x_k\) and \(x_k+ {p}_k\), respectively, are employed to either accept or reject the trial point \(x_k+ { p}_k\). The classical ratio between the actual and predicted reduction is replaced by
and a successful iteration is declared when \(\rho _k \ge \eta _1\) and \( \Vert g_k\Vert \ge \eta _2 \delta _k\) for some constants \( \eta _1 \in (0,1) \) and positive, possibly large, \( \eta _2 \). Note that both the step \( { p}_k\) and the denominator in (2) are independent of \(f_N(x_k)\). Furthermore, note that a successful iteration might not yield an actual reduction in \(f_N\), because the quantities involved in \(\rho _k\) are random approximations to the true value of the objective function.
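In code, the acceptance test combines the ratio (2), computed from the two random estimates, with the radius condition. A sketch under the assumption that the predicted reduction of the linear model is \(\delta _k\Vert g_k\Vert \) (names and default constants are ours):

```python
def is_successful(f_k0, f_kp, g_norm, delta, eta1=0.1, eta2=10.0):
    """Acceptance test: ratio (2) >= eta1 and ||g_k|| >= eta2 * delta_k."""
    pred = delta * g_norm              # model decrease of the linear model
    rho = (f_k0 - f_kp) / pred         # actual/predicted reduction, both estimated
    return rho >= eta1 and g_norm >= eta2 * delta

# accepted: large estimated decrease and small enough radius
assert is_successful(1.0, 0.9, g_norm=1.0, delta=0.05)
# rejected: radius too large relative to the gradient norm
assert not is_successful(1.0, 0.9, g_norm=1.0, delta=0.2)
```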
The condition \( \Vert g_k\Vert \ge \eta _2 \delta _k\) is not typical of deterministic trust-region methods for smooth optimization and reflects the fact that \(\delta _k\) controls the accuracy of functions and gradients. Specifically, the models used are required to be sufficiently accurate with some probability. The model \(\varsigma _k\) is supposed to be, \( {\pi }_M\)-probabilistically, a \(\kappa _*\)-fully linear model of \(f_N\) on the ball \(B(x_k, \delta _k)\), i.e., the requirement
with \(\kappa _*>0\), has to be fulfilled at least with probability \( {\pi }_M \in (0,1)\). Moreover, the estimates \(f^{k,0}\) and \(f^{k, {p}}\) are supposed to be \( {\pi }_f\)-probabilistically \(\epsilon _F\)-accurate estimates of \(f_N(x_k)\) and \(f_N(x_k+ {p}_k)\), i.e., the requirement
has to be fulfilled at least with probability \( {\pi }_f\in (0,1)\). Clearly, if \(f_N\) is computed exactly then condition (4) is trivially satisfied.
Convergence analysis in [13, 14, 22] shows that, for \( {\pi }_M \) and \( {\pi }_f \) sufficiently large, it holds that \(\lim _{k\rightarrow \infty } \delta _k=0\) almost surely. Moreover, if \(f_N\) is bounded from below and \(\nabla f_N\) is Lipschitz continuous, then \(\lim _{k\rightarrow \infty } \Vert \nabla f_N(x_k)\Vert =0\) almost surely. Interestingly, the accuracy in (3) and (4) increases as the trust-region radius gets smaller, but the probabilities \( {\pi }_M\) and \( {\pi }_f\) are fixed.
For problem (1) it is straightforward to build approximations of \(f_N\) and \(\nabla f_N\) by sample average approximations
where \(I_M\) and \(I_S\) are subsets of \(\{1, \ldots , N\}\) of cardinality \(|I_M|=M\) and \(|I_S|=S\), respectively. The choice of sample sizes such that (3) and (4) hold in probability is discussed in [14, §5] as follows. Let \({\mathbb {E}}_{\omega }[|\phi _{\omega }(x )-f_N(x)|^2]\le V_f\), \({\mathbb {E}}_{\omega }[\Vert \nabla \phi _{\omega }(x)-\nabla f_N(x)\Vert ^2]\le V_g\), \( \forall x\in \mathbb {R}^n\), with \({\mathbb {E}}_{\omega }\) being the expected value with respect to the random index \(\omega \) employed for sampling, and assume
Then \(f^{k,0}\) and \(f^{k,p}\) built as in (5) with sample size M satisfy (4) with probability \(p_f\), while \(g_k\) built as in (5) with sample size S satisfies \(\Vert \nabla f_N(x_k)-g_k\Vert \le \kappa _* \delta _k\) with probability \(p_g\). Furthermore, using a Taylor expansion and the Lipschitz continuity of \(\nabla f_N\), it can be proved that (3) is met with probability \( {\pi }_M= p_f\, p_g\); consequently, a \(\kappa _*\)-fully linear model of \(f_N\) in \(B(x_k, \delta _k)\) is obtained.
In principle, conditions (3), (4) and \(\lim _{k\rightarrow \infty } \delta _k=0\) imply that \(f^{k,0}\), \(f^{k,p}\) and \(g_k\) will eventually be computed at full precision for k sufficiently large. On the other hand, in applications such as machine learning, reaching full precision is unlikely, since N is very large and termination is based on the maximum allowed computational effort or on the validation error.
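A conservative way to turn the variance bounds \(V_f\), \(V_g\) into sample sizes is Chebyshev's inequality for the sample mean: \(M \ge V_f/((1-p_f)\,\epsilon _F^2)\) guarantees (4) with probability at least \(p_f\). The following is only a sketch of this mechanism; the precise condition (6) from [14] may differ by constants:

```python
import math

def chebyshev_sample_size(variance_bound, accuracy, prob):
    """Smallest M with variance_bound / (M * accuracy**2) <= 1 - prob,
    so Chebyshev gives |sample mean - true mean| <= accuracy w.p. >= prob."""
    return max(1, math.ceil(variance_bound / ((1.0 - prob) * accuracy ** 2)))

# e.g. V_f = 1, accuracy eps_F = 0.1, probability p_f = 0.9
assert chebyshev_sample_size(1.0, 0.1, 0.9) == 1000
```

Note how the required sample size grows like \(1/\epsilon _F^2\), which is why full precision is rarely reached in practice.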
2.1 Our contribution
We propose a trust-region procedure with random models based on (5) and combine it with the inexact restoration (IR) method for constrained optimization [24]. To this end, we make a simple transformation of (1) into a constrained problem. Specifically, letting \(I_M\) be an arbitrary nonempty subset of \(\{1, \ldots , N\}\) of cardinality \(|I_M|=M\), we reformulate problem (1) as
Using the IR strategy allows us to improve feasibility and optimality in a modular way and gives rise to a procedure that differs from the existing trust-region schemes in the following respects. First, at each iteration a reference sample size is fixed and used as a guess for the approximation of function values. Second, the acceptance rule for the step is based on the condition \(\Vert g_k\Vert \ge \eta _2 \delta _k\), for some \(\eta _2>0\), and on a sufficient decrease condition on a merit function that measures both the reduction of the objective function and the improvement in feasibility. Finally, the expected iteration complexity to satisfy an approximate first-order optimality condition is given, provided that, at each iteration k, the gradient estimates satisfy accuracy requirements of order \({{{\mathcal {O}}}}\left( \delta _k\right) \); such accuracy requirements implicitly govern function approximations and are, in general, less stringent than the corresponding ones in [13, 14, 20,21,22], as carefully detailed in Sect. 3.
Our theoretical analysis improves over the analysis carried out in [19] for a similar stochastic trust-region method coupled with inexact restoration, since here we do not rely on the occurrence of full precision, \(M=N\) in (7), reached eventually, and do not apply standard theory for unconstrained optimization. In fact, the expected number of iterations until a prescribed accuracy is reached is provided without invoking full precision.
3 The algorithm
In this section we introduce our new algorithm referred to as SIRTR (Stochastic Inexact Restoration Trust Region).
First, we introduce some basic ingredients of IR methods. The level of infeasibility with respect to the constraint \(M=N\) in (7) is measured by the following function h.
Assumption 1
Let \(h:\{1,2,\ldots ,N\}\rightarrow \mathbb {R}\) be a monotonically decreasing function such that \(h(1)>0\), \(h(N)=0\).
This assumption implies that there exist some positive \({\underline{h}}\) and \({\overline{h}}\) such that
One possible choice is \(h(M)=(N-M)/N, \ 1\le M\le N\).
The IR methods improve feasibility and optimality in a modular way, using a merit function to balance the progress. Since the reductions in the objective function and in the infeasibility might be achieved to a different degree, the IR method employs the merit function
with \(\theta \in (0,1). \)
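With the choice \(h(M)=(N-M)/N\), the merit function can be sketched as follows; we assume here that (9) has the usual IR convex-combination form \(\Psi (x,M;\theta )=\theta f_M(x)+(1-\theta )h(M)\):

```python
def h(M, N):
    # infeasibility measure: decreasing in M, h(N) = 0
    return (N - M) / N

def merit(f_M_x, M, N, theta):
    # convex combination of objective estimate and infeasibility
    # (assumed form of the merit function (9))
    return theta * f_M_x + (1.0 - theta) * h(M, N)

assert h(100, 100) == 0.0                  # feasible: full sample
assert merit(2.0, 100, 100, 0.5) == 1.0    # only the objective term survives
```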
Our SIRTR algorithm is a trust-region method that employs first-order random models. At a generic iteration k, we fix a trial sample size \(N_{k+1}^t\) and build a linear model \(m_k(p)\) around \(x_k\) of the form
where \(g_k\) is a random estimator of \(\nabla f_N(x_k)\). Then, we consider the trust-region problem
whose solution is
As in standard trust-region methods, we distinguish between successful and unsuccessful iterations. However, we do not employ here the classical acceptance condition, but a more elaborate one that involves the merit function (9).
The proposed method is sketched in Algorithm 3.1 and its steps are now discussed. At a generic iteration k, we have at hand the outcome of the previous iteration: the iterate \(x_k\), the sample sizes \( N_k\) and \({\widetilde{N}}_{k}\), the penalty parameter \( \theta _k \), and the flag iflag. If iflag=succ, the previous iteration was successful, i.e., \(x_{k}=x_{k-1}+p_{k-1}\); if iflag=unsucc, the previous iteration was unsuccessful, i.e., \(x_k=x_{k-1}\).
The scheduling procedure for generating the trial sample size \(N_{k+1}^t\) consists of Steps 1 and 2 of SIRTR. At Step 1, we determine a reference sample size \({\widetilde{N}}_{k+1}\le N\). If iflag=succ, then the infeasibility measure h is sufficiently decreased, as stated in (20). If iflag=unsucc, \({\widetilde{N}}_{k+1}\) is left unchanged from the previous iteration, i.e., \({\widetilde{N}}_{k+1}={\widetilde{N}}_k\). We remark that (20) trivially implies \({\widetilde{N}}_{k+1}=N\) if \(N_k=N\), and that it holds at each iteration, even when it is not explicitly enforced at Step 1 (see the forthcoming Lemma 1). In principle, \({\widetilde{N}}_{k+1}\) could serve as the trial sample size, but we aim at giving more freedom to the sample size selection process. Thus, at Step 2, we choose a trial sample size \( N_{k+1}^t\) complying with condition (21). On the one hand, such a condition allows the choice \( N_{k+1}^t< {\widetilde{N}}_{k+1}\) in order to reduce the computational effort; on the other hand, the choice \( N_{k+1}^t\ge {\widetilde{N}}_{k+1}\) is also possible in order to satisfy specific accuracy requirements that will be specified later. When \( N_{k+1}^t< {\widetilde{N}}_{k+1}\), condition (21) bounds the largest possible distance between \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) in terms of \( \delta _k \); in the case \( N_{k+1}^t\ge {\widetilde{N}}_{k+1}\), (21) is trivially satisfied.
At Step 3 we form the linear random model (10) and compute its minimizer within the trust region. Specifically, we fix the cardinality \(N_{k+1,g}\) and choose a set of indices \(I_{N_{k+1,g}}\subseteq \{1, \ldots , N\}\) of cardinality \(N_{k+1,g}\). Then, we compute the estimator \(g_k\) of \(\nabla f_{N}(x_k)\) as
and the solution \(p_k\) in (12) of the trust-region subproblem (11). Further, we compute \(m_k(p_k)\), where \(m_k\) is defined in (10) and
with \(I_{N_{k+1}^t}\subseteq \{1, \ldots , N\}\) being a set of cardinality \(N_{k+1}^t\).
At Step 4 we compute the new penalty term \( \theta _{k+1}. \) The computation relies on the predicted reduction defined as
where \(\theta \in (0,1)\). This predicted reduction is a convex combination of the usual predicted reduction \(f_{N_k}(x_k)-m_k(p_k) \) in trust-region methods and the predicted reduction \( h(N_k)-h({\widetilde{N}}_{k+1}) \) in infeasibility obtained at Step 1. The new parameter \( \theta _{k+1} \) is computed so that
If (16) is satisfied at \(\theta =\theta _k\) then \(\theta _{k+1}=\theta _k\), otherwise \( \theta _{k+1} \) is computed as the largest value for which the above inequality holds (see forthcoming Lemma 2).
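Since \(\mathrm{{Pred}}_k(\theta )\) is affine in \(\theta \), the largest \(\theta \) satisfying (16) is available in closed form. A sketch with \(df = f_{N_k}(x_k)-m_k(p_k)\) and \(dh = h(N_k)-h({\widetilde{N}}_{k+1})\) (variable names are ours):

```python
def update_theta(theta_k, df, dh, eta1=0.1):
    """theta_{k+1}: keep theta_k if Pred(theta_k) >= eta1*dh; otherwise the
    largest theta with theta*df + (1-theta)*dh >= eta1*dh (then df < dh)."""
    if theta_k * df + (1.0 - theta_k) * dh >= eta1 * dh:
        return theta_k
    return (1.0 - eta1) * dh / (dh - df)

theta_next = update_theta(0.9, df=-1.0, dh=1.0)
assert abs(theta_next - 0.45) < 1e-12
# the new theta satisfies (16) with equality
assert abs(theta_next * (-1.0) + (1.0 - theta_next) * 1.0 - 0.1 * 1.0) < 1e-12
```

If the condition already holds at \(\theta _k\) (e.g. both reductions are positive), the parameter is left unchanged, matching the text above.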
Step 5 establishes if the iteration is successful or not. To this end, given a point \( {\hat{x}} \) and \(\theta \in (0,1) \), the actual reduction of \(\Psi \) at the point \( {\hat{x}}\) has the form
and the iteration is successful whenever the following two conditions are both satisfied
Otherwise, the iteration is declared unsuccessful. If the iteration is successful, we accept the step and the trial sample size, set iflag=succ and possibly increase the trust-region radius through (23); the upper bound \( \delta _{\max } \) on the trust-region radius is imposed in (23). In case of unsuccessful iterations, we reject both the step and the trial sample size, set iflag=unsucc and decrease the trust-region radius.
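The radius update of Step 5 is the standard trust-region one; a sketch in which the factor \(\gamma >1\) and the cap \(\delta _{\max }\) play the roles they have in (23)–(24) (the numeric defaults are ours):

```python
def update_radius(delta, success, gamma=2.0, delta_max=10.0):
    # successful: enlarge, capped at delta_max; unsuccessful: shrink
    return min(gamma * delta, delta_max) if success else delta / gamma

assert update_radius(1.0, True) == 2.0
assert update_radius(8.0, True) == 10.0   # capped at delta_max
assert update_radius(1.0, False) == 0.5
```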
Concerning conditions (18) and (19), we observe that the former mimics the classical acceptance criterion of standard trust-region methods, while the latter drives \(\delta _k\) to zero as \(\Vert g_k\Vert \) tends to zero.
We conclude the description of Algorithm 3.1 showing that condition (20) holds for all iterations, even when it is not explicitly enforced at Step 1.
Lemma 1
Let Assumption 1 hold and let \(r\in (0,1)\) be the scalar in Algorithm 3.1. The sample sizes \({\widetilde{N}}_{k+1}\le N\) and \(N_k\le N\) generated by Algorithm 3.1 satisfy
Proof
We observe that, by Assumption 1, (25) trivially holds whenever \(N_k={\widetilde{N}}_{k+1}=N\).
Otherwise, we proceed by induction. The thesis trivially holds for \(k=0\), as we set iflag=succ at the first iteration and enforce (25) at Step 1. Now consider a generic iteration \({\bar{k}}\ge 1\) and suppose that (25) holds for \({\bar{k}}-1\). If iteration \({\bar{k}}-1\) is successful, then condition (25) is enforced for iteration \(\bar{k}\) at Step 1.
If iteration \(\bar{k}-1\) is unsuccessful, then at Step 5 we set \(N_{{\bar{k}}}=N_{{\bar{k}}-1}\). Subsequently, at Step 1 of iteration \(\bar{k}\) we set \({\widetilde{N}}_{{\bar{k}}+1}={\widetilde{N}}_{{\bar{k}}}\). Since (25) holds by induction at iteration \({\bar{k}}-1\), we have \(h({\widetilde{N}}_{{\bar{k}}})\le r h(N_{{\bar{k}}-1})\), which can be rewritten as \(h({\widetilde{N}}_{{\bar{k}}+1})\le r h(N_{{\bar{k}}})\) due to the previous assignments at Step 5 and Step 1. Then condition (25) holds also at iteration \({\bar{k}}\). \(\square \)
3.1 On the sequences \(\{\theta _k\}\) and \( \{\delta _k\}\)
In this section, we analyze the properties of Algorithm 3.1. In particular, we prove that the sequence \(\{\theta _k\}\) is nonincreasing and uniformly bounded from below, and that the trust-region radius \(\delta _k\) tends to 0 as \(k\rightarrow \infty \). We make the following assumption.
Assumption 2
The functions \(\phi _i\) are continuously differentiable for \(i=1,\ldots ,N\). There exists \(f_{low}\in \mathbb {R}\) such that
Furthermore, there exist \(\Omega \subset {\mathbb {R}}^n\) and \(f_{up}\in \mathbb {R}\) such that
and all iterates generated by Algorithm 3.1 belong to \(\Omega \).
In the following, we let
Remark 1
In the context of machine learning, the above assumption holds in several cases, e.g., for the mean-squares loss function coupled with either the sigmoid, the softmax or the hyperbolic tangent activation function; for the mean-squares loss function coupled with ReLU or ELU activation functions and bound constraints (above and below) on all variables; and for the logistic loss function, again coupled with bound constraints (above and below) on the unknowns [25].
In the analysis that follows we will consider two options for \( {\hat{x}}\) in (17), \({\hat{x}} = x_k + p_k \) for successful iterations and \( {\hat{x}} = x_k \) for unsuccessful iterations.
Our first result characterizes the sequence \(\{\theta _k\}\) of the penalty parameters; the proof follows closely [19, Lemma 2.2].
Lemma 2
Let Assumptions 1 and 2 hold. Then the sequence \( \{\theta _k\} \) is positive, nonincreasing and bounded from below, \(\theta _{k+1}\ge {\underline{\theta }}>0\) with \({\underline{\theta }}\) independent of k, and (16) holds with \(\theta =\theta _{k+1}\).
Proof
We note that \(\theta _0>0\) and proceed by induction assuming that \(\theta _k\) is positive. Due to Lemma 1, for all iterations k we have that \(N_k\le {\widetilde{N}}_{k+1}\) and that \(N_k={\widetilde{N}}_{k+1}\) if and only if \(N_k=N\). First consider the case where \(N_k={\widetilde{N}}_{k+1}\) (or equivalently \(N_k={\widetilde{N}}_{k+1}=N\)); then it holds \(h(N_k)-h({\widetilde{N}}_{k+1})=0\), and \(N_{k+1}^t=N\) by Step 2. Therefore, we have \(\mathrm{{Pred}}_k(\theta )=\theta \delta _k \Vert g_k\Vert >0\) for any positive \(\theta \), and (22) implies \(\theta _{k+1}=\theta _k \). Let us now consider the case \(N_k<{\widetilde{N}}_{k+1}\). If the inequality \(\mathrm{{Pred}}_k(\theta _k) \ge \eta _1(h(N_k)-h({\widetilde{N}}_{k+1}))\) holds, then (22) gives \(\theta _{k+1}=\theta _k\). Otherwise, we have
and since the right-hand side is negative by assumption, it follows
Consequently, \(\mathrm{{Pred}}_k(\theta )\ge \eta _1(h(N_k)-h({\widetilde{N}}_{k+1}))\) is satisfied if
i.e., if
Hence \(\theta _{k+1}\) is the largest value satisfying (16) and \( \theta _{k+1} < \theta _k. \)
Let us now prove that \( \theta _{k+1} \ge {\underline{\theta }}. \) Note that by (25) and (8)
Using (26)
and \(\theta _{k+1}\) in (22) satisfies
which completes the proof. \(\square \)
In the following, we derive bounds for the actual reduction \(\mathrm{{Ared}}_k(x_{k+1},\theta _{k+1})\) in case of successful iterations and distinguish the iteration indexes k as below:
Note that \({\mathcal {I}}_1,{\mathcal {I}}_2\) are disjoint and any iteration index k belongs to exactly one of these subsets. Moreover, (25) yields \({\widetilde{N}}_{k+1}=N_k= N_{k+1}^t=N\) for any \(k \in {\mathcal {I}}_2\).
Lemma 3
Let Assumptions 1 and 2 hold and suppose that iteration k is successful. If \(k \in {{{\mathcal {I}}}}_1\) then
Otherwise,
Proof
Since iteration k is successful, \(x_{k+1}=x_k+p_k\) and (18) hold. Suppose \(k \in {{{\mathcal {I}}}}_1\). By (18) and (16)
In virtue of Lemma 1 we have \( h(N_k) - h({\widetilde{N}}_{k+1}) \ge (1-r) h(N_k)\), hence we obtain
Dividing and multiplying the right-hand side above by \(\delta _{k}^2\), and applying the inequalities \( {\underline{h}}\le h(N_k)\) and \(\delta _k\le \delta _{\max }\), we get (31).
Suppose \(k \in {{{\mathcal {I}}}}_2\). Then \(N_k={\widetilde{N}}_{k+1}\) and by the definition of \(\mathrm{{Pred}}_k(\theta _{k+1})\) and Lemma 2, we have
and therefore (18), (19) and Lemma 2 yield (32). \(\square \)
Let us now define a Lyapunov-type function \( \Phi \) inspired by [14]. Assumption 1 implies that \( h(N_k) \) is bounded from above, while Assumption 2 implies that \( f_{N_k}(x) \) is bounded from below if \(x\in \Omega \). Thus, there exists a constant \( \Sigma \) such that
Definition 1
Let \( v \in (0,1) \) be a fixed constant. For all \(k\ge 0\), we define
where \(\Psi \) is the merit function given in (9) and \(\Sigma \) is given in (33).
The choice of \(v\in (0,1)\) in the above definition will be specified below. First, note that \(\phi _k\) is bounded below for all \(k\ge 0\),
Second, adding and subtracting suitable terms, by the definition (34) and for all \(k\ge 0\), we have
If the iteration k is successful, then using (33), the monotonicity of \(\{\theta _k\}_{k\in {\mathbb {N}}}\) proved in Lemma 2, and the fact that \(N_{k+1}=N_{k+1}^t\), the equality (36) yields
Otherwise, if the iteration k is unsuccessful, then \(x_{k+1}=x_k\), \(N_{k+1}=N_k\), and thus the first quantity on the right-hand side of equality (36) is zero. Hence, using again (33) and the monotonicity of \(\{\theta _k\}_{k\in {\mathbb {N}}}\), we obtain
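The decrease of \(\Phi \) on unsuccessful iterations can be checked numerically. We assume here that (34) has the form \(\Phi _k = v\,(\Psi (x_k,N_k;\theta _k)+\Sigma )+(1-v)\delta _k^2\) (a merit part shifted to be nonnegative plus a radius term):

```python
def phi(merit_value, sigma, delta, v=0.5):
    # Lyapunov function (34): shifted merit part plus a radius term
    return v * (merit_value + sigma) + (1.0 - v) * delta ** 2

# unsuccessful step: merit and sample size unchanged, delta -> delta / gamma
gamma, delta, v = 2.0, 1.0, 0.5
drop = phi(3.0, 1.0, delta / gamma, v) - phi(3.0, 1.0, delta, v)
# the drop is (1 - v) * (1 - gamma**2) / gamma**2 * delta**2 < 0
assert abs(drop - (1 - v) * (1 - gamma**2) / gamma**2 * delta**2) < 1e-12
assert drop < 0
```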
Now we provide bounds for the change of \(\Phi \) along subsequent iterations, again distinguishing the two cases \(k\in {{{\mathcal {I}}}}_1, {{{\mathcal {I}}}}_2\) stated in (29)–(30).
Lemma 4

(i)
If the iteration k is unsuccessful, then
$$\begin{aligned} {\phi _{k+1} - \phi _k \le \chi _1 \delta _k^2, \qquad \chi _1= (1-v)\frac{1-\gamma ^2}{\gamma ^2}.} \end{aligned}$$(39) 
(ii)
If the iteration k is successful and \(k\in {\mathcal {I}}_1\), then
If the iteration k is successful and \(k\in {\mathcal {I}}_2\), then
Proof
(i) If iteration k is unsuccessful, the updating rule (24) for \(\delta _{k+1}\) implies \(\delta _{k+1}= \delta _k/\gamma \). Thus, equation (38) directly yields (39).
(ii) If iteration k is successful, the updating rule (23) for \(\delta _{k+1}\) implies \(\delta _{k+1}\le \gamma \delta _k\). Thus combining (37) with Lemma 3 we obtain (40) and (41). \(\square \)
We are now ready to prove that a sufficient decrease condition holds for \(\Phi \) along subsequent iterations and that \(\delta _k\) tends to zero.
Theorem 1
Let Assumptions 1 and 2 hold. There exists \(\sigma >0\), depending on \(v\in (0,1)\) in (34), such that
Proof
In case of unsuccessful iterations, (39) provides a sufficient decrease \(\phi _{k+1}-\phi _k\) for any value of \(v\in (0,1)\). In case of successful iterations, \(\chi _2\) and \(\chi _3\) in (40) and (41) are both negative if
Therefore, if v is chosen as above and
then (39)–(41) imply (42) and the proof is completed. \(\square \)
Theorem 2
Let Assumptions 1 and 2 hold. Then the sequence \(\{\delta _k\}\) in Algorithm 3.1 satisfies
Proof
Under the stated conditions, Theorem 1 holds and, summing up (42) for \(j=0,1,\ldots ,k-1\), we obtain
Given that, by (35), \( \phi _k \) is bounded from below for all k, we conclude that \( \sum _{j=0}^{\infty } \delta _j^2<\infty \), and hence \( \lim _{j\rightarrow \infty } \delta _j= 0.\) \(\square \)
3.2 Complexity analysis
Algorithm 3.1 generates a random process since the function estimates in (14) and gradient estimates in (13) are random. All random quantities are denoted by capital letters, while the use of small letters is reserved for their realizations. In particular, the iterates \(X_k\), the trust region radius \(\Delta _k\), the steps \(P_k\), the function estimates \(F_{N_{k+1}^t}(X_k),F_{N_{k+1}^t}(X_k+P_k)\), the gradient estimates \(G_k,\nabla F_{N_{k+1}^t}(X_k)\), and the value \(\Phi _k\) of the function \(\Phi \) in (34) at iteration k are random variables, while \(x_k\), \(\delta _k\), \(p_k\), \(f_{N_{k+1}^t}(x_k),f_{N_{k+1}^t}(x_k+p_k)\), \(g_k,\nabla f_{N_{k+1}^t}(x_k)\), \(\phi _k\) are their realizations.
In this section, our aim is to derive a bound on the expected number of iterations performed by Algorithm 3.1 to reach a desired accuracy. We show that our algorithm fits into the stochastic framework given in [13, Section 2] and consequently derive an upper bound on the expected value of the hitting time \({{{\mathcal {K}}}}_{\epsilon }\) defined below.
Definition 2
Given \(\epsilon >0\), the hitting time \({{{\mathcal {K}}}}_{\epsilon }\) is the random variable
i.e., \({{{\mathcal {K}}}}_{\epsilon }\) is the first iteration such that \(\Vert \nabla f_N(X_k)\Vert \le \epsilon \).
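Operationally, the hitting time is just the first index at which the full-gradient norm falls below \(\epsilon \); a trivial sketch over a recorded sequence of gradient norms:

```python
def hitting_time(grad_norms, eps):
    """First iteration k with ||grad f_N(x_k)|| <= eps (Definition 2)."""
    for k, gn in enumerate(grad_norms):
        if gn <= eps:
            return k
    return None  # accuracy not reached on the recorded horizon

assert hitting_time([3.0, 1.2, 0.4, 0.9], eps=0.5) == 2
```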
Our analysis relies on the assumption that \(g_k\) and \(\nabla f_{N_{k+1}^t}(x_k)\) are probabilistically accurate estimators of the true gradient at \(x_k\), in the sense that the events
are true at least with conditional probability \(\pi _1\in (0,1) \) and \(\pi _2\in (0,1)\), respectively. Using the same terminology as [26, 27], we say that iteration k is true if both \({\mathcal {G}}_{k,1}\) and \({\mathcal {G}}_{k,2}\) are true. Furthermore, we introduce the two random variables
where \(\mathbbm {1}_A\) denotes the indicator function of an event A.
Finally, we need the following additional assumptions.
Assumption 3
The gradients \( \nabla \phi _i\) are Lipschitz continuous with constant \(L_i\). Let \({L=\frac{1}{2}\max _{1\le i\le N} L_i }\).
Under Assumptions 2 and 3, the norm of the gradient estimates \(\Vert g_k\Vert \) is bounded, as shown below.
Lemma 5
Let Assumptions 2 and 3 hold. Then there exists \(g_{\max }\) such that
where \(g_{\max }=\sqrt{8L\kappa _{\phi }}\) and \(\kappa _{\phi }\) is given in (26).
Proof
By Assumption 3, it easily follows that \(\nabla f_{N_{k+1,g}}\) is Lipschitz continuous on \({\mathbb {R}}^n\) with constant 2L. Then Assumption 2 and the descent lemma for continuously differentiable functions with Lipschitz continuous gradient [28, Proposition A.24] ensure that
Taking the minimum of the righthand side with respect to y, we can also write
The minimum of \(\zeta (y)\) is attained at the point \(\bar{y}=\displaystyle x-\frac{1}{2L}\nabla f_{N_{k+1,g}}(x)\) and, letting \(y=\bar{y}\) in the previous inequality, we get:
and equivalently \(\Vert \nabla f_{N_{k+1,g}}(x)\Vert ^2\le 4L(f_{N_{k+1,g}}(x)-f_{low})\) for all \(x\in \mathbb {R}^n\). Using again Assumption 2, we have \(f_{N_{k+1,g}}(x)-f_{low}\le |f_{N_{k+1,g}}(x)|+|f_{low}|\le 2\kappa _{\phi }\), and consequently
Setting \(x=x_k\) and \(g_{max}=\sqrt{8L\kappa _{\phi }}\) in the previous inequality, we get the result. \(\square \)
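The inequality used in this proof, \(\Vert \nabla f\Vert ^2\le 4L'(f-f_{low})\) for a function bounded below with \(L'\)-Lipschitz gradient, can be checked on a concrete instance: the one-dimensional logistic loss, for which \(L'=1/4\) and \(f_{low}=0\) (a hypothetical example of ours, not one of the test problems of the paper):

```python
import math

def f(x):                 # logistic loss: smooth, bounded below by 0
    return math.log1p(math.exp(-x))

def df(x):                # its derivative, which is 1/4-Lipschitz
    return -1.0 / (1.0 + math.exp(x))

Lip, f_low = 0.25, 0.0
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    assert df(x) ** 2 <= 4 * Lip * (f(x) - f_low) + 1e-12
```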
First, we analyze the occurrence of successful iterations and show that the availability of accurate gradients has an impact on the acceptance of the trial steps. The following lemma establishes that if the iteration k is true and \(\delta _k\) is smaller than a certain threshold, then the iteration is successful. The analysis is presented for a single realization of Algorithm 3.1 and specializes for k in the sets \( {{{\mathcal {I}}}}_1\), \( {{{\mathcal {I}}}}_2\).
Lemma 6
Let Assumptions 1–3 hold and suppose that iteration k is true.

(i)
If \(k \in {{{\mathcal {I}}}}_1\), then the iteration is successful whenever
$$\begin{aligned} \delta _k \le \min \left\{ \frac{\Vert g_k\Vert }{\eta _3} , \, \frac{\Vert g_k\Vert }{\eta _2} \right\} , \end{aligned}$$(49)where \(\eta _3=\frac{\delta _{\max }g_{\max }(\theta _0(2\nu +L)+(1-{\underline{\theta }})\mu )}{\eta _1(1-\eta _1)(1-r){\underline{h}}}\).

(ii)
If \(k \in {{{\mathcal {I}}}}_2\), then the iteration is successful whenever
$$\begin{aligned} { \delta _k \le \min \left\{ \frac{ (1-\eta _1)\Vert g_k\Vert }{2\nu +L} ,\, \frac{\Vert g_k\Vert }{\eta _2} \right\} .} \end{aligned}$$(50)
Proof
From Assumption 3, it follows that \(\nabla f_{N_{k+1}^t}\) is Lipschitz continuous with constant 2L. Then,
and, since \({{{\mathcal {G}}}}_{k,1}\) and \({{{\mathcal {G}}}}_{k,2}\) are both true, (45) and (46) yield
Now, let us analyze condition (18) for successful iterations.
(i) If \(k \in {{{\mathcal {I}}}}_1\), by (15), (17) and (16) we obtain
Using (52), (21) and \({{\underline{\theta }}}\le \theta _{k+1}\le \theta _0\), we also have
Note that the combination of (25), (8), (23) and Lemma 5, guarantees that
Then, from (53), (54), and (55), we have
Combining this result with (19), the proof is complete.
(ii) Using (15), (17), \(k \in {{{\mathcal {I}}}}_2\), we have
Using (52) we get
Combining the above inequality with (19), we have proved that the iteration is successful whenever (50) holds. \(\square \)
We can now guarantee that a successful iteration k occurs whenever k is true, the prefixed accuracy \(\epsilon \) in Definition 2 has not been achieved at k, and \(\delta _k\) is below a certain threshold depending on \(\epsilon \). Again, the result is stated for a single realization of the algorithm.
Lemma 7
Let Assumptions 1–3 hold. Suppose that \(\Vert \nabla f_N(x_k)\Vert > \epsilon \), for some \(\epsilon >0\), the iteration k is true, and
Then, iteration k is successful.
Proof
By \(\Vert \nabla f_N(x_k)\Vert >\epsilon \), the occurrence of \({{\mathcal {G}}_{k,1}}\) and (57), we have
and this yields \(\Vert g_k\Vert \ge \frac{\epsilon }{2}\). Then, Lemma 6 implies that iteration k is successful. \(\square \)
We now proceed similarly to [13, Section 2] and analyse the random process \(\{(\Phi _k,\Delta _k,W_k)\}_{k\in {\mathbb {N}}}\) generated by Algorithm 3.1, where \(\Phi _k\) is the random variable whose realization is given in (34) and \(W_k\) is the random variable defined as
Clearly, \(W_k\) takes values \(\pm 1\). We denote by \({\mathbb {P}}_{k-1}(\cdot )\) and \({\mathbb {E}}_{k-1}(\cdot )\) the probability and expected value conditioned on the \(\sigma \)-algebra generated by \(F_{N_{1}^t}(X_0),\ldots ,F_{N_{k}^t}(X_{k-1})\), \(\nabla F_{N_{1}^t}(X_0),\ldots ,\nabla F_{N_{k}^t}(X_{k-1})\), \(G_0,\ldots ,G_{k-1}\). Then, we can prove the following result.
Lemma 8
Let Assumptions 1–3 hold, v as in (43), \(\delta ^\dagger \) as in (57) and \({{{\mathcal {K}}}}_{\epsilon }\) as in Definition 2. Suppose there exists some \(j_{\max }\ge 0\) such that \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\), and \(\delta _0>\delta ^{\dagger }\). Assume that the estimators \(G_k\) and \(\nabla f_{N_{k+1}^t}(X_k)\) are conditionally independent random variables, and the events \({{{\mathcal {G}}}}_{k,1},{{{\mathcal {G}}}}_{k,2}\) occur with sufficiently high probability, i.e.,
Then,

(i)
there exists \(\lambda >0\) such that \(\Delta _k\le \delta _0e^{\lambda \cdot j_{\max }}\) for all \(k\ge 0\);

(ii)
there exists a constant \( \delta _{\epsilon }=\delta _0e^{\lambda \cdot j_{\epsilon }}\) for some \(j_{\epsilon }\le 0\) such that, for all \(k\ge 0\),
$$\begin{aligned} \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Delta _{k+1}\ge \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\min \{\Delta _ke^{\lambda W_{k+1}}, \delta _{\epsilon } \}, \end{aligned}$$(60)where \(W_{k+1}\) satisfies
$$\begin{aligned} {\mathbb {P}}_{k-1}(W_{k+1}=1)={\pi _3}, \quad {\mathbb {P}}_{k-1}(W_{k+1}=-1)=1-{\pi _3}; \end{aligned}$$(61) 
(iii)
there exists a nondecreasing function \(\ell :[0,\infty )\rightarrow (0,\infty )\) and a constant \(\Theta >0\) such that, for all \(k\ge 0\),
$$\begin{aligned} \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}{\mathbb {E}}_{k-1}[\Phi _{k+1}]\le \mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Phi _k-\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}\Theta \ell (\Delta _k). \end{aligned}$$(62)
Proof
The proof parallels that of [13, Lemma 7].
(i) Since \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\), we can set \(\lambda =\log (\gamma )>0\), and the thesis follows from Step 5 of Algorithm 3.1.
(ii) Let us set
and assume that \( \delta _{\epsilon }=\gamma ^{j_{\epsilon }}\delta _0\), for some integer \(j_{\epsilon }\le 0\); notice that we can always choose \(\xi \) sufficiently large so that this is true. As a consequence, \(\Delta _k= \gamma ^{i_k} \delta _\epsilon \) for some integer \(i_k\).
When \(\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}=0\), inequality (60) trivially holds. Otherwise, conditioning on \(\mathbbm {1}_{\{{{{\mathcal {K}}}}_{\epsilon }>k\}}=1\), we can prove that
Indeed, for any realization such that \(\delta _k>\delta _{\epsilon }\), we have \(\delta _k\ge \gamma \delta _{\epsilon }\) and because of Step 5, it follows that \(\delta _{k+1}\ge \delta _{\epsilon }\). Now let us consider a realization such that \(\delta _k\le \delta _{\epsilon }\). Since \({{{\mathcal {K}}}}_{\epsilon }>k\) and \(\delta _{\epsilon }\le \delta ^{\dagger }\), if \(I_kJ_k=1\) (i.e., k is true), then we can apply Lemma 7 and conclude that k is successful. Hence, by Step 5, we have \(\delta _{k+1}=\min \{\delta _{\max },\gamma \delta _k\}\). If \(I_kJ_k=0\), then we cannot guarantee that k is successful; however, again using Step 5, we can write \(\delta _{k+1}\ge \gamma ^{-1}\delta _k\). Combining these two cases, we get (64). If we observe that \(\delta _{\max }=\gamma ^{j_{\max }}\delta _0\ge { \gamma ^{j_{\epsilon }}\delta _{0}= \delta _{\epsilon }}\), and recall the definition of \(W_k\) in (58), then equation (64) easily yields (60). The probabilistic conditions (61) are a consequence of (59).
(iii) The thesis trivially follows from (42) with \(\ell (\Delta )=\Delta ^2\) and \(\Theta =\sigma \). \(\square \)
The previous lemma shows that the random process \(\{(\Phi _k,\Delta _k,W_k)\}_{k\in {\mathbb {N}}}\) complies with Assumption 2.1 of [13].
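Property (ii) of Lemma 8 says that, as long as \(\epsilon \)-stationarity has not been reached, the trust-region radius behaves at least like a capped geometric random walk that drifts upward whenever \(\pi _3>1/2\). A small Monte Carlo sketch of this recursion (Python for illustration; all parameter values are hypothetical, and the up/down moves stand in for true-and-successful versus possibly unsuccessful iterations):

```python
import random

def simulate_radius_walk(delta0=1.0, gamma=2.0, j_max=7, pi3=0.6,
                         n_steps=10_000, seed=0):
    """Monte Carlo sketch of the radius recursion behind (60)-(61):
    with probability pi3 the step mimics W_{k+1} = +1 (radius multiplied
    by gamma, capped at delta_max = gamma**j_max * delta0), otherwise
    W_{k+1} = -1 (radius divided by gamma).  Returns the trajectory."""
    rng = random.Random(seed)
    delta_max = gamma**j_max * delta0
    delta, path = delta0, [delta0]
    for _ in range(n_steps):
        if rng.random() < pi3:              # W_{k+1} = +1
            delta = min(delta_max, gamma * delta)
        else:                               # W_{k+1} = -1
            delta = delta / gamma
        path.append(delta)
    return path

path = simulate_radius_walk()
# every realization respects the cap delta_0 * e^{lambda * j_max} of item (i)
```

With \(\pi _3>1/2\) the walk spends most of its time near the cap, which is the mechanism exploited in [13] to bound the expected hitting time of \({{{\mathcal {K}}}}_{\epsilon }\).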
Theorem 3
Under the assumptions of Lemma 8, we have
where \(\xi \) is chosen as in (63) and \(\sigma \) is given in (44).
Proof
The claim follows directly from [13, Theorem 2]. \(\square \)
Remark 2
The requirement that (45) and (46) hold in probability is less stringent than the overall conditions (3) and (4). Analogously to the discussion in Sect. 2, if \({\mathbb {E}}_{\omega }[\Vert \nabla \phi _{\omega }(x)-\nabla f_N(x)\Vert ^2]\le V_g\), then the Chebyshev inequality guarantees that events (45) and (46) hold in probability when
Clearly, \(\min \{N_{k+1,g}, N_{k+1}^t\}={{{\mathcal {O}}}}(\delta _k^{-2})\) and in general these sample sizes are expected to grow more slowly than in (6).
Finally, the complexity theory presented here improves on [19], where the iteration complexity before reaching full precision \(M=N\) in (7) is estimated, and existing iteration complexity results for trust-region methods applied to (1) are then invoked.
4 Numerical experience
In this section, we evaluate the numerical performance of SIRTR on some nonconvex optimization problems arising in binary classification and regression.
All the numerical results have been obtained by running MATLAB R2019a on an Intel Core i7-4510U CPU at 2.00-2.60 GHz with 8 GB of RAM. For all our tests, we equip SIRTR with \(\delta _0=1\) as the initial trust-region radius, \(\delta _{\max }=100\), \(\gamma =2\), \(\eta =10^{-1}\), \(\eta _2=10^{-6}\). Concerning the inexact restoration phase, we borrow the implementation details from [19]. Specifically, the infeasibility measure h and the initial penalty parameter \(\theta _0\) are set as follows:
The updating rule for choosing \({\widetilde{N}}_{k+1}\) has the form
where \(1<{\widetilde{c}}<2\) is a prefixed constant factor; note that this choice of \({\widetilde{N}}_{k+1}\) satisfies (21) with \(r=(N({\widetilde{c}}-1))/N\). At Step 2 the function sample size \(N_{k+1}^t\) is computed using the rule
Once the set \(I_{N_{k+1}^t}\) is fixed, the search direction \(g_k\in {\mathbb {R}}^n\) is computed via sampling as in (13) and the sample size \(N_{k+1,g}\) is fixed as
with \(c\in (0,1]\) and \(I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}\).
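The sampling rules above can be summarized in a short sketch (Python for illustration; the paper's code is MATLAB). The displayed rule (67) linking \(N_{k+1}^t\) to \({\widetilde{N}}_{k+1}\) via \(\mu \) and \(\delta _k\) is not reproduced here, so the sketch uses its small-radius limit \(N_{k+1}^t={\widetilde{N}}_{k+1}\) as a stated placeholder:

```python
import math

def next_sample_sizes(N, N_k, c_tilde=1.05, c=0.1):
    """Sketch of the updates (66) and (68).  Rule (67), which may let
    N_{k+1}^t lag behind tilde-N_{k+1} while the trust-region radius is
    large, is replaced by its small-radius limit N_{k+1}^t = tilde-N_{k+1}
    (an assumption of this sketch, not the paper's formula)."""
    N_tilde = min(N, math.ceil(c_tilde * N_k))   # rule (66), 1 < c_tilde < 2
    N_t = N_tilde                                # small-radius limit of (67)
    N_g = math.ceil(c * N_t)                     # rule (68), c in (0, 1]
    return N_tilde, N_t, N_g

# hypothetical sizes: N = 10000 examples, current sample N_k = 1000
sizes = next_sample_sizes(10_000, 1_000)   # -> (1050, 1050, 105)
```

The choice \(c<1\) makes the gradient sample a strict subset of the function sample, which is what keeps the per-iteration cost low in the experiments below.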
4.1 SIRTR performance
In the following, we show the numerical behaviour of SIRTR on nonconvex binary classification problems. Let \(\{(a_i, b_i)\}_{i=1}^N\) denote the pairs forming a training set with \(a_i \in \mathbb {R}^n\) containing the entries of the ith example, and \(b_i\in \{0, 1\}\) representing the corresponding label. Then, we address the following minimization problem
where the nonconvex objective function \(f_N\) is obtained by composing a least-squares loss with the sigmoid function.
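With \(\sigma (t)=1/(1+e^{-t})\), this composition gives \(f_N(x)=\frac{1}{N}\sum _{i=1}^N(\sigma (a_i^Tx)-b_i)^2\). A minimal NumPy sketch of the objective and its gradient (illustrative only; the paper's experiments use MATLAB):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_and_grad(x, A, b):
    """Least-squares loss composed with the sigmoid, as in problem (69):
        f_N(x) = (1/N) * sum_i (sigmoid(a_i^T x) - b_i)^2
    A stores the examples a_i as rows; b holds the 0/1 labels."""
    N = A.shape[0]
    s = sigmoid(A @ x)
    r = s - b
    f = float(r @ r) / N
    g = (2.0 / N) * (A.T @ (r * s * (1.0 - s)))   # chain rule through the sigmoid
    return f, g
```

Note that the bottleneck is the product \(A x\) (i.e., the scalar products \(a_i^Tx\)), which is shared between the function and the gradient, as discussed for the cost accounting below.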
In Table 1, we report the information related to the datasets employed, including the number N of training examples, the dimension n of each example and the dimension \(N_T\) of the testing set \(I_{N_T}\).
We focus on three aspects: the classification error provided by the final iterate, the computational cost, and the occurrence of termination before full accuracy in function evaluations is reached. The last aspect is crucial because it indicates the ability of the inexact restoration approach to solve (69) with random models and to govern sampling and steplength selection.
The average classification error provided by the final iterate, say \(x_{\mathrm{fin}}\), is defined as
where \(b_i\) is the exact label of the \(i\)th instance of the testing set, and \(b_i^{pred}\) is the corresponding predicted label, given by \(b_i^{pred}={\text {max}}\{{\text {sign}}(a_i^Tx_{\mathrm{fin}}),0\}\).
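The predictor and the error measure can be sketched as follows (Python for illustration; the displayed formula (70) is read here as the fraction of misclassified test examples, consistent with the surrounding text):

```python
import numpy as np

def classification_error(x_fin, A_test, b_test):
    """Sketch of the error measure in the spirit of (70): fraction of
    test labels that the predictor b^pred = max{sign(a^T x_fin), 0}
    gets wrong.  A_test stores test examples as rows, b_test their
    0/1 labels."""
    pred = np.maximum(np.sign(A_test @ x_fin), 0.0)
    return float(np.mean(pred != b_test))

# tiny hypothetical example: x_fin = (1, 0) separates the two points
A = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.array([1.0, 0.0])
err = classification_error(np.array([1.0, 0.0]), A, b)   # -> 0.0
```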
The computational cost is measured in terms of full function and gradient evaluations. In our test problems, the main cost in the computation of \(\phi _i\), \(1\le i\le N\), is the scalar product \(a_i^Tx\): once this product is evaluated, it can be reused for computing \(\nabla \phi _i\). Nonetheless, following [32, Section 3.3], we count both function and gradient evaluations as if we were addressing a classification problem based on a neural net. Thus, computing a single function \(\phi _i\) requires \(\frac{1}{N}\) forward propagations, whereas the gradient evaluation corresponds to \(\frac{2}{N}\) propagations (an additional backward propagation is needed). Note that, once \(\phi _i\) is computed, the corresponding gradient \(\nabla \phi _i\) requires only \(\frac{1}{N}\) backward propagations. Hence, as in our implementation \(I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}\), the computational cost of SIRTR at each iteration k is determined by \(\frac{N_{k+1}^t+N_{k+1,g}}{N}\) propagations.
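This counting convention reduces to a one-line formula, sketched here for clarity (the sample sizes in the example are hypothetical):

```python
def cost_in_propagations(n_fun, n_grad, N):
    """Per-iteration cost of SIRTR in units of full propagations,
    following the counting in the text: each of the n_fun sampled
    functions costs 1/N of a forward pass and, since the gradient
    sample is a subset of the function sample, each of the n_grad
    sampled gradients adds only 1/N of a backward pass."""
    return (n_fun + n_grad) / N

# hypothetical sizes: N^t = 1050, N_g = 105 out of N = 10000 examples
cost = cost_in_propagations(1050, 105, 10_000)   # 0.1155 propagations
```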
For all experiments in this section, we run SIRTR with \(x_0=(0,0,\ldots ,0)^T\) as initial guess, and stop it when either a maximum of 1000 iterations is reached or a maximum of 500 full function evaluations is performed or the condition
with \(\epsilon =10^{-3}\), holds for a number of consecutive successful iterations such that the computational effort is equal to the effort needed in three iterations with full function and gradient evaluations.
Since the selection of sets \(I_{N_{k+1}^t}\) and \(I_{N_{k+1,g}}\) for computing \(f_{N_{k+1}^t}(x_k)\) and \(g_k\) is random, we perform 50 runs of SIRTR for each test problem. Results are reported in tables where the headings of the columns have the following meaning: cost is the overall number of full function and gradient evaluations averaged over the 50 runs,
err is the classification error given in (70) averaged over the 50 runs, and sub is the number of runs where the method stopped before reaching full accuracy in function evaluations.
In a first set of experiments, we investigate the choice of \(N_{k+1,g}\) by varying the factor \(c\in (0,1]\) in (68). In particular, letting \({\widetilde{c}}=1.2\) in (66), \(\mu =100/N\) in (67) and \(N_0=\lceil 0.1N\rceil \) as in [19], we test the values \(c\in \{0.1,0.2,1\}\). The results obtained are reported in Table 2. We note that the classification error varies only slightly with the choice of \(N_{k+1,g}\), and that selecting \(N_{k+1,g}\) as a small fraction of \(N_{k+1}^t\) is quite convenient from a computational point of view. By contrast, the choice \(N_{k+1,g}= N_{k+1}^t\) leads to the largest computational costs without providing a significant gain in accuracy. Besides the cost per iteration, equal to \(\frac{2N_{k+1}^t}{N}\) in this latter case, we observe that full accuracy in function evaluations is reached very often, especially for certain datasets, see e.g., cina0, codrna, covertype, ijcnn1, phishing, realsim. Remarkably, the results in Table 2 highlight that random models compare favourably with respect to cost and classification errors.
Next, we show that the SIRTR computational cost can be reduced by slowing down the growth rate of \(N_{k+1}^t\). This can be achieved by controlling the growth of \({\widetilde{N}}_{k+1}\), which affects \(N_{k+1}^t\) by means of (67). Letting \(c=0.1\), \(\mu =100/N\) and \(N_0=\lceil 0.1N\rceil \), we consider the choices \({\widetilde{c}}\in \{1.05,1.1,1.2\}\) in (66). Average results are reported in Table 3. We can observe that the fastest growth rate for \({\widetilde{N}}_{k+1}\) is generally more expensive than the other two choices, while the classification error is similar for all three choices. Moreover, significantly, for \({\widetilde{c}}= 1.05\) most runs stopped before reaching full function accuracy.
We now analyze three different values, \(N_0\in \{\lceil 0.001N\rceil , \lceil 0.01N\rceil ,\lceil 0.1N\rceil \}\), for the initial sample size \(N_0\). We apply SIRTR with \(\tilde{c}=1.05\) in (66), \(\mu =100/N\) in (67), and \(c=0.1\) in (68). Results are reported in Table 4. We can see that reducing \(N_0\) further decreases the number of full function/gradient evaluations on some datasets, and that for \(N_0= \lceil 0.01N\rceil \) the average classification error compares well with the error for \(N_0= \lceil 0.1N\rceil \); for instance, the best results for most datasets are obtained by shrinking \(N_0\) to \(1\%\) of the maximum sample size. We conclude by pointing out that most of the runs are performed without reaching full precision in function evaluations.
As a further confirmation of the efficiency of SIRTR, in Table 5 we report the sample sizes obtained on average at the stopping iteration of SIRTR with the parameter setting \(N_0=\lceil 0.01 N\rceil \), \(N_{k+1,g}=\lceil 0.1 { N_{k+1}^t}\rceil \), \({\widetilde{N}}_{k+1}=\min \{N,\lceil 1.05 N_k\rceil \}\), \(\mu =100/N\). More specifically, for each dataset, we show the mean value \({\overline{N}}_{\mathrm{fin}}\) obtained by averaging the sample sizes \(N_{\mathrm{fin},i}\), \( 1\le i\le 50\), used at the final iteration of SIRTR, the relative standard deviation \(s=\frac{1}{{\overline{N}}_{\mathrm{fin}}} \sqrt{\frac{\sum _{i=1}^{50}(N_{\mathrm{fin},i}-{\overline{N}}_{\mathrm{fin}})^2}{50}}\) as a measure of dispersion of the final sample sizes with respect to the mean value, and the minimum and maximum sample sizes \(N_{\mathrm{fin}}^{\min },N_{\mathrm{fin}}^{\max }\) observed at the final iteration out of the 50 runs. From the reported values, we deduce that SIRTR terminates with a final sample size which is much smaller, on average, than the maximum sample size N.
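The dispersion measure s can be computed directly from its definition; a small sketch (the sample sizes in the example are hypothetical):

```python
import math

def relative_std(sizes):
    """Relative standard deviation of the final sample sizes, as defined
    in the text:  s = sqrt( sum_i (N_i - mean)^2 / n ) / mean."""
    n = len(sizes)
    mean = sum(sizes) / n
    return math.sqrt(sum((v - mean) ** 2 for v in sizes) / n) / mean

# e.g. two runs ending with sample sizes 1 and 3: mean 2, std 1, s = 0.5
s = relative_std([1, 3])
```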
Finally, in Figs. 1 and 2, we report the plots of the sample sizes \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) against the number of iterations, obtained by running SIRTR on the a9a and mnist datasets, respectively. In particular, we let either \(\mu =100/N\) or \(\mu =1\) in the update rule (67), \(\tilde{c}=1.05\) in (66), \(c=0.1\) in (68) and \(N_0=\lceil 0.1 N\rceil \). Note that a larger \(\mu \) allows both \(N_{k+1}^t\) and \({\widetilde{N}}_{k+1}\) to decrease in the first iterations, with a linear growth rate imposed only in later iterations. This behaviour is due to the update condition (67), which naturally forces \(N_{k+1}^t\) to coincide with \({\widetilde{N}}_{k+1}\) when \(\delta _k\) is sufficiently small. For both choices of \(\mu \), we see that \(N_{k+1}^t\) can grow more slowly than \({\widetilde{N}}_{k+1}\) at some iterations, thus reducing the computational cost per iteration of SIRTR.
4.2 Comparison with TRish
In this section we compare the performance of SIRTR with the so-called Trust-Region-ish algorithm (TRish) recently proposed in [33]. TRish is a stochastic gradient method based on a trust-region methodology: normalized steps are taken in a dynamic manner whenever the norm of the stochastic gradient lies within a prefixed interval. In particular, the \(k\)th iteration of TRish is given by
where \(\alpha _k>0\) is the steplength parameter, \(0<\gamma _{2,k}<\gamma _{1,k}\) are constants, and \(g_k\in {\mathbb {R}}^{n}\) is a stochastic gradient estimate. This algorithm has proven particularly effective on binary classification and neural network training, especially when compared with the standard stochastic gradient algorithm [33, Section 4].
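The displayed TRish update did not survive extraction; the following sketch reconstructs the three-branch step from the description in [33] (treat it as an assumption-laden illustration, not the authors' code; parameter values in the example are hypothetical):

```python
import numpy as np

def trish_step(x, g, alpha, gamma1, gamma2):
    """One TRish iteration, following the description in [33]: the step
    is the scaled negative stochastic gradient, normalized whenever
    ||g|| falls in the prefixed interval [1/gamma1, 1/gamma2]
    (requires 0 < gamma2 < gamma1)."""
    norm_g = np.linalg.norm(g)
    if norm_g < 1.0 / gamma1:
        return x - alpha * gamma1 * g      # small gradient: scale up
    if norm_g <= 1.0 / gamma2:
        return x - alpha * g / norm_g      # normalized, trust-region-like step
    return x - alpha * gamma2 * g          # large gradient: scale down

# hypothetical values gamma1 = 4, gamma2 = 1/2: steps are normalized
# whenever ||g|| lies in [1/4, 2]
x_new = trish_step(np.zeros(2), np.array([1.0, 0.0]), 0.1, 4.0, 0.5)
```

The three branches make the step length continuous in \(\Vert g_k\Vert \), which is the "careful step normalization" that distinguishes TRish from plain stochastic gradient.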
For our numerical tests, we implement TRish with subsampled gradients \(g_k=\nabla f_{S}(x_k)\) defined in (5). The steplength is constant, \(\alpha _k=\alpha \), \(\forall k\ge 0\), and \(\alpha \) is chosen in the set \( \{10^{-3},10^{-1},\sqrt{10^{-1}},1,\sqrt{10}\}\). Following the procedure in [33, Section 4], we use constant parameters \(\gamma _{1,k}\equiv \gamma _1\), \(\gamma _{2,k}\equiv \gamma _2\) and select \(\gamma _1,\, \gamma _2\) as follows. First, the stochastic gradient algorithm [4] is run with constant steplength equal to 1; second, the average norm G of the stochastic gradient estimates throughout the runs is computed; third, \(\gamma _1,\, \gamma _2\) are set as \(\gamma _1=\frac{4}{G}\), \(\gamma _2=\frac{1}{2G}\).
First, we compare TRish with SIRTR on the nonconvex optimization problem (69), using a9a, htru2, mnist, and phishing as datasets (see Table 1). Based on the previous section, we equip SIRTR with \(N_0=\lceil 0.01 N\rceil \), \(N_{k+1,g}=\lceil 0.1 { N_{k+1}^t}\rceil \), \({\widetilde{N}}_{k+1}=\min \{N,\lceil 1.05 N_k\rceil \}\), \(\mu =100/N\). In TRish, the sample size S of the stochastic gradient estimates is \(\lceil 0.01N\rceil \), which corresponds to the first sample size used in SIRTR. We run each algorithm for ten epochs on the datasets a9a and htru2 using the null initial guess. We perform 10 runs to report results on average.
After tuning, the parameter setting for TRish was \( \gamma _1 \approx 34.5805\), \( \gamma _2 \approx 4.3226\) for a9a, \( \gamma _1 \approx 57.9622\), \( \gamma _2 \approx 7.2453\) for htru2, \( \gamma _1 \approx 23.4376\), \( \gamma _2 \approx 2.9297\) for mnist, and \(\gamma _1 \approx 50.6409\), \( \gamma _2 \approx 6.3301\) for phishing. In Fig. 3, we report the decrease of the (average) classification error, training loss \(f_N\) and testing loss, \(f_{N_T}(x)= \frac{1}{N_T}\sum _{i\in I_{N_T}} \phi _i(x)\), over the (average) number of full function and gradient evaluations required by the algorithms. From these plots, we can see that SIRTR performs comparably to the best implementations of TRish on a9a, htru2, mnist, while showing a good, though not optimal, performance on phishing.
In accordance with the experience in [33], the parameters \(\gamma _1\), \(\gamma _2\) and \(\alpha \) are problem-dependent. For instance, the best performance of TRish is obtained with \(\alpha =10^{-1}\) for a9a and with \(\alpha =10^{-3}\) for htru2; by contrast, SIRTR performs well with a unique setting of the parameters, which is a key feature of adaptive stochastic optimization methods.
As a second test, we compare the performance of SIRTR and TRish on a different nonconvex optimization problem arising from nonlinear regression. Letting \(\{(a_i,b_i)\}_{i=1}^N\) denote the training set, where \(a_i\in \mathbb {R}^n\) and \(b_i\in \mathbb {R}\) represent the feature vector and the target variable of the ith example, respectively, we aim at solving the following problem
where \(h(\cdot ;x):\mathbb {R}^n\rightarrow \mathbb {R}\) is a nonlinear prediction function.
For this second test, we use the air dataset [29], which contains 9358 instances of (hourly averaged) concentrations of polluting gases, as well as temperatures and relative/absolute air humidity levels, recorded at each hour from March 2004 to February 2005 by a device located in a polluted area within an Italian city.
As in [34], our goal is to predict the benzene (C6H6) concentration from the knowledge of \(n=7\) features, including carbon monoxide (CO), nitrogen oxides (NO\(_x\)), ozone (O\(_3\)), non-methanic hydrocarbons (NMHC), nitrogen dioxide (NO\(_2\)), air temperature, and relative air humidity. First, we preprocess the dataset by removing the examples for which the benzene concentration is missing, reducing the dataset dimension from 9357 to 8991. Then, we employ \(70\%\) of the dataset for training (\(N=6294\)) and the remaining \(30\%\) for testing (\(N_T=2697\)). Since the concentration values have been recorded hourly, this means that we use the data measured in the first 9 months for the training phase and the data related to the last 3 months for the testing phase. Finally, denoting with \(D=(d_{ij})\in \mathbb {R}^{(N+N_T)\times n}\) the matrix containing all the dataset examples along its rows, and setting
we scale all data values into the interval [0, 1] as follows
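The displayed scaling formula was lost in extraction, but the text describes a standard column-wise min-max transform into [0, 1]; a sketch (with toy data):

```python
import numpy as np

def minmax_scale(D):
    """Column-wise min-max scaling into [0, 1].  The displayed formula is
    not in the extracted text; this implements the standard transform
    (d_ij - min_i d_ij) / (max_i d_ij - min_i d_ij) per feature column,
    which matches the description."""
    d_min = D.min(axis=0)
    d_max = D.max(axis=0)
    return (D - d_min) / (d_max - d_min)

# toy matrix: two features on very different scales
D = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0]])
S = minmax_scale(D)   # each column now spans exactly [0, 1]
```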
We apply SIRTR and TRish to problem (72), where the prediction function \(h(\cdot ;x)\) is chosen as a feedforward neural network with a \(7\times 5 \times 1\) architecture (see [34] and references therein), with the two hidden layers both equipped with the linear activation function, and the output layer with the sigmoid activation function. We equip the two algorithms with the same parameter values employed in the previous tests, and run them 10 times for 10 epochs, using a random initial guess in the interval \([-\frac{1}{2},\frac{1}{2}]\).
In Fig. 4, we report the decrease of the (average) training and testing losses provided by SIRTR and by TRish with different choices of the steplength \(\alpha \), whereas in Fig. 5 we show the benzene concentration estimates provided by the algorithms against the true concentration. These results confirm that the performance of SIRTR is comparable with that of TRish equipped with the best choice of the steplength, and show the ability of SIRTR to automatically tune the steplength so as to obtain satisfactory results in terms of testing and training accuracy.
5 Conclusions
We proposed a stochastic gradient method coupled with a trust-region strategy and an inexact restoration approach for solving finite-sum minimization problems. Functions and gradients are subsampled and the batch size is governed by the inexact restoration approach and the trust-region acceptance rule. We established the theoretical properties of the method and gave a worst-case complexity result on the expected number of iterations required to reach an approximate first-order optimality point. Numerical experience showed that the proposed method provides good results while keeping the overall computational cost relatively low.
Data availability
The dataset CINA0 is no longer available in repositories but is available from the corresponding author on reasonable request. The other datasets analyzed during the current study are available in the repositories: http://www.csie.ntu.edu.tw/~cjlin/libsvm, http://yann.lecun.com/exdb/mnist, https://archive.ics.uci.edu/ml/index.php
References
Bellavia, S., Bianconcini, T., Krejić, N., Morini, B.: Subsampled first-order optimization methods with applications in imaging. In: Chen, K., Schönlieb, C., Tai, X., Younes, L. (eds.) Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging. Springer, Switzerland AG (2021)
Curtis, F.E., Scheinberg, K.: Optimization methods for supervised machine learning: from linear models to deep learning. Leading Developments from INFORMS Communities. INFORMS, pp. 89–114 (2017)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Gower, R.M., Schmidt, M., Bach, F., Richtárik, P.: Variance-reduced methods for machine learning. In: Proceedings of the IEEE, vol. 108, pp. 1968–1983 (2020)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 315–323 (2013)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, vol. 70, pp. 2613–2621 (2017)
Curtis, F.E., Scheinberg, K.: Adaptive stochastic optimization: a framework for analyzing stochastic optimization algorithms. IEEE Signal Process. Mag. 37(5), 32–42 (2020)
Bellavia, S., Gurioli, G., Morini, B.: Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA J. Numer. Anal. 41(1), 764–799 (2021)
Bellavia, S., Gurioli, G., Morini, B., Toint, P.L.: Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives. J. Complex. 68, 101591 (2022)
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method via submartingales. INFORMS J. Optim. 1, 92–119 (2019)
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Math. Program. 169(2), 447–487 (2018)
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30, 349–376 (2020)
Krejić, N., Martínez, J.M.: Inexact restoration approach for minimization with inexact evaluation of the objective function. Math. Comput. 85, 1775–1791 (2016)
Birgin, G.E., Krejić, N., Martínez, J.M.: On the employment of inexact restoration for the minimization of functions whose evaluation is subject to programming errors. Math. Comput. 87(311), 1307–1326 (2018)
Birgin, G.E., Krejić, N., Martínez, J.M.: Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact. Math. Comput. 89, 253–278 (2020)
Bellavia, S., Krejić, N., Morini, B.: Inexact restoration with subsampled trust-region methods for finite-sum minimization. Comput. Optim. Appl. 76, 701–736 (2020)
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)
Bellavia, S., Gurioli, G., Morini, B., Toint, P.L.: Trust-region algorithms: probabilistic complexity and intrinsic noise with applications to subsampling techniques. EURO J. Comput. Optim. 10, 100043 (2022)
Wang, X., Yuan, Y.-X.: Stochastic trust region methods with trust region radius depending on probabilistic models. J. Comput. Math. 40(2), 294–334 (2022)
Chauhan, V.K., Sharma, A., Dahiya, K.: Stochastic trust region inexact Newton method for large-scale machine learning. Int. J. Mach. Learn. Cybern. 11(7), 1541–1555 (2020)
Martínez, J.M., Pilotta, E.A.: Inexact restoration algorithms for constrained optimization. J. Optim. Theory Appl. 104, 135–163 (2000)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016) http://www.deeplearningbook.org
Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM J. Optim. 31(2), 1489–1518 (2021)
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169, 337–375 (2018)
Bertsekas, D.P. (ed.): Nonlinear Programming. Athena Scientific, Belmont, Massachusetts (2016)
Lichman, M.: UCI machine learning repository. (2013) https://archive.ics.uci.edu/ml/index.php
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2011) http://www.csie.ntu.edu.tw/~cjlin/libsvm
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, vol. 86, pp. 2278–2324 (1998)
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: an empirical study. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 199–207 (2020)
Curtis, F.E., Scheinberg, K., Shi, R.: A stochastic trust region algorithm based on careful step normalization. INFORMS J. Optim. 1(3), 200–220 (2019)
De Vito, S., Massera, E., Piga, M., Martinotto, L., Di Francia, G.: On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens. Actuators B 129, 750–757 (2008)
Funding
Open access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement. The research that led to the present paper was partially supported by a grant of the group GNCS of INdAM and partially developed within the Mobility Project "Second order methods for optimization problems in Machine Learning" (ID: RS19MO05), executive programme of scientific and technological cooperation between the Italian Republic and the Republic of Serbia 2019-2022. The work of the second author was supported by the Serbian Ministry of Education, Science and Technological Development, Grant No. 451-03-9/2021-14/200125.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bellavia, S., Krejić, N., Morini, B. et al. A stochastic firstorder trustregion method with inexact restoration for finitesum minimization. Comput Optim Appl 84, 53–84 (2023). https://doi.org/10.1007/s10589022004307