A stochastic first-order trust-region method with inexact restoration for finite-sum minimization

We propose a stochastic first-order trust-region method with inexact function and gradient evaluations for solving finite-sum minimization problems. Using a suitable reformulation of the given problem, our method combines the inexact restoration approach for constrained optimization with the trust-region procedure and random models. Unlike other recent stochastic trust-region schemes, our proposed algorithm improves feasibility and optimality in a modular way. We provide the expected number of iterations for reaching a near-stationary point by imposing some probabilistic accuracy requirements on random functions and gradients which are, in general, less stringent than the corresponding ones in the literature. We validate the proposed algorithm on some nonconvex optimization problems arising in binary classification and regression, showing that it performs well in terms of cost and accuracy and reduces the burdensome tuning of the hyper-parameters involved.


Introduction.
In this paper we consider the finite-sum minimization problem

min_{x ∈ IR^n} f_N(x), with f_N(x) = (1/N) Σ_{i=1}^N φ_i(x), (1.1)

where N is finite but very large and φ_i : IR^n → IR, 1 ≤ i ≤ N, are continuously differentiable. A number of important problems can be stated in this form, e.g., classification problems in machine learning, data-fitting problems, and sample average approximations of an objective function given in the form of a mathematical expectation.
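As a concrete toy instance of this setting, the following sketch builds a finite sum of one-dimensional quadratics in pure Python; all names and data are illustrative, not taken from the paper:

```python
import random

# Toy instance of (1.1): phi_i(x) = 0.5 * (x - t_i)^2 with scalar x, so that
# f_N(x) = (1/N) * sum_i phi_i(x) is minimized at the sample mean of the t_i.
random.seed(0)
N = 1000
t = [random.gauss(0.0, 1.0) for _ in range(N)]

def phi(i, x):
    return 0.5 * (x - t[i]) ** 2

def grad_phi(i, x):
    return x - t[i]

def f_N(x):
    return sum(phi(i, x) for i in range(N)) / N

def grad_f_N(x):
    return sum(grad_phi(i, x) for i in range(N)) / N

x_star = sum(t) / N   # minimizer: the full gradient vanishes here
```

Evaluating f_N or grad_f_N costs N per-example evaluations, which motivates the subsampled approximations discussed below.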
In recent years, the need for efficient methods for solving (1.1) has resulted in a large body of literature, and a number of methods have been proposed and analyzed; see, e.g., the reviews [3,12,21].
It is common to employ subsampled approximations of the objective function and its derivatives with the aim of reducing the computational cost. Focusing on first-order methods, the stochastic gradient method [33] and more contemporary variants like SVRG [24,25], SAG [34], ADAM [26] and SARAH [31] are widely used for their simplicity and low per-iteration cost. They do not call for function evaluations, but require tuning the learning rate and further possible hyper-parameters such as the mini-batch size. Since the tuning effort may be very computationally demanding [19], more sophisticated approaches use stochastic linesearch or trust-region strategies to adaptively choose the learning rate, see [3,5,7,11,18,19,32]. In this context, function and gradient approximations have to satisfy sufficient accuracy requirements with some probability. This, in turn, in case of approximations via sampling, requires adaptive choices of the sample sizes used.
In a further stream of works, problem (1.1) is reformulated as a constrained optimization problem and the sample size is computed deterministically using the Inexact Restoration (IR) approach. The IR approach has been successfully combined with either the linesearch strategy [27] or the trust-region strategy [4,9,10]; in these papers, function and gradient estimates are built with gradually increasing accuracy, averaging on the same sample.
We propose a novel trust-region method with random models based on the IR methodology. In our proposed method, feasibility and optimality are improved in a modular way, and the resulting procedure differs from the existing stochastic trust-region schemes [1,6,11,18,35] in the acceptance rule for the step. We provide a theoretical analysis and give a bound on the expected iteration complexity to satisfy an approximate first-order optimality condition; this calls for accuracy conditions on random gradients that are assumed to hold with some sufficiently large but fixed probability and are, in general, less stringent than the corresponding ones in [1,6,11,18,35]. Our theoretical analysis improves over the one for the stochastic trust-region method with inexact restoration given in [4], since we no longer rely on standard theory for deterministic unconstrained optimization, invoked eventually when functions and gradients are computed exactly.
The paper is organized as follows. In Section 2, we give an overview of random models employed in the trust-region framework and introduce the main features of our contribution. The new algorithm is proposed in Section 3 and studied theoretically with respect to its iteration complexity. Extensive numerical results are presented in Section 4.

2. Trust-region method with random models. Variants of the standard trust-region method based on the use of random models have been presented, to our knowledge, in [1,4,6,11,17,18,35]. They consist in the adaptation of the trust-region framework to the case where random estimates of the derivatives are introduced and function values are either computed exactly [1] or replaced by stochastic estimates [4,6,11,17,18,35].
The computation and acceptance of the iterates parallel the standard trust-region mechanism, and the success of the procedure relies on function values and models being sufficiently accurate with fixed and large enough probability. The accuracy requirements in the mentioned works show many similarities; here we illustrate some issues related to the works [11,18,35], which are closest to our approach.
Let ‖·‖ denote the 2-norm throughout the paper. At iteration k of a first-order stochastic trust-region method, given x_k, the positive trust-region radius δ_k and a random approximation g_k of ∇f_N(x_k), consider the linear model

ς_k(x_k + s) = f_N(x_k) + g_k^T s.

Thus, the trust-region step takes the form s_k = −δ_k g_k/‖g_k‖. Two estimates f_{k,0} and f_{k,s} of f_N at x_k and x_k + s_k, respectively, are employed to either accept or reject the trial point x_k + s_k. The classical ratio between the actual and predicted reduction is replaced by

ρ_k = (f_{k,0} − f_{k,s}) / (δ_k ‖g_k‖), (2.1)

and a successful iteration is declared when ρ_k ≥ η_1 and ‖g_k‖ ≥ η_2 δ_k for some constants η_1 ∈ (0, 1) and positive, possibly large, η_2. Note that both the step s_k and the denominator in (2.1) are independent of f_N(x_k). Furthermore, note that a successful iteration might not yield an actual reduction in f_N, because the quantities involved in ρ_k are random approximations of the true objective values.
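The step and acceptance test just described can be sketched as follows (a minimal illustration in Python, assuming the estimates f_{k,0}, f_{k,s} and g_k are supplied externally; the constants are placeholders):

```python
import math

def tr_step(g, delta):
    """First-order trust-region step s_k = -delta_k * g_k / ||g_k|| (g nonzero)."""
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    return [-delta * gi / norm_g for gi in g], norm_g

def is_successful(f0_est, fs_est, norm_g, delta, eta1=0.1, eta2=1.0):
    """Acceptance test: rho_k >= eta1 and ||g_k|| >= eta2 * delta_k, where
    rho_k uses the model decrease delta_k * ||g_k|| as denominator, cf. (2.1)."""
    rho = (f0_est - fs_est) / (delta * norm_g)
    return rho >= eta1 and norm_g >= eta2 * delta

g = [3.0, -4.0]                                   # ||g|| = 5
s, norm_g = tr_step(g, delta=0.5)                 # step of length 0.5
accepted = is_successful(1.0, 0.7, norm_g, 0.5)   # rho = 0.3 / 2.5 = 0.12
```

Note that the test never touches the exact f_N: only the two random estimates enter the ratio.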
The condition ‖g_k‖ ≥ η_2 δ_k is not typical of standard trust-region methods and reflects the fact that δ_k controls the accuracy of functions and gradients. Specifically, the models used are required to be sufficiently accurate with some probability. The model ς_k is supposed to be, p_M-probabilistically, a κ*-fully linear model of f_N on the ball B(x_k, δ_k), i.e., the requirement

|f_N(y) − ς_k(y)| ≤ κ* δ_k² and ‖∇f_N(y) − g_k‖ ≤ κ* δ_k for all y ∈ B(x_k, δ_k), (2.2)

with κ* > 0, has to be fulfilled at least with probability p_M ∈ (0, 1). Moreover, the estimates f_{k,0} and f_{k,s} are supposed to be p_f-probabilistically ε_F-accurate estimates of f_N(x_k) and f_N(x_k + s_k), i.e., the requirement

|f_{k,0} − f_N(x_k)| ≤ ε_F δ_k² and |f_{k,s} − f_N(x_k + s_k)| ≤ ε_F δ_k², (2.3)

has to be fulfilled at least with probability p_f ∈ (0, 1). Clearly, if f_N is computed exactly, then condition (2.3) is trivially satisfied. The convergence analysis in [11,18,35] shows that, for p_M and p_f sufficiently large, lim_{k→∞} δ_k = 0 almost surely. Moreover, if f_N is bounded from below and ∇f_N is Lipschitz continuous, then lim_{k→∞} ‖∇f_N(x_k)‖ = 0 almost surely. Interestingly, the accuracy in (2.2) and (2.3) increases as the trust-region radius gets smaller, while the probabilities p_M and p_f remain fixed.
For problem (1.1) it is straightforward to build approximations of f_N and ∇f_N by the sample average approximations

f_M(x) = (1/M) Σ_{i∈I_M} φ_i(x), g(x) = (1/S) Σ_{i∈I_S} ∇φ_i(x), (2.4)

where I_M and I_S are subsets of {1, ..., N} of cardinality |I_M| = M and |I_S| = S, respectively. The choice of sample sizes such that (2.2) and (2.3) hold in probability is discussed in [18, §5]. Let V_f and V_g be upper bounds on the variances of the φ_i's and ∇φ_i's, with E being the expected value of a random variable, and assume

M ≥ V_f / ((1 − p_f) ε_F² δ_k⁴), S ≥ V_g / ((1 − p_g) κ*² δ_k²), and max{M, S} ≤ N. (2.5)

Then f_{k,0} and f_{k,s} built as in (2.4) with sample size M satisfy (2.3) with probability p_f, while g_k built as in (2.4) with sample size S satisfies ‖∇f_N(x_k) − g_k‖ ≤ κ* δ_k with probability p_g. Furthermore, using a Taylor expansion and the Lipschitz continuity of ∇f_N, it can be proved that (2.2) is met with probability p_M = p_f p_g; consequently, a κ*-fully linear model of f_N is available with probability p_M. In principle, conditions (2.2), (2.3) and lim_{k→∞} δ_k = 0 imply that f_{k,0}, f_{k,s} and g_k will be computed at full precision for k sufficiently large. On the other hand, in applications such as machine learning, reaching full precision is unlikely, since N is very large and termination is based on the maximum allowed computational effort or on the validation error.
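A hedged sketch of a Chebyshev-style sample-size rule in the spirit of (2.5): the function sample grows like δ^{-4} and the gradient sample like δ^{-2}, both capped at N. The variance bounds and probabilities below are illustrative placeholders:

```python
import math

def sample_sizes(delta, N, V_f=1.0, V_g=1.0, p_f=0.5, p_g=0.5,
                 eps_F=1.0, kappa=1.0):
    """Chebyshev-style sizes: the function sample M grows like delta^-4 and
    the gradient sample S like delta^-2, both capped at the full size N.
    All constants (variance bounds, probabilities) are placeholders."""
    M = math.ceil(V_f / ((1.0 - p_f) * eps_F ** 2 * delta ** 4))
    S = math.ceil(V_g / ((1.0 - p_g) * kappa ** 2 * delta ** 2))
    return min(M, N), min(S, N)

M1, S1 = sample_sizes(delta=1.0, N=10 ** 6)
M2, S2 = sample_sizes(delta=0.1, N=10 ** 6)   # smaller radius, larger samples
```

As δ_k shrinks, both sizes grow until they hit the cap N, which is exactly the "full precision" regime discussed above.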
2.1. Our contribution. We propose a trust-region procedure with random models based on (2.4) and combine it with the inexact restoration (IR) method for constrained optimization [30]. To this end, we make a simple transformation of (1.1) into a constrained problem. Specifically, letting I_M be an arbitrary nonempty subset of {1, ..., N} of cardinality |I_M| = M, we reformulate problem (1.1) as

min_{x ∈ IR^n, M ∈ {1,...,N}} f_M(x) subject to M = N, (2.6)

with f_M as in (2.4). Using the IR strategy allows us to improve feasibility and optimality in a modular way and gives rise to a procedure that differs from the existing trust-region schemes in the following respects. First, at each iteration a reference sample size is fixed and used as a guess for the approximation of function values. Second, the acceptance rule for the step is based on the condition ‖g_k‖ ≥ η_2 δ_k, for some η_2 > 0, and on a sufficient decrease condition on a merit function that measures both the reduction of the objective function and the improvement in feasibility. Finally, the expected iteration complexity to satisfy an approximate first-order optimality condition is given, provided that, at each iteration k, the gradient estimates satisfy accuracy requirements of order O(δ_k); such accuracy requirements implicitly govern function approximations and are, in general, less stringent than the corresponding ones in [1,6,11,18,35], as carefully detailed in Section 3.
Our theoretical analysis improves over the analysis carried out in [4] for a similar stochastic trust-region method coupled with inexact restoration, since here we do not rely on the occurrence of full precision, M = N in (2.6), reached eventually, and do not apply standard theory for unconstrained optimization. In fact, the expected number of iterations until a prescribed accuracy is reached is provided without invoking full precision.
3. The Algorithm. In this section we introduce our new algorithm, referred to as SIRTR (Stochastic Inexact Restoration Trust Region).
First, we introduce some basic ingredients of IR methods. The level of infeasibility with respect to the constraint M = N in (2.6) is measured by a function h satisfying the following assumption.

Assumption 3.1. Let h : {1, 2, ..., N} → IR be a monotonically decreasing function such that h(1) > 0 and h(N) = 0.
This assumption implies that there exist some positive h_min and h_max such that

h_min ≤ h(M) ≤ h_max, 1 ≤ M ≤ N − 1. (3.1)

IR methods improve feasibility and optimality in a modular way, using a merit function to balance the progress. Since the reductions in the objective function and in the infeasibility might be achieved to a different degree, the IR method employs the merit function

Ψ(x, M, θ) = θ f_M(x) + (1 − θ) h(M), (3.2)

with θ ∈ (0, 1). Our SIRTR algorithm is a trust-region method that employs first-order random models. At a generic iteration k, we fix a trial sample size N^t_{k+1} and build a linear model m_k(p) around x_k of the form

m_k(p) = f_{N^t_{k+1}}(x_k) + g_k^T p, (3.3)

where g_k is a random estimator of ∇f_N(x_k). Then, we consider the trust-region problem

min_{‖p‖ ≤ δ_k} m_k(p), (3.4)

whose solution is

p_k = −δ_k g_k / ‖g_k‖. (3.5)

As in standard trust-region methods, we distinguish between successful and unsuccessful iterations. However, we do not employ the classical acceptance condition here, but a more elaborate one that involves the merit function (3.2). The proposed method is sketched in Algorithm 3.1 and its steps are now discussed. At a generic iteration k, we have at hand the outcome of the previous iteration: the iterate x_k, the sample sizes N_k and N̄_k, the penalty parameter θ_k, and the flag iflag. If iflag = succ, the previous iteration was successful, i.e., x_k = x_{k−1} + p_{k−1}; if iflag = unsucc, the previous iteration was unsuccessful, i.e., x_k = x_{k−1}.
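For illustration, h(M) = (N − M)/N is one simple function satisfying Assumption 3.1, and the merit function then combines the objective estimate with this infeasibility measure (a sketch; this specific h is our example, not necessarily the paper's choice):

```python
N = 1000

def h(M):
    """One simple infeasibility measure satisfying Assumption 3.1:
    monotonically decreasing, h(1) > 0, h(N) = 0."""
    return (N - M) / N

def merit(f_M_x, M, theta):
    """Merit function: convex combination of the objective estimate f_M(x)
    and the infeasibility h(M), with theta in (0, 1)."""
    return theta * f_M_x + (1.0 - theta) * h(M)
```

A smaller θ weighs feasibility (growing the sample toward N) more heavily than the reduction of the sampled objective.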
The scheduling procedure for generating the trial sample size N^t_{k+1} consists of Steps 1 and 2 of SIRTR. At Step 1, we determine a reference sample size N̄_{k+1} ≤ N. If iflag = succ, then the infeasibility measure h is sufficiently decreased, as stated in (3.13). If iflag = unsucc, N̄_{k+1} is left unchanged from the previous iteration, i.e., N̄_{k+1} = N̄_k. We remark that (3.13) trivially implies N̄_{k+1} = N if N_k = N, and that it holds at each iteration, even when it is not explicitly enforced at Step 1 (see the forthcoming Lemma 3.1). In principle, N̄_{k+1} could serve as the trial sample size, but we aim at giving more freedom to the sample size selection process. Thus, at Step 2, we choose a trial sample size N^t_{k+1} complying with condition (3.14). On the one hand, such a condition allows the choice N^t_{k+1} < N̄_{k+1} in order to reduce the computational effort; on the other hand, the choice N^t_{k+1} ≥ N̄_{k+1} is also possible in order to satisfy specific accuracy requirements that will be specified later. When N^t_{k+1} < N̄_{k+1}, condition (3.14) rules the largest possible distance between N^t_{k+1} and N̄_{k+1} in terms of δ_k; in case N^t_{k+1} ≥ N̄_{k+1}, (3.14) is trivially satisfied.

At Step 3, we form the linear random model (3.3) and compute its minimizer. Specifically, we fix the cardinality N_{k+1,g} and choose a set of indices I_{N_{k+1,g}} ⊆ {1, ..., N} of cardinality N_{k+1,g}. Then, we compute the estimator g_k of ∇f_N(x_k) as

g_k = (1/N_{k+1,g}) Σ_{i∈I_{N_{k+1,g}}} ∇φ_i(x_k), (3.6)

and the solution p_k in (3.5) of the trust-region subproblem (3.4). Further, we compute m_k(p_k), where m_k is defined in (3.3) and

f_{N^t_{k+1}}(x_k) = (1/N^t_{k+1}) Σ_{i∈I_{N^t_{k+1}}} φ_i(x_k), (3.7)

with I_{N^t_{k+1}} ⊆ {1, ..., N} being a set of cardinality N^t_{k+1}.

At Step 4, we compute the new penalty term θ_{k+1}. The computation relies on the predicted reduction Pred_k(θ) defined in (3.8), with θ ∈ (0, 1); this predicted reduction is a convex combination of the usual predicted reduction of the model and of the reduction in the infeasibility measure. If (3.9) is satisfied at θ = θ_k, then θ_{k+1} = θ_k; otherwise θ_{k+1} is computed as the largest value for which the inequality holds (see the forthcoming Lemma 3.3).
Step 5 establishes whether the iteration is successful. To this end, given a point x and θ ∈ (0, 1), the actual reduction Ared_k(x, θ) of the merit function Ψ at x is given in (3.10), and the iteration is successful whenever the two conditions (3.11) and (3.12) are both satisfied; otherwise the iteration is declared unsuccessful. If the iteration is successful, we accept the step and the trial sample size, set iflag = succ, and possibly increase the trust-region radius through (3.16); the upper bound δ_max on the trust-region size is imposed in (3.16). In case of unsuccessful iterations, we reject both the step and the trial sample size, set iflag = unsucc, and decrease the trust-region size.
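The bookkeeping of Step 5 can be sketched as follows (illustrative Python; the sufficient decrease test on the merit function, condition (3.11), is abstracted into a boolean argument, and the parameter values are placeholders):

```python
def step5_update(x, p, delta, N_trial, N_cur, merit_decrease_ok, norm_g,
                 gamma=2.0, delta_max=100.0, eta2=1e-6):
    """Accept/reject the trial point and resize the trust region. A success
    requires both the sufficient decrease of the merit function (condition
    (3.11), abstracted here into a boolean) and ||g_k|| >= eta2 * delta_k
    (condition (3.12))."""
    if merit_decrease_ok and norm_g >= eta2 * delta:
        x_new = [xi + pi for xi, pi in zip(x, p)]
        return x_new, min(gamma * delta, delta_max), N_trial, "succ"
    # unsuccessful: keep the iterate and the sample size, shrink the radius
    return x, delta / gamma, N_cur, "unsucc"

x1, d1, n1, flag1 = step5_update([0.0], [0.5], 1.0, 200, 100, True, 1.0)
x2, d2, n2, flag2 = step5_update([0.0], [0.5], 1.0, 200, 100, False, 1.0)
```

On success both the step and the trial sample size are accepted; on failure both are rejected and only the radius changes.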
Concerning conditions (3.11) and (3.12), we observe that the former mimics the classical acceptance criterion of standard trust-region methods, while the latter drives δ_k to zero as ‖g_k‖ tends to zero.
We conclude the description of Algorithm 3.1 by showing that condition (3.13) holds at all iterations, even when it is not explicitly enforced at Step 1.
Lemma 3.1. Let Assumption 3.1 hold and let r ∈ (0, 1) be the scalar in Algorithm 3.1. The sample sizes N̄_{k+1} ≤ N and N_k ≤ N generated by Algorithm 3.1 satisfy (3.18).

Proof. We observe that, by Assumption 3.1, (3.18) trivially holds whenever N̄_{k+1} = N. Otherwise, we proceed by induction: the thesis trivially holds for k = 0, as we set iflag=succ at the first iteration and enforce (3.18) at Step 1; for the subsequent iterations, (3.18) is enforced directly after successful iterations, while after unsuccessful iterations the reference sample size is left unchanged and (3.18) is inherited from the previous iteration. □

3.1. On the sequences {θ_k} and {δ_k}. In this section, we analyze the properties of Algorithm 3.1. In particular, we prove that the sequence {θ_k} is nonincreasing and uniformly bounded away from zero, and that the trust-region radius δ_k tends to 0 as k → ∞. We make the following assumption.

Assumption 3.2. The functions φ_i are continuously differentiable for i = 1, ..., N. There exist Ω ⊂ IR^n and constants f_low, f_up such that

f_low ≤ φ_i(x) ≤ f_up for all x ∈ Ω, 1 ≤ i ≤ N,

and all iterates generated by Algorithm 3.1 belong to Ω.
Remark 3.2. In the context of machine learning, the above assumption is verified in several cases, e.g., the least-squares loss function coupled with either the sigmoid, the softmax or the hyperbolic tangent activation function; the least-squares loss function coupled with ReLU or ELU activation functions and proper bounds on the unknowns; the logistic loss function coupled with proper bounds on the unknowns [23].
In the analysis that follows we will consider two options for x in (3.10), x = x k +p k for successful iterations and x = x k for unsuccessful iterations.
Our first result characterizes the sequence {θ k } of the penalty parameters; the proof follows closely [4, Lemma 2.2].
Proof. We note that θ_0 > 0 and proceed by induction, assuming that θ_k is positive. Due to Lemma 3.1, for all iterations k we have that N_k ≤ N̄_{k+1}, and N_k = N̄_{k+1} if and only if N_k = N. First consider the case N_k = N̄_{k+1} (or equivalently N_k = N); then (3.9) holds at θ = θ_k and (3.15) gives θ_{k+1} = θ_k. Let us now consider the case N_k < N̄_{k+1}. If (3.9) holds at θ = θ_k, then (3.15) gives θ_{k+1} = θ_k. Otherwise, since the right-hand side of the resulting inequality is negative by assumption, it follows that θ_{k+1} is the largest value satisfying (3.9) and θ_{k+1} < θ_k.
Let us now prove that θ_{k+1} is bounded below by a positive constant θ_min. By (3.18) and (3.1), the quantity appearing in the updating formula (3.15) is bounded away from zero, so that θ_{k+1} ≥ θ_min > 0, which completes the proof. □

In the following, we derive bounds for the actual reduction Ared_k(x_{k+1}, θ_{k+1}) in case of successful iterations, distinguishing the iteration indexes k according to the subsets

I_1 = {k : N_k < N}, I_2 = {k : N_k = N}.

Note that I_1 and I_2 are disjoint and any iteration index k belongs to exactly one of these subsets.

Suppose k ∈ I_1. In virtue of Lemma 3.3, dividing and multiplying the right-hand side of the resulting bound by δ_k², and applying the inequalities h_min ≤ h(N_k) and δ_k ≤ δ_max, we get (3.24).
Suppose k ∈ I_2. Then N_k = N̄_{k+1} and, by the definition of Pred_k(θ_{k+1}) and Lemma 3.3, we obtain a lower bound on the predicted reduction; therefore (3.11), (3.12) and Lemma 3.3 yield (3.25). □

Let us now define a Lyapunov-type function Φ inspired by the paper [18]. Assumption 3.1 implies that h(N_k) is bounded from above, while Assumption 3.2 implies that f_{N_k}(x) is bounded from below if x ∈ Ω. Thus, there exists a constant Σ such that

Ψ(x_k, N_k, θ_k) + Σ ≥ 0 for all k ≥ 0. (3.26)

Definition 3.5. Let v ∈ (0, 1) be a fixed constant. For all k ≥ 0, we define

φ_k = v (Ψ(x_k, N_k, θ_k) + Σ) + (1 − v) δ_k², (3.27)

where Ψ is the merit function given in (3.2) and Σ is given in (3.26).
The choice of v ∈ (0, 1) in the above definition will be specified below. First, note that φ_k is bounded from below for all k ≥ 0 by (3.28). Second, adding and subtracting suitable terms in the definition (3.27), for all k ≥ 0 we obtain the identity (3.29). If the iteration k is successful, then using (3.26), the monotonicity of {θ_k}_{k∈N} proved in Lemma 3.3, and the fact that N_{k+1} = N^t_{k+1}, equality (3.29) yields (3.30). Otherwise, if the iteration k is unsuccessful, then x_{k+1} = x_k, N_{k+1} = N_k, and thus the first quantity on the right-hand side of equality (3.29) is zero. Hence, using again (3.26) and the monotonicity of {θ_k}_{k∈N}, we obtain (3.31).

The following lemma bounds the decrease of φ_k: i) if the iteration k is unsuccessful, then (3.32) holds; ii) if the iteration k is successful and k ∈ I_1, then (3.33) holds, while if the iteration k is successful and k ∈ I_2, then (3.34) holds.

Proof. i) If iteration k is unsuccessful, the updating rule (3.17) for δ_{k+1} implies δ_{k+1} = δ_k/γ. Thus, equation (3.31) directly yields (3.32).
ii) If iteration k is successful, the updating rule (3.16) for δ_{k+1} implies δ_{k+1} ≤ γ δ_k. Thus, combining (3.30) with Lemma 3.4, we obtain (3.33) and (3.34). □ We are now ready to prove that a sufficient decrease condition holds for Φ along subsequent iterations and that δ_k tends to zero.
In case of successful iterations, χ_2 and χ_3 in (3.33) and (3.34) are both negative if v is chosen suitably; with such a choice of v, the sufficient decrease (3.35) holds.

Proof. Under the stated conditions, Theorem 3.7 holds; summing up (3.35) for j = 0, 1, ..., k − 1, we obtain a telescoping bound on φ_k. Given that, by (3.28), φ_k is bounded from below for all k, we conclude that Σ_{j=0}^∞ δ_j² < ∞, and hence lim_{j→∞} δ_j = 0. □
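The summation step can be made explicit as follows, assuming (3.35) has the sufficient-decrease form φ_{j+1} ≤ φ_j − ρ δ_j² for some constant ρ > 0 (the constant ρ is our notation for the decrease factor):

```latex
\phi_{j+1} \le \phi_j - \rho\,\delta_j^2,\quad j = 0,\dots,k-1
\;\Longrightarrow\;
\phi_k \le \phi_0 - \rho \sum_{j=0}^{k-1}\delta_j^2
\;\Longrightarrow\;
\sum_{j=0}^{k-1}\delta_j^2 \le \frac{\phi_0 - \inf_j \phi_j}{\rho} < \infty .
```

Letting k → ∞ and using the lower bound (3.28) on φ_k gives Σ_j δ_j² < ∞, and hence δ_j → 0.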
3.2. Complexity analysis. Algorithm 3.1 generates a random process, since the function estimates f_{N^t_{k+1}}(x_k) in (3.7) and the gradient estimates g_k in (3.6) are random. All random quantities are denoted by capital letters, while small letters are reserved for their realizations. In particular, the iterates X_k, the trust-region radius ∆_k, the gradient estimates G_k and ∇f_{N^t_{k+1}}(X_k), and the value Φ_k of the function Φ in (3.27) at iteration k are random variables, while x_k, δ_k, g_k and φ_k are their realizations.
We denote by P_{k−1}(·) and E_{k−1}(·) the probability and the expected value conditioned on the past up to iteration k − 1.
In this section, our aim is to derive a bound on the expected number of iterations that Algorithm 3.1 takes to reach a desired accuracy. We show that our algorithm fits into the stochastic framework given in [11, §2] and consequently derive an upper bound on the expected value of the hitting time K_ε defined below.

Definition 3.9. Given ε > 0, the hitting time K_ε is the random variable

K_ε = min{k ≥ 0 : ‖∇f_N(X_k)‖ ≤ ε}.

Our analysis relies on the assumption that g_k and ∇f_{N^t_{k+1}}(x_k) are probabilistically accurate estimators of the true gradient at x_k, in the sense that the events

G_{k,1} = {‖G_k − ∇f_N(X_k)‖ ≤ ν_1 ∆_k}, G_{k,2} = {‖∇f_{N^t_{k+1}}(X_k) − ∇f_N(X_k)‖ ≤ ν_2 ∆_k}, (3.38)-(3.39)

with ν_1, ν_2 > 0, are true at least with probability π_1 ∈ (0, 1) and π_2 ∈ (0, 1), respectively. Using the same terminology as in [2,15], we say that iteration k is true if both G_{k,1} and G_{k,2} are true. Furthermore, we introduce two auxiliary random variables defined via the indicator function 1_A of an event A.
Finally, we need the following additional assumptions.

Assumption 3.3. The gradients ∇φ_i are Lipschitz continuous with constants L_i, 1 ≤ i ≤ N. Let L = (1/2) max_{1≤i≤N} L_i.

Assumption 3.4. There exists g_max > 0 such that ‖∇φ_i(x)‖ ≤ g_max for all x ∈ Ω, 1 ≤ i ≤ N.

We observe that the loss functions mentioned in Remark 3.2 satisfy Assumption 3.4.
First, we analyze the occurrence of successful iterations and show that the availability of accurate gradients has an impact on the acceptance of the trial steps. The following lemma establishes that, if the iteration k is true and δ_k is smaller than a certain threshold, then the iteration is successful. The analysis is presented for a single realization of Algorithm 3.1 and is specialized for k in the sets I_1, I_2.

Lemma 3.10. Let Assumptions 3.1-3.4 hold and suppose that iteration k is true. i) If k ∈ I_1, then the iteration is successful whenever δ_k falls below a threshold depending on ‖g_k‖; ii) if k ∈ I_2, then the iteration is successful whenever (3.42) holds.
ii) Using (3.8) and (3.10) with k ∈ I_2, and combining the resulting inequality with (3.12), we have proved that the iteration is successful whenever (3.42) holds. □

We can now guarantee that a successful iteration k occurs whenever k is true, the prefixed accuracy ε in Definition 3.9 has not been achieved at iteration k, and δ_k is below a certain threshold depending on ε. Again, the result is stated for a single realization of the algorithm.
Proof. By ‖∇f_N(x_k)‖ > ε, the occurrence of G_{k,1}, and (3.49), we have ‖g_k‖ ≥ ‖∇f_N(x_k)‖ − ‖g_k − ∇f_N(x_k)‖, and this yields ‖g_k‖ ≥ ε/2. Then, Lemma 3.10 implies that iteration k is successful. □
We now proceed similarly to [11, §2] and analyze the random process {(Φ_k, ∆_k, W_k)}_{k∈N} generated by Algorithm 3.1, where Φ_k is the random variable whose realization is given in (3.27) and W_k is a random variable taking values ±1. Then, we can prove the following result.
ii) Let us set ξ as in (3.55) and assume that δ = γ^j δ_0 for some integer j ≤ 0; notice that we can always choose ξ sufficiently large so that this is true. As a consequence, ∆_k = γ^{i_k} δ for some integer i_k.
Theorem 3.13. Under the assumptions of Lemma 3.12, the expected hitting time E(K_ε) is bounded above, where ξ is chosen as in (3.55) and σ is given in (3.37).
Proof. The claim follows directly from [11, Theorem 2]. □

Remark 3.14. The requirement that (3.38) and (3.39) hold in probability is less stringent than the overall conditions (2.2) and (2.3). Analogously to the discussion in Section 2, if upper bounds on the variances of the sampled estimators are available for i = 1, ..., N, then the Chebyshev inequality guarantees that events (3.38) and (3.39) hold in probability. Clearly, min{N_{k+1,g}, N^t_{k+1}} = O(δ_k^{−2}), and in general these sample sizes are expected to grow more slowly than in (2.5).
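The Chebyshev argument behind this remark can be sketched as follows (our notation: ν denotes the accuracy constant in (3.38) and V_g a bound on the variance of a single sampled gradient):

```latex
\mathbb{P}\!\left(\|G_k-\nabla f_N(X_k)\| \ge \nu\,\Delta_k\right)
\;\le\; \frac{V_g}{N_{k+1,g}\,\nu^2\,\Delta_k^2}
\;\le\; 1-\pi_1
\quad\Longleftarrow\quad
N_{k+1,g} \;\ge\; \frac{V_g}{(1-\pi_1)\,\nu^2\,\Delta_k^2}
\;=\; O(\Delta_k^{-2}).
```

Since only gradient-type accuracy of order ∆_k is required, the δ^{-4} growth needed for the function estimates in (2.5) is avoided.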
Finally, the complexity theory presented here improves on [4], where the iteration complexity before reaching full precision, M = N in (2.6), is estimated, and thereafter existing iteration complexity results for trust-region methods applied to (1.1) are invoked.
4. Numerical experience.In this section, we evaluate the numerical performance of SIRTR on some nonconvex optimization problems arising in binary classification and regression.
All the numerical results have been obtained by running MATLAB R2019a on an Intel Core i7-4510U CPU 2.00-2.60 GHz with 8 GB of RAM. For all our tests, we equip SIRTR with δ_0 = 1 as the initial trust-region radius, δ_max = 100, γ = 2, η = 10^{-1}, η_2 = 10^{-6}. Concerning the inexact restoration phase, we borrow the implementation details from [4]; in particular, the infeasibility measure is h(M) = (N − M)/N and the initial penalty parameter θ_0 is chosen as in [4]. The updating rule for choosing the reference sample size has the form

N̄_{k+1} = min{N, ⌈c N_k⌉}, (4.1)

where 1 < c < 2 is a prefixed constant factor; note that this choice satisfies (3.13) with r = (N − (c − 1))/N. At Step 2, the function sample size N^t_{k+1} is computed using a rule (4.2) that depends on a parameter µ > 0 and reduces to N^t_{k+1} = N̄_{k+1} when δ_k is sufficiently small. Once the set I_{N^t_{k+1}} is fixed, the search direction g_k ∈ IR^n is computed via sampling as in (3.6), and the gradient sample size is fixed as

N_{k+1,g} = ⌈c̃ N^t_{k+1}⌉, (4.3)

with c̃ ∈ (0, 1] and I_{N_{k+1,g}} ⊆ I_{N^t_{k+1}}.

4.1. SIRTR performance. In the following, we show the numerical behaviour of SIRTR on nonconvex binary classification problems. Let {(a_i, b_i)}_{i=1}^N denote the pairs forming a training set, with a_i ∈ IR^n containing the features of the i-th example and b_i ∈ {0, 1} representing the corresponding label. Then, we address the minimization problem

min_{x ∈ IR^n} f_N(x) = (1/N) Σ_{i=1}^N (b_i − σ(a_i^T x))², σ(z) = 1/(1 + e^{−z}), (4.4)

where the nonconvex objective function f_N is obtained by composing a least-squares loss with the sigmoid function.
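The scheduling rules above can be sketched in Python as follows; rules (4.1) and (4.3) follow the text, while the form shown for rule (4.2), with parameter mu, is a hypothetical reconstruction consistent with the behaviour described later in this section (the trial size may drop below the reference size when δ_k is large, and coincides with it when δ_k is small):

```python
import math

def reference_size(N_prev, N, c=1.2):
    """Rule (4.1): geometric growth of the reference sample size,
    with 1 < c < 2, capped at N."""
    return min(N, math.ceil(c * N_prev))

def trial_size(N_ref, delta, N, mu=0.01):
    """Hypothetical form of rule (4.2): allow N_t < N_ref for a large
    radius, forcing N_t = N_ref for a small one (names illustrative)."""
    return min(N_ref, max(1, math.ceil(N_ref - mu * delta ** 2 * N)))

def gradient_size(N_t, c_tilde=0.1):
    """Rule (4.3): gradient sample size as a fraction c_tilde in (0, 1]
    of the function sample size."""
    return max(1, math.ceil(c_tilde * N_t))

N = 10 ** 4
N_ref = reference_size(1000, N)            # ceil(1.2 * 1000) = 1200
N_t_small = trial_size(N_ref, 1e-3, N)     # small radius: N_t = N_ref
N_t_large = trial_size(N_ref, 1.0, N)      # large radius: N_t < N_ref
```

Choosing the gradient sample inside the function sample (I_{N_{k+1,g}} ⊆ I_{N^t_{k+1}}) lets the forward passes be reused for the gradient evaluation.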
In Table 4.1, we report the information related to the datasets employed, including the number N of training examples, the dimension n of each example and the dimension N T of the testing set I N T .
We focus on three aspects: the classification error provided by the final iterate, the computational cost, and the occurrence of termination before full accuracy in function evaluations is reached. The last issue is crucial because it indicates the ability of the inexact restoration approach to solve (4.4) with random models and to rule sampling and steplength selection.

For reference, Table 4.1 includes, among others, the following entries (dataset, N, n, N_T): mnist [16] (60000, 784, 10000), phishing [16] (7739, 68, 3316), real-sim [16] (50616, 20958, 21693), w7a [16] (17284, 300, 7408), w8a [16] (34824, 300, 14925).

The average classification error provided by the final iterate, say x_fin, is defined as

err = (1/N_T) Σ_{i∈I_{N_T}} |b_pred_i − b_i|, (4.5)

where b_i is the exact label of the i-th instance of the testing set, and b_pred_i is the corresponding predicted label, given by b_pred_i = max{sign(a_i^T x_fin), 0}. The computational cost is measured in terms of full function and gradient evaluations. In our test problems, the main cost in the computation of φ_i, 1 ≤ i ≤ N, is the scalar product a_i^T x: once this product is evaluated, it can be reused for computing ∇φ_i. Nonetheless, following [36, Section 3.3], we count both function and gradient evaluations as if we were addressing a classification problem based on a neural net. Thus, computing a single function φ_i requires 1/N forward propagations, whereas the gradient evaluation corresponds to 2/N propagations (an additional backward propagation is needed). Note that, once φ_i is computed, the corresponding gradient ∇φ_i requires only 1/N backward propagations. Hence, as in our implementation I_{N_{k+1,g}} ⊆ I_{N^t_{k+1}}, the computational cost of SIRTR at each iteration k is determined by (N^t_{k+1} + N_{k+1,g})/N propagations.

For all experiments in this section, we run SIRTR with x_0 = (0, 0, ..., 0)^T as initial guess, and stop it when either a maximum of 1000 iterations is reached, or a maximum of 500 full function evaluations is performed, or the condition ‖g_k‖ ≤ ε, with ε = 10^{-3}, holds for a number of consecutive successful iterations such that the computational effort equals the effort of three iterations with full function and gradient evaluations. Since the selection of the sets I_{N^t_{k+1}} and I_{N_{k+1,g}} for computing f_{N^t_{k+1}}(x_k) and g_k is random, we perform 50 runs of SIRTR for each test problem. Results are reported in tables where the headings of the columns have the following meaning: cost is the overall number of full function and gradient evaluations averaged over the 50 runs, err is the classification error given in (4.5) averaged over the 50 runs, and sub is the number of runs where the method is stopped before reaching full accuracy in function evaluations.
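A minimal sketch of the per-example loss in (4.4) and the classification error (4.5) (pure Python; the data are toy values and all names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def phi(a, b, x):
    """Per-example nonconvex loss in (4.4): squared error of the sigmoid
    output against the 0/1 label."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    return (b - sigmoid(z)) ** 2

def predict(a, x):
    """Predicted label b_pred = max{sign(a^T x), 0} in {0, 1}."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    return 1 if z > 0 else 0

def classification_error(A, b, x):
    """Average error (4.5): fraction of mislabeled testing examples."""
    return sum(abs(predict(a, x) - bi) for a, bi in zip(A, b)) / len(b)

A = [[1.0, 0.0], [-1.0, 0.5], [2.0, -1.0]]   # toy testing set
b = [1, 0, 1]
err = classification_error(A, b, [1.0, 0.0])
```

Since σ(z) ≥ 1/2 exactly when z ≥ 0, thresholding a_i^T x at zero is the natural 0/1 decision rule for this model.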
In a first set of experiments, we investigate the choice of N_{k+1,g} by varying the factor c̃ ∈ (0, 1] in (4.3). In particular, letting c = 1.2 in (4.1), µ = 100/N in (4.2), and N_0 = ⌈0.1N⌉ as in [4], we test the values c̃ ∈ {0.1, 0.2, 1}. The results obtained are reported in Table 4.2. We note that the classification error varies only slightly with the choice of N_{k+1,g}, and that selecting N_{k+1,g} as a small fraction of N^t_{k+1} is quite convenient from a computational point of view. By contrast, the choice N_{k+1,g} = N^t_{k+1} leads to the largest computational costs without providing a significant gain in accuracy. Besides the cost per iteration, equal to 2N^t_{k+1}/N in this latter case, we observe that full accuracy in function evaluations is reached very often for certain datasets, see e.g., cina0, cod-rna, covertype, ijcnn1, phishing, real-sim. Remarkably, the results in Table 4.2 highlight that random models compare favourably with respect to cost and classification error.
Next, we show that the SIRTR computational cost can be reduced by slowing down the growth rate of N^t_{k+1}. This can be achieved by controlling the growth of N̄_{k+1}, which affects N^t_{k+1} by means of (4.2). Letting c̃ = 0.1, µ = 100/N and N_0 = ⌈0.1N⌉, we consider the choices c ∈ {1.05, 1.1, 1.2} in (4.1). Average results are reported in Table 4.3. We observe that the fastest growth rate for N̄_{k+1} is generally more expensive than the other two choices, while the classification error is similar for all three. Moreover, for c = 1.05 most runs stop significantly before reaching full function accuracy.
We now analyze three different values, N_0 ∈ {⌈0.001N⌉, ⌈0.01N⌉, ⌈0.1N⌉}, for the initial sample size N_0. We apply SIRTR with c = 1.05 in (4.1), µ = 100/N in (4.2), and c̃ = 0.1 in (4.3). Results are reported in Table 4.4. We see that, by reducing N_0, the number of full function/gradient evaluations can be further reduced on some datasets, and that for N_0 = ⌈0.01N⌉ the average classification error compares well with the error obtained for N_0 = ⌈0.1N⌉; for instance, the best results for most datasets are obtained by shrinking N_0 to 1% of the maximum sample size. We conclude by pointing out that most of the runs terminate without reaching full precision in function evaluations. As a further confirmation of the efficiency of SIRTR, in Table 4.5 we report the sample sizes obtained on average at the stopping iteration of SIRTR with parameter setting N_0 = ⌈0.01N⌉, N_{k+1,g} = ⌈0.1 N^t_{k+1}⌉, N̄_{k+1} = min{N, ⌈1.05 N_k⌉}, µ = 100/N. More specifically, for each dataset, we show the mean value N_fin obtained by averaging the sample sizes N_{fin,i}, 1 ≤ i ≤ 50, used at the final iteration of SIRTR, the relative standard deviation as a measure of dispersion of the final sample sizes with respect to the mean value, and the minimum and maximum sample sizes N_fin^min, N_fin^max observed at the final iteration over the 50 runs. From the reported values, we deduce that SIRTR terminates with a final sample size which is, on average, much smaller than the maximum sample size N.
Finally, in Figures 4.1-4.2, we report the plots of the sample sizes N^t_{k+1} and N̄_{k+1} with respect to the number of iterations, obtained by running SIRTR on the a9a and mnist datasets, respectively. In particular, we let either µ = 100/N or µ = 1 in the update rule (4.2), c = 1.05 in (4.1), c̃ = 0.1 in (4.3), and N_0 = ⌈0.1N⌉. Note that a larger µ allows both N^t_{k+1} and N̄_{k+1} to decrease in the first iterations, whereas a linear growth rate is imposed only in later iterations. This behaviour is due to the update condition (4.2), which naturally forces N^t_{k+1} to coincide with N̄_{k+1} when δ_k is sufficiently small. For both choices of µ, we see that N^t_{k+1} can grow more slowly than N̄_{k+1} at some iterations, thus reducing the computational cost per iteration of SIRTR.

Comparison with TRish.
In this section we compare the performance of SIRTR with the so-called Trust-Region-ish algorithm (TRish) recently proposed in [20]. TRish is a stochastic gradient method based on a trust-region methodology: normalized steps are taken in a dynamic manner whenever the norm of the stochastic gradient lies within a prefixed interval. In particular, the k-th iteration of TRish takes the form

  x_{k+1} = x_k − γ_1 α_k g_k,        if ‖g_k‖ ∈ [0, 1/γ_1),
  x_{k+1} = x_k − α_k g_k / ‖g_k‖,    if ‖g_k‖ ∈ [1/γ_1, 1/γ_2],
  x_{k+1} = x_k − γ_2 α_k g_k,        if ‖g_k‖ ∈ (1/γ_2, ∞),

where g_k is the stochastic gradient, α_k > 0 is the steplength, and γ_1 ≥ γ_2 > 0 are fixed constants. Below, h(·; x) : IR^n → IR denotes the nonlinear prediction function entering the regression problem (4.7).
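As an illustration, the TRish step rule as we recall it from [20] can be sketched as follows; this is not the authors' code, and the function name and argument layout are our own.

```python
import numpy as np

def trish_step(x, g, alpha, gamma1, gamma2):
    """One TRish-style iteration (sketch): a scaled stochastic gradient
    step, with normalization when ||g|| falls in [1/gamma1, 1/gamma2].
    Requires gamma1 >= gamma2 > 0."""
    norm_g = np.linalg.norm(g)
    if norm_g < 1.0 / gamma1:
        return x - gamma1 * alpha * g        # small gradient: scale up
    elif norm_g <= 1.0 / gamma2:
        return x - alpha * g / norm_g        # intermediate: normalized step
    else:
        return x - gamma2 * alpha * g        # large gradient: scale down
```

The normalized middle branch mimics a trust-region step of radius α_k, which is why the method is only trust-region-"ish": no model is minimized and no acceptance test is performed.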
For this second test, we use the air dataset [29], which contains 9358 instances of (hourly averaged) concentrations of polluting gases, as well as temperatures and relative/absolute air humidity levels, recorded hourly in the period March 2004 - February 2005 by a device located in a polluted area within an Italian city.
As in [22], our goal is to predict the benzene (C6H6) concentration from the knowledge of n = 7 features, including carbon monoxide (CO), nitrogen oxides (NO_x), ozone (O_3), non-methane hydrocarbons (NMHC), nitrogen dioxide (NO_2), air temperature, and relative air humidity. First, we preprocess the dataset by removing the examples for which the benzene concentration is missing, reducing the dataset dimension from 9357 to 8991. Then we employ 70% of the dataset for training (N = 6294) and the remaining 30% for testing (N_T = 2697). Since the concentration values have been recorded hourly, this means that we use the data measured in the first 9 months for the training phase and the data related to the last 3 months for the testing phase. We apply SIRTR and TRish to problem (4.7), where the prediction function h(·; x) is chosen as a feed-forward neural network based on a 7 × 5 × 1 architecture (see [22] and references therein), with the two hidden layers both equipped with the linear activation function and the output layer with the sigmoid activation function. We equip the two algorithms with the same parameter values employed in the previous tests, and run them 10 times for 10 epochs, using a random initial guess in the interval [−1/2, 1/2]. In Figure 4.4 we report the decrease of the (average) training and testing losses provided by SIRTR and by TRish with different choices of the steplength α, whereas in Figure 4.5 we show the benzene concentration estimates provided by the algorithms against the true concentration. These results confirm that the performance of SIRTR is comparable with that of TRish equipped with the best choice of the steplength, and show the ability of SIRTR to automatically tune the steplength so as to obtain satisfactory results in terms of training and testing accuracy.
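The prediction function used in this test can be sketched as below. We read "7 × 5 × 1" as hidden layers of sizes 7 and 5 followed by a single sigmoid output on the n = 7 input features; this reading, as well as the weight shapes and the uniform initialization, are our assumptions, not the authors' code.

```python
import numpy as np

def h(a, params):
    """Sketch of the feed-forward prediction function h(.; x): two hidden
    layers with linear activation, sigmoid output."""
    (W1, b1), (W2, b2), (W3, b3) = params
    z1 = W1 @ a + b1                     # first hidden layer (7 units, linear)
    z2 = W2 @ z1 + b2                    # second hidden layer (5 units, linear)
    z3 = W3 @ z2 + b3                    # output layer (1 unit)
    return 1.0 / (1.0 + np.exp(-z3))     # sigmoid activation

def init_params(rng, lo=-0.5, hi=0.5):
    """Random initial guess in [-1/2, 1/2], as in the experiment."""
    shapes = [(7, 7), (7,), (5, 7), (5,), (1, 5), (1,)]
    W1, b1, W2, b2, W3, b3 = (rng.uniform(lo, hi, s) for s in shapes)
    return (W1, b1), (W2, b2), (W3, b3)
```

Note that with linear hidden activations the two hidden layers collapse to a single affine map, so the model is effectively a logistic-type regression on the 7 features; the layered form is kept here to match the architecture described in [22].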

5. Conclusions. We proposed a stochastic gradient method coupled with a trust-region strategy and an inexact restoration approach for solving finite-sum minimization problems. Functions and gradients are subsampled, and the batch size is governed by the inexact restoration approach and the trust-region acceptance rule. We established the theoretical properties of the method and gave a worst-case complexity result on the expected number of iterations required to reach an approximate first-order optimality point. Numerical experience shows that the proposed method provides good results while keeping the overall computational cost relatively low.
Data Availability. The dataset CINA0 is no longer available in public repositories but can be obtained from the corresponding author on reasonable request. The other datasets

Algorithm 3.1: The Stochastic IRTR algorithm.

at Step 1. Now consider a generic iteration k ≥ 1 and suppose that (3.18) holds for k − 1. If iteration k − 1 is successful, then condition (3.18) is enforced for iteration k at Step 1. If iteration k − 1 is unsuccessful, then at Step 5 we set N_k = N_{k−1}. Successively, at Step 1 of iteration k, we set N_{k+1} = N_k. Since (3.18) holds by induction at iteration k − 1, we have h(N_k) ≤ r h(N_{k−1}), which can be rewritten as h(N_{k+1}) ≤ r h(N_k) due to the previous assignments at Step 5 and Step 1. Then condition (3.18) holds also at iteration k.

Set iflag = unsucc and go to Step 1.

Now we provide bounds for the change of Φ along subsequent iterations and again distinguish the two cases k ∈ I_1, I_2 stated in (3.22)-(3.23).

Lemma 3.6. Let Assumptions 3.1-3.2 hold. i) If the iteration k is unsuccessful, then
Indeed, for any realization such that δ_k > δ, we have δ_k ≥ γδ and, because of Step 5, it follows that δ_{k+1} ≥ δ. Now let us consider a realization such that δ_k ≤ δ. Since K > k and δ ≤ δ†, if I_k J_k = 1 (i.e., k is true), then we can apply Lemma 3.11 and conclude that k is successful. Hence, by Step 5, we have δ_{k+1} = min{δ_max, γδ_k}. If I_k J_k = 0, then we cannot guarantee that k is successful; however, again using Step 5, we can write δ_{k+1} ≥ γ^{−1} δ_k. Combining these two cases, we get (3.56). If we observe that

Table 4.2. Results with three different rules for computing the sample size N_{k+1,g}.

Table 4.3. Results with three different rules for computing the sample size N_{k+1}.

Table 4.4. Results with three different initial sample sizes N_0.

Table 4.5. Average sample size N_fin obtained at the final iteration, relative standard deviation s, minimum and maximum sample sizes N_fin^min, N_fin^max observed at the final iteration. Parameter setting: N_0