Bolstering Stochastic Gradient Descent with Model Building

Abstract: The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the step length. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the step length but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide a convergence rate analysis, and we experimentally show that the proposed algorithm achieves faster convergence and better generalization on well-known test problems. In particular, SMB requires less tuning and shows comparable performance to other adaptive methods.

Stochastic gradient descent (SGD) is a stochastic-approximation type optimization algorithm with several variants and a well-studied theory (Tadić, 1997; Chen et al., 2023). It is a popular choice for machine learning applications; in practice, it can achieve fast convergence when its stepsize and its scheduling are tuned well for the specific application at hand. However, this tuning procedure can take up to thousands of CPU/GPU days, resulting in large energy costs (Asi and Duchi, 2019). A number of researchers have studied adaptive strategies for improving the direction and the step length choices of the stochastic gradient descent algorithm. Adaptive sample size selection ideas (Byrd et al., 2012; Balles et al., 2017; Bollapragada et al., 2018) improve the direction by reducing its variance around the negative gradient of the empirical loss function, while stochastic quasi-Newton algorithms (Byrd et al., 2016; Wang et al., 2017) provide adaptive preconditioning. Recently, several stochastic line search approaches have been proposed. Not surprisingly, some of these works cover sample size selection as a component of the proposed line search algorithms (Balles et al., 2017; Paquette and Scheinberg, 2020).
The Stochastic Model Building (SMB) algorithm proposed in this paper is not designed as a stochastic quasi-Newton algorithm in the sense explained by Bottou et al. (2018). However, it still produces a scaling matrix in the process of generating trial points, and its overall step at each outer iteration can be written in the form of a matrix-vector multiplication. Unlike the algorithms proposed by Mokhtari and Ribeiro (2014) and Schraudolph et al. (2007), we do not accumulate curvature pairs over several iterations. Since there is no memory carried from earlier iterations, the scaling matrix in each past iteration is based only on the data samples employed in that iteration. In other words, the scaling matrix and the incumbent stochastic gradient vector are dependent. That being said, we also provide a version (SMBi), where the matrix and gradient vector in question become independent (see Algorithm 2). Vaswani et al. (2019) apply a deterministic globalization procedure on mini-batch loss functions. That is, the same sample is used in all function and gradient evaluations needed to apply the line search procedure at a given iteration. However, unlike our case, they employ a standard line search procedure that does not alter the search direction. They establish convergence guarantees for the empirical loss function under the interpolation assumption, which requires each component loss function to have zero gradient at a minimizer of the empirical loss. Mutschler and Zell (2020) assume that the optimal learning rate (i.e., step length) along the negative batch gradient is a good estimator for the optimal learning rate with respect to the empirical loss along the same direction. They test the validity of this assumption empirically on deep neural networks (DNNs). Rather than making such strong assumptions, we stick to the general theory for stochastic quasi-Newton methods.
Other works follow a different approach to translate deterministic line search procedures into a stochastic setting, and they do not employ fixed samples. In Mahsereci and Hennig (2017), a probabilistic model along the search direction is constructed via techniques from Bayesian optimization. Learning rates are chosen to maximize the expected improvement with respect to this model and the probability of satisfying the Wolfe conditions. Paquette and Scheinberg (2020) suggest an algorithm closer to the deterministic counterpart, where the convergence analysis is based on the requirement that the stochastic function and gradient evaluations approximate their true values with a high enough probability.
Finally, we should mention that the finite-sum minimization problem is a special case of the general expected value minimization problem, for which certain modification ideas for SGD regarding the selection of the search direction and the step length can be applicable. One such idea is gradient aggregation, which adds to the search direction of SGD a variance-reducing component obtained via stochastic gradient evaluations at previous iterates (Roux et al., 2012; Defazio et al., 2014). In Malinovsky et al. (2022), an aggregated-gradient-type step is produced in a distributed setting where the overall step is obtained by employing step lengths at two levels. Another idea is to use an extended step length control strategy depending on the objective value and the norm of the computed direction, which might occasionally set the step length to zero (Liuzzi et al., 2022). However, it is not clear how these ideas can be extended to the more general case of expected value minimization.
With our current work, we make the following contributions. We use a model building strategy for adjusting the step length and the direction of the stochastic gradient step. This approach also permits us to work on subsets of parameters. This feature makes our model steps not only adaptive, but also easy to incorporate into existing implementations of DNNs. Our method changes the direction of the step as well as its length. This property separates our approach from backtracking line search algorithms. It also incorporates the most recent curvature information from the current point. This is in contrast with stochastic quasi-Newton methods, which use information from previous steps. Capitalizing on our discussion about the independence of the sample batches, we also give a convergence analysis for SMB. Finally, we illustrate the computational performance of our method with a set of numerical experiments and compare the results against those obtained with other well-known methods.
1. Stochastic Model Building.

We introduce a new stochastic unconstrained optimization algorithm in order to approximately solve problems of the form

min_{x ∈ R^n} f(x) = E[F(x, ξ)],    (1)

where F : R^n × R^d → R is continuously differentiable and possibly nonconvex, ξ ∈ R^d denotes a random variable, and E[·] stands for the expectation taken with respect to ξ. We assume the existence of a stochastic first-order oracle which outputs a stochastic gradient g(x, ξ) of f for a given x. A common approach to tackle (1) is to solve the empirical risk problem

min_{x ∈ R^n} f(x) = (1/N) ∑_{i=1}^{N} f_i(x),

where f_i : R^n → R is the loss function corresponding to the ith data sample, and N denotes the data sample size, which can be very large in modern applications.
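To make the setting concrete, the following minimal sketch (our own illustration; the data, the least-squares loss, and the name stochastic_oracle are assumptions, not part of the original text) implements a mini-batch stochastic first-order oracle for an empirical risk of the above form.

```python
import numpy as np

# Hedged illustration of a stochastic first-order oracle for the empirical risk
# f(x) = (1/N) sum_i f_i(x), here with a placeholder least-squares loss
# f_i(x) = 0.5 * (a_i^T x - b_i)^2. Names and data are assumptions for illustration.
rng = np.random.default_rng(0)
N, n = 1000, 20
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)

def stochastic_oracle(x, batch_size=128):
    """Return the mini-batch loss f_k(x) and stochastic gradient g(x, xi_k)."""
    idx = rng.choice(N, size=batch_size, replace=False)  # realization of xi_k
    r = A[idx] @ x - b[idx]
    f_val = 0.5 * np.mean(r ** 2)
    g = A[idx].T @ r / batch_size
    return f_val, g
```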
As an alternative approach to line search for SGD, we propose a stochastic model building strategy inspired by the work of Öztoprak and Birbil (2018). Unlike core SGD methods, our approach aims at including curvature information that adjusts not only the step length but also the search direction. Öztoprak and Birbil (2018) consider only the deterministic setting, and they apply the model building strategy repeatedly until a sufficient descent is achieved. In our stochastic setting, however, we have observed experimentally that using multiple model steps does not improve the performance much, and its cost to the runtime can be extremely high in large-scale (e.g., deep learning) problems. Therefore, if sufficient descent is not achieved by the stochastic gradient step, then we construct only one model to adjust the length and the direction of the step.
Conventional stochastic quasi-Newton methods adjust the gradient direction by a scaling matrix that is constructed from the information of previous steps. Our model building approach, however, uses the most recent curvature information around the latest iterate. In popular deep learning implementations, model parameters come in groups, and updates are applied to each parameter group separately. Therefore, we also propose to build a model for each parameter group separately, making the step lengths adaptive.
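As an aside, this parameter-group structure is already exposed by common deep learning frameworks. The skeleton below (our own sketch; the class name is hypothetical and the update shown is plain SGD as a placeholder rather than the SMB update) indicates where a per-group model computation would live in a PyTorch-style optimizer.

```python
import torch

class GroupwiseOptimizer(torch.optim.Optimizer):
    """Hedged skeleton: one update rule applied independently to each parameter
    group (layer of tensors). The update below is plain SGD as a placeholder;
    SMB would instead build its per-group model here."""

    def __init__(self, params, lr=0.5):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:          # iterate over parameter groups
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])   # placeholder per-group update
        return loss
```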
The proposed iterative algorithm SMB works as follows: At step k, given the iterate x_k, we calculate the stochastic function value

f_k = f_k(x_k) := (1/m_k) ∑_{i=1}^{m_k} F(x_k, ξ_{k,i}),

where m_k is the batch size, and ξ_k = (ξ_{k,1}, . . ., ξ_{k,m_k}) is the realization of the random vector ξ, along with the corresponding stochastic gradient g_k = g(x_k, ξ_k). Then, we apply the SGD update to calculate the trial step s^t_k = −α_k g_k, where {α_k}_k is a sequence of learning rates. With this trial step, we also calculate the function and gradient values at the trial point x^t_k = x_k + s^t_k, namely f^t_k and g^t_k, and check the stochastic Armijo condition

f^t_k ≤ f_k − c α_k ∥g_k∥²,

where c > 0 is a hyper-parameter. If this condition is satisfied and we achieve sufficient decrease, then we set x_{k+1} = x^t_k as the next iterate. If the Armijo condition is not satisfied, following Öztoprak and Birbil (2018), we build a quadratic model using the linear models at the points x_{k,p} and x^t_{k,p} for each parameter group p, and find the step s_{k,p} that reaches its minimum point. Here, x_{k,p} and x^t_{k,p} denote respectively the coordinates of x_k and x^t_k that correspond to the parameter group p. We then calculate the next iterate x_{k+1} = x_k + s_k, where s_k = (s_{k,p_1}, . . ., s_{k,p_r}) and r is the number of parameter groups, and proceed to the next step with x_{k+1}. This model step, if needed, requires extra mini-batch function and gradient evaluations (a forward and a backward pass in deep neural networks).
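A schematic sketch of one such iteration is given below (our own illustration; oracle and model_step are hypothetical helpers, and the per-group model computation is described in the next paragraphs).

```python
import numpy as np

def smb_iteration(x, alpha, c, oracle, model_step, groups):
    """Hedged sketch of one SMB outer iteration. `oracle(x)` returns the loss and
    gradient on the mini-batch xi_k, held fixed within this iteration; `model_step`
    is a hypothetical helper computing s_{k,p} for one parameter group; `groups`
    is a list of index arrays defining the parameter groups."""
    f_k, g_k = oracle(x)                                 # f_k and g_k at x_k
    x_t = x - alpha * g_k                                # trial point x_k^t = x_k - alpha_k g_k
    f_t, g_t = oracle(x_t)                               # same batch at the trial point
    if f_t <= f_k - c * alpha * np.dot(g_k, g_k):        # stochastic Armijo condition
        return x_t                                       # sufficient decrease: accept x_k^t
    s = np.empty_like(x)
    for p in groups:                                     # one model per parameter group
        s[p] = model_step(g_k[p], g_t[p], alpha)
    return x + s                                         # x_{k+1} = x_k + s_k
```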
For each parameter group p ∈ {p_1, . . ., p_r}, the quadratic model is built by combining the linear models at x_{k,p} and x^t_{k,p}, given respectively by

l_{k,p}(x) = f_k + g_{k,p}^⊤ (x − x_{k,p})   and   l^t_{k,p}(x) = f^t_k + (g^t_{k,p})^⊤ (x − x^t_{k,p}).

The quadratic model is then obtained by combining these two linear models, and a constraint is also imposed so that the minimum is attained in the region bounded by x_{k,p} and x^t_{k,p}. This constraint acts like a trust region. Figure 1 shows the steps of this construction.
In this work, we solve a relaxation of this constrained model as explained in (Öztoprak and Birbil, 2018, Section 2.2), where one can find the full derivation of the approximate solution of the constrained problem. The minimum value of the relaxed model is attained at the point x_{k,p} + s_{k,p}, where the model step s_{k,p} is a linear combination of g_{k,p} and y_{k,p} := g^t_{k,p} − g_{k,p}, with coefficients that depend on these vectors and on a constant 0 < η < 1. Then, the adaptive model step becomes s_k = (s_{k,p_1}, . . ., s_{k,p_r}). We note that our construction in terms of different parameter groups lends itself to constructing a different model for each parameter subspace.
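The appendix shows that, when a model step is taken, it can be written as s_{k,p} = −α_k H_{k,p} g_{k,p} with H_{k,p} = B_{k,p}^{-1}. The sketch below (our own illustration, assuming that quasi-Newton form and the definition of σ_p given in the appendix; it is not the closed-form implementation) computes the model step for a single parameter group.

```python
import numpy as np

def model_step(g, g_t, alpha, eta=0.2):
    """Hedged sketch of the per-group model step in the quasi-Newton form
    s_{k,p} = -alpha_k * H_{k,p} g_{k,p}, with H_{k,p} = B_{k,p}^{-1} as in the appendix.
    g, g_t: stochastic gradients at x_{k,p} and x_{k,p}^t (same batch); 0 < eta < 1."""
    y = g_t - g                                          # y_{k,p} = g^t_{k,p} - g_{k,p}
    gn, yn = np.linalg.norm(g), np.linalg.norm(y)
    sigma = gn * yn + gn**2 / eta + y @ g                # sigma_p from the appendix
    B = (sigma * np.eye(g.size) - np.outer(g, y) - np.outer(y, g)) / gn**2
    return -alpha * np.linalg.solve(B, g)                # -alpha_k * B_{k,p}^{-1} g_{k,p}
```

In practice one would use the closed-form coefficients (obtained via the Sherman-Morrison formula) rather than forming and solving with B explicitly; the dense version above is only meant to make the structure of the step visible.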
Figure 1: An iteration of SMB on a simple quadratic function. We assume for simplicity that there is only one parameter group, and hence, we drop the subscript p. The algorithm first computes the trial point x^t_k by taking the (stochastic) gradient step s^t_k. If this point is not acceptable, then it builds a model using the information at x_k and x^t_k, and computes the next iterate x_{k+1} = x_k + s_k. Note that s_k not only has a smaller length compared to the trial step s^t_k, but it also lies along a direction that decreases the function value.
We summarize the steps of SMB in Algorithm 1. Line 5 shows the trial point, which is obtained with the standard stochastic gradient step. If this step satisfies the stochastic Armijo condition, then we proceed with the next iteration (line 8). Otherwise, we continue with building the models for each parameter group (lines 11-13), and move to the next iteration with the model building step in line 14.
Algorithm 1: SMB: Stochastic Model Building

An example run. It is not hard to see that SGD corresponds to steps 3-5 of Algorithm 1, and the SMB step can possibly reduce to an SGD step. Moreover, the SMB steps produced by Algorithm 1 always lie in the span of the two stochastic gradients, g_k and g^t_k. In particular, when a model step is computed in line 13, we have s_k = w_1 g_k + w_2 g^t_k for some scalars w_1 and w_2, whereas a plain SGD step is simply −α g_k, where α is a constant step length. Therefore, it is interesting to observe how the values of w_1 and w_2 evolve during the course of an SMB run, and how the resulting performance compares to taking SGD steps with various step lengths. For this purpose, we investigate the steps of SMB for one epoch on the MNIST dataset with a batch size of 128 (see Section 3 for details of the experimental setting). We provide in Figure 2 the values of w_1 and w_2 for SMB with α = 0.5 over the 468 steps taken in an epoch. Note that the computations of g^t_k in line 6 of Algorithm 1 may spend a significant portion of the evaluation budget if model steps are taken very often. Figure 2 shows that the SMB algorithm indeed takes model steps quite often in this run, as indicated by the frequency of positive w_2 values. To account for the extra gradient evaluations in computing the model steps, we run SGD with a constant learning rate of α on the same problem for two epochs rather than one. (The elapsed times of sequential runs on a PC with 8GB RAM vary in 8-9 seconds for SGD, and in 11-15 seconds for SMB.) Table 1 presents a summary of the resulting training error and testing accuracy values. We observe that the performance of SMB is significantly more stable for different α values, thanks to the adaptive step length (and the modified search direction) provided by SMB. SGD can achieve performance values comparable to or even better than SMB, but only for the right values of α. In Figure 2, it is interesting to see that the values of w_2 are relatively small. We also observe that if we run SGD with a learning rate close to the average (w_1 + w_2) value, it has an inferior performance. For the SMB run with α = 0.5, for instance, the average (w_1 + w_2) value is close to −0.3. This can be contrasted with the resulting performance of SGD with α = 0.3 in Table 1. These observations suggest that g^t_k contributes to altering the search direction as intended, rather than acting as an additional stochastic gradient step.

2. Convergence Analysis.

The steps of SMB can be considered as a special quasi-Newton update

x_{k+1} = x_k − α_k H_k g_k,

where H_k is a symmetric positive definite matrix serving as an approximation to the inverse Hessian matrix. In Appendix 4, we explain this connection and give an explicit formula for the matrix H_k. We also prove that there exist κ, κ̄ > 0 such that for all k, the matrix H_k satisfies κ I ⪯ H_k ⪯ κ̄ I, where for two matrices A and B, A ⪯ B means B − A is positive semidefinite. It is important to note that H_k is built with the information collected around x_k, in particular g_k. Therefore, unlike stochastic quasi-Newton methods, H_k is correlated with g_k, and hence, E[H_k g_k] need not equal E[H_k] ∇f(x_k). Unfortunately, this difficulty prevents us from using the general framework given by Wang et al. (2017).
To overcome this difficulty and carry on with the convergence analysis, we modify Algorithm 1 such that H_k is calculated with a new independent mini-batch, and therefore, it is independent of g_k. By doing so, we still build a model using the information around x_k. Assuming that g_k is an unbiased estimator of ∇f(x_k), we conclude that E[H_k g_k] = E[H_k] ∇f(x_k). In the rest of this section, we provide a convergence analysis for this modified algorithm, which we call SMBi ('i' stands for independent batch). The steps of SMBi are given in Algorithm 2. As Step 11 shows, we obtain the model building step with a new random batch.
Algorithm 2: SMBi: H_k with an independent batch

Before providing the analysis, let us make the following assumptions:

Assumption 1: Assume that f : R^n → R is continuously differentiable, lower bounded by f_low, and there exists L > 0 such that for any x, y ∈ R^n, ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥.
Assumption 2: Assume that ξ_k, k ≥ 1, are independent samples, that for any iteration k, ξ_k is independent of {x_1, . . ., x_k}, and that E[∥g(x_k, ξ_k) − ∇f(x_k)∥²] ≤ M for some M > 0.

Although Assumption 1 is standard in the stochastic unconstrained optimization literature, one can find different variants of Assumption 2 (see Khaled and Richtárik (2020) for an overview). In this paper, we follow the framework of Wang et al. (2017), which is a special case of Bottou et al. (2018).
In order to be in line with practical implementations and with our experiments, we first provide an analysis covering the constant step length case for (possibly) nonconvex objective functions. Below, we denote by ξ_[T] = (ξ_1, . . ., ξ_T) the random samplings in the first T iterations. Let α_max be the maximum step length that is allowed in the implementation of SMBi and that satisfies (8), where 0 < η < 1. This hyper-parameter of maximum step length is needed in the theoretical results. Observe that since η^{-1} > 1, assuming L ≥ 1 implies that it suffices to choose α_max ≥ 1 to satisfy (8). This further implies that 2/(Lη^{-1} + 2L²α_max) ≤ α_max. The proof of the next convergence result is given in Appendix 4.
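As an illustrative check with example values of our own choosing (not from the original text), take L = 1, η = 1/2, and α_max = 1. Then 2/(Lη^{-1} + 2L²α_max) = 2/(1·2 + 2·1·1) = 1/2 ≤ 1 = α_max, so any constant step length α_k < 1/2 satisfies the condition of Theorem 2.1 below.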
Theorem 2.1 Suppose that Assumption 1 and Assumption 2 hold and {x_k} is generated by SMBi as given in Algorithm 2. Suppose also that {α_k} in Algorithm 2 satisfies 0 < α_k < 2/(Lη^{-1} + 2L²α_max) ≤ α_max for all k. For a given T, let R be a random variable with probability mass function P(R = k), k = 1, . . ., T, defined in terms of the step lengths. Then E[∥∇f(x_R)∥²] is bounded above in terms of D_f := f(x_1) − f_low, M, T, and the step lengths, where the expectation is taken with respect to R and ξ_[T]; the bound simplifies further for a particular constant choice of the step length.

Using this theorem, it is possible to deduce that the stochastic first-order oracle complexity of SMB with random output and constant step length is O(ε^{-2}) (Wang et al., 2017, Corollary 2.12). In Wang et al. (2017, Theorem 2.5), it is shown that under our assumptions above and additional conditions on the step length sequence, if the point sequence {x_k} is generated by the SMBi method (that is, when H_k is calculated with an independent batch in each step) with batch size m_k = m for all k, then there exists a positive constant M_f such that E[f(x_k)] ≤ M_f. Using this observation, the proof of Theorem 2.1, and Theorem 2.8 in Wang et al. (2017), we can also give the following complexity result when the step length sequence is diminishing.
Theorem 2.2 Suppose that Assumption 1 and Assumption 2 hold. Let the batch size m_k = m for all k, and assume that E[f(x_k)] ≤ M_f for some M_f > 0. Then, with a diminishing step length sequence, the expected squared norm of the gradient at a randomly chosen iterate is bounded above by a quantity that decreases with T, where T denotes the iteration number. Moreover, for a given ε ∈ (0, 1), the bound yields an explicit number of iterations T that guarantees this expectation is at most ε.

3. Numerical Experiments.

In this section, we compare SMB and SMBi against Adam (Kingma and Ba, 2015) and SLS (SGD+Armijo) (Vaswani et al., 2019). We have chosen SLS since it is a recent method that uses stochastic line search with backtracking. We have conducted experiments on multi-class classification problems using neural network models; the implementations of the models are taken from https://github.com/IssamLaradji/sls. Our Python package SMB, along with the scripts to conduct our experiments, is available online: https://github.com/sibirbil/SMB. We start our experiments with constant stepsizes for all methods. We should point out that the SLS method adjusts the stepsize after each backtracking process and also uses a stepsize reset algorithm between epochs. We refer to this routine as stepsize auto-scheduling. Our numerical experiments show that even without such an auto-scheduling, the performances of our methods are on par with SLS. Following the experimental setup of Vaswani et al. (2019), the default settings for the hyperparameters of Adam and SLS are used, and α_0 has been set to 1 for SLS and 0.001 for Adam. As regards SMB and SMBi, the constant learning rate has been fixed to 0.5, and the constant c = 0.1 as in SLS. Due to the high computational costs of training the neural networks, we report the results of a single run of each method.
MNIST dataset. On the MNIST dataset, we have used a one-hidden-layer multi-layer perceptron (MLP) of width 1,000. In Figure 3, we see the best performances of all four methods on the MNIST dataset with respect to epochs and run time. The run time represents the total time cost of 100 epochs. Even though SMB and SMBi may calculate an extra function value (forward pass) and an extra gradient (backward pass), we see that on this problem SMB and SMBi achieve the best performance with respect to the run time as well as the number of epochs. More importantly, the generalization performances of SMB and SMBi are also better than those of the other methods.
It should be pointed out that, in practice, choosing a new independent batch means that the SMBi method constructs a model step over two iterations using two batches. This way, the computation cost of each iteration is reduced on average with respect to SMB, but model steps can only be taken in half of the iterations in an epoch. As seen in Figure 3, this does not seem to affect the performance significantly on this problem.
CIFAR10 and CIFAR100 datasets. For the CIFAR10 and CIFAR100 datasets, we have used the standard image-classification architectures ResNet-34 (He et al., 2016) and DenseNet-121 (Huang et al., 2017). As before, we provide the performances of all four methods with respect to epochs and run time. The run times represent the total time cost of 200 epochs. In Figure 4, we see that on CIFAR10 with ResNet-34, SMB performs better than Adam. However, its performance is only comparable to that of SLS. Even though SMB reaches a lower training loss value on CIFAR100 with ResNet-34, this advantage does not show in the test accuracy. In Figure 5, we see a comparison of the performances on CIFAR10 and CIFAR100 with DenseNet-121. SMB with a constant stepsize outperforms all other optimizers in terms of training error and reaches the best test accuracy on CIFAR100, while showing an accuracy similar to that of Adam on CIFAR10.
Our last set of experiments is devoted to demonstrating the robustness of SMB. The preliminary results in Figure 6 show that SMB is robust to the choice of the learning rate, especially in deep neural networks. This aspect of SMB deserves more attention, both theoretically and experimentally.

Conclusion.

Stochastic model building (SMB) is a fast alternative to the stochastic gradient descent method. The algorithm provides a model building approach that replaces the one-step backtracking in stochastic line search methods. We have analyzed the convergence properties of a modification of SMB by rewriting its model building step as a quasi-Newton update and constructing the scaling matrix with a new independent batch. Our numerical results have shown that SMB converges fast and that its performance is insensitive to the selected step length.
In its current state, SMB lacks an internal learning rate adjustment mechanism that could reset the learning rate depending on the progression of the iterations. Our initial experiments show that SMB can greatly benefit from a step length auto-scheduling routine. We consider this as future work. Our convergence rate analysis is given for the alternative algorithm SMBi, which performs competitively against other methods but consistently underperforms the original SMB method. This calls for a convergence analysis of the SMB method itself.

APPENDIX
Proof of Theorem 2.1. First, we show that the SMBi step for each parameter group p can be expressed as a special quasi-Newton update. For brevity, let us use s_k, s^t_k, g_k, g^t_k, and y_k instead of s_{k,p}, s^t_{k,p}, g_{k,p}, g^t_{k,p}, and y_{k,p}, respectively. Recalling the definitions of θ and δ given in (5), the model step can be rearranged so that, for each parameter group p, it takes the form s_{k,p} = −α_k H_{k,p} g_{k,p}, where H_{k,p} is a symmetric matrix expressed in terms of

σ_p := ∥g_{k,p}∥∥y_{k,p}∥ + η^{-1}∥g_{k,p}∥² + y_{k,p}^⊤ g_{k,p},   β_p := σ_p − y_{k,p}^⊤ g_{k,p},   and   γ_p := β_p² − ∥g_{k,p}∥²∥y_{k,p}∥².

Now, assuming that we have the parameter groups {p_1, . . ., p_r}, the SMB steps can be expressed as a quasi-Newton update

x_{k+1} = x_k − α_k H_k g_k,   where   H_k = I if the Armijo condition is satisfied, and H_k = diag(H_{k,p_1}, . . ., H_{k,p_r}) otherwise.
We next show that the eigenvalues of the matrices H_k, k ≥ 1, are bounded from above and below uniformly, which is, of course, obvious when H_k = I. Using the Sherman-Morrison formula twice, one can see that for each parameter group p, the matrix H_{k,p} is indeed the inverse of the positive semidefinite matrix

B_{k,p} = (1/∥g_{k,p}∥²) (σ_p I − g_{k,p} y_{k,p}^⊤ − y_{k,p} g_{k,p}^⊤),

and hence, it is also positive semidefinite. Therefore, it is enough to show the boundedness of the eigenvalues of B_{k,p} uniformly on k and p. Since g_{k,p} y_{k,p}^⊤ + y_{k,p} g_{k,p}^⊤ is a rank-two matrix, σ_p/∥g_{k,p}∥² is an eigenvalue of B_{k,p} with multiplicity n − 2. The remaining extreme eigenvalues are

λ_max(B_{k,p}) = (σ_p + ∥g_{k,p}∥∥y_{k,p}∥ − y_{k,p}^⊤ g_{k,p}) / ∥g_{k,p}∥²   and   λ_min(B_{k,p}) = (σ_p − ∥g_{k,p}∥∥y_{k,p}∥ − y_{k,p}^⊤ g_{k,p}) / ∥g_{k,p}∥²,

with the corresponding eigenvectors ∥y_{k,p}∥ g_{k,p} − ∥g_{k,p}∥ y_{k,p} and ∥y_{k,p}∥ g_{k,p} + ∥g_{k,p}∥ y_{k,p}, respectively. Observe that

λ_min(B_{k,p}) = (∥g_{k,p}∥∥y_{k,p}∥ + η^{-1}∥g_{k,p}∥² + y_{k,p}^⊤ g_{k,p} − ∥g_{k,p}∥∥y_{k,p}∥ − y_{k,p}^⊤ g_{k,p}) / ∥g_{k,p}∥² = η^{-1},

so the smallest eigenvalue of B_{k,p} is bounded away from zero uniformly on k and p. Now, by our assumption of Lipschitz continuity of the gradients, for any x, y ∈ R^n and ξ_k, we have ∥g(x, ξ_k) − g(y, ξ_k)∥ ≤ L∥x − y∥. Thus, observing that ∥y_{k,p}∥ = ∥g^t_{k,p} − g_{k,p}∥ ≤ L∥x^t_{k,p} − x_{k,p}∥ ≤ α_k L∥g_{k,p}∥, we have

λ_max(B_{k,p}) = (∥g_{k,p}∥∥y_{k,p}∥ + η^{-1}∥g_{k,p}∥² + y_{k,p}^⊤ g_{k,p} + ∥g_{k,p}∥∥y_{k,p}∥ − y_{k,p}^⊤ g_{k,p}) / ∥g_{k,p}∥² = (2∥g_{k,p}∥∥y_{k,p}∥ + η^{-1}∥g_{k,p}∥²) / ∥g_{k,p}∥² ≤ η^{-1} + 2Lα_max.

This implies that the eigenvalues of H_{k,p} = B_{k,p}^{-1} are bounded below by 1/(η^{-1} + 2Lα_max) and bounded above by 1, uniformly on k and p. This result, together with our assumptions, shows that the steps of the SMBi algorithm satisfy the conditions of Theorem 2.10 in (Wang et al., 2017) with κ = 1/(η^{-1} + 2Lα_max) and κ̄ = 1, and Theorem 2.1 follows as a corollary.
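These bounds are easy to check numerically. The short script below (our own verification sketch, with randomly generated vectors satisfying ∥y∥ ≤ αL∥g∥) confirms on random instances that the eigenvalues of B_{k,p} fall in [η^{-1}, η^{-1} + 2Lα_max], and hence those of H_{k,p} fall in [1/(η^{-1} + 2Lα_max), 1].

```python
import numpy as np

# Hedged numerical check of the eigenvalue bounds derived above.
rng = np.random.default_rng(1)
L, eta, alpha_max = 1.0, 0.5, 1.0
n = 30
for _ in range(1000):
    alpha = rng.uniform(0.01, alpha_max)
    g = rng.standard_normal(n)
    y = rng.standard_normal(n)
    y *= rng.uniform(0.0, 1.0) * alpha * L * np.linalg.norm(g) / np.linalg.norm(y)  # ||y|| <= alpha*L*||g||
    gn = np.linalg.norm(g)
    sigma = gn * np.linalg.norm(y) + gn**2 / eta + y @ g
    B = (sigma * np.eye(n) - np.outer(g, y) - np.outer(y, g)) / gn**2
    eigs = np.linalg.eigvalsh(B)
    assert eigs.min() >= 1.0 / eta - 1e-8
    assert eigs.max() <= 1.0 / eta + 2 * L * alpha_max + 1e-8
print("eigenvalue bounds verified on random instances")
```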

Figure 2: The coefficients of g_k and g^t_k during a single-epoch run of SMB on the MNIST data with α = 0.5. Model steps are taken quite often, but not at all iterations. The sum of the two coefficients varies in [−0.5, −0.25].

Figure 3: Classification on MNIST with an MLP model.

Figure 6: Robustness of SMB under different choices of the learning rate.

Table 1: Performance on the MNIST data; SMB is run for one epoch, and SGD is run for two epochs.