Stochastic gradient descent (SGD) is a stochastic-approximation type optimization algorithm with several variants and a well-studied theory (Tadić 1997; Chen et al. 2023). It is a popular choice for machine learning applications; in practice, it can achieve fast convergence when its stepsize and its scheduling are tuned well for the specific application at hand. However, this tuning procedure can take up to thousands of CPU/GPU days, resulting in large energy costs (Asi and Duchi 2019). A number of researchers have studied adaptive strategies for improving the direction and the step length choices of the stochastic gradient descent algorithm. Adaptive sample size selection ideas (Byrd et al. 2012; Balles et al. 2017; Bollapragada et al. 2018) improve the direction by reducing its variance around the negative gradient of the empirical loss function, while stochastic quasi-Newton algorithms (Byrd et al. 2016; Wang et al. 2017) provide adaptive preconditioning. Recently, several stochastic line search approaches have been proposed. Not surprisingly, some of these works cover sample size selection as a component of the proposed line search algorithms (Balles et al. 2017; Paquette and Scheinberg 2020).

The Stochastic Model Building (SMB) algorithm proposed in this paper is not designed as a stochastic quasi-Newton algorithm in the sense explained by Bottou et al. (2018). However, it still produces a scaling matrix in the process of generating trial points, and its overall step at each outer iteration can be written as a matrix–vector multiplication. Unlike the algorithms proposed by Mokhtari and Ribeiro (2014) and Schraudolph et al. (2007), we do not accumulate curvature pairs over several iterations. Since no memory is carried from earlier iterations, the scaling matrix at each iteration is based only on the data samples employed in that iteration. In other words, the scaling matrix and the incumbent random gradient vector are dependent. That being said, we also provide a version (SMBi), where the matrix and the gradient vector in question become independent (see Algorithm 2).

Vaswani et al. (2019) apply a deterministic globalization procedure on mini-batch loss functions. That is, the same sample is used in all function and gradient evaluations needed to apply the line search procedure at a given iteration. However, unlike our case, they employ a standard line search procedure that does not alter the search direction. They establish convergence guarantees for the empirical loss function under the interpolation assumption, which requires each component loss function to have zero gradient at a minimizer of the empirical loss. Mutschler and Zell (2020) assume that the optimal learning rate (i.e., step length) along the negative batch gradient is a good estimator of the optimal learning rate with respect to the empirical loss along the same direction. They test the validity of this assumption empirically on deep neural networks (DNNs). Rather than making such strong assumptions, we stick to the general theory for stochastic quasi-Newton methods.

Other works follow a different approach to translating deterministic line search procedures into a stochastic setting, and they do not employ fixed samples. In Mahsereci and Hennig (2017), a probabilistic model along the search direction is constructed via techniques from Bayesian optimization. Learning rates are chosen to maximize the expected improvement with respect to this model and the probability of satisfying the Wolfe conditions. Paquette and Scheinberg (2020) suggest an algorithm closer to the deterministic counterpart, where convergence relies on the requirement that the stochastic function and gradient evaluations approximate their true values with a sufficiently high probability.

Finally, we should mention that the finite-sum minimization problem is a special case of the general expected value minimization problem, and for the finite-sum case certain modification ideas for SGD regarding the selection of the search direction and the step length become applicable. One such idea is gradient aggregation, which adds to the search direction of SGD a variance-reducing component obtained via stochastic gradient evaluations at previous iterates (Roux et al. 2012; Defazio et al. 2014). In Malinovsky et al. (2022), an aggregated-gradient-type step is produced in a distributed setting, where the overall step is obtained by employing step lengths at two levels. Another idea is to use an extended step length control strategy depending on the objective value and the norm of the computed direction that might occasionally set the step length to zero (Liuzzi et al. 2022). However, it is not clear how these ideas can be extended to the more general case of expected value minimization.

With our current work, we make the following contributions. We use a model building strategy for adjusting the step length and the direction of a stochastic gradient vector. This approach also permits us to work on subsets of parameters. This feature makes our model steps not only adaptive, but also suitable for incorporation into existing implementations of DNNs. Our method changes the direction of the step as well as its length. This property separates our approach from backtracking line search algorithms. It also incorporates the most recent curvature information from the current point. This is in contrast with stochastic quasi-Newton methods, which use the information from previous steps. Building on our discussion of the independence of the sample batches, we also give a convergence analysis for SMB. Finally, we illustrate the computational performance of our method with a set of numerical experiments and compare the results against those obtained with other well-known methods.

1 Stochastic model building

We introduce a new stochastic unconstrained optimization algorithm in order to approximately solve problems of the form

$$\begin{aligned} \min _{x\in \Re ^n} \ \ f(x) = \mathbb {E}[F(x, \xi )], \end{aligned}$$
(1)

where \(F{:}\,\mathbb {R}^n \times \mathbb {R}^d \rightarrow \mathbb {R}\) is continuously differentiable and possibly nonconvex, \(\xi \in \mathbb {R}^d\) denotes a random variable, and \(\mathbb {E}[\cdot ]\) stands for the expectation taken with respect to \(\xi\). We assume the existence of a stochastic first-order oracle which outputs a stochastic gradient \(g(x, \xi )\) of f for a given x. A common approach to tackle (1) is to solve the empirical risk problem

$$\begin{aligned} \min _{x\in \Re ^n} \ \ f(x) =\frac{1}{N}\sum _{i=1}^{N} f_i(x), \end{aligned}$$
(2)

where \(f_i{:}\,\mathbb {R}^n \rightarrow \mathbb {R}\) is the loss function corresponding to the ith data sample, and N denotes the data sample size which can be very large in modern applications.
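To make the setting concrete, the following sketch evaluates the empirical risk (2) and a mini-batch stochastic gradient for a toy least-squares loss; the data, loss, and batch size are illustrative placeholders rather than part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 20                       # number of samples, number of parameters
A = rng.standard_normal((N, n))       # features of the i-th sample: A[i]
b = rng.standard_normal(N)            # targets

def grad_f_i(x, i):
    """Gradient of the i-th component loss f_i(x) = 0.5 * (A[i] @ x - b[i])**2."""
    return (A[i] @ x - b[i]) * A[i]

def empirical_risk(x):
    """f(x) = (1/N) * sum_i f_i(x), as in (2)."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def minibatch_gradient(x, batch_size=32):
    """Unbiased estimator g(x, xi) of grad f(x) built from a random mini batch."""
    batch = rng.choice(N, size=batch_size, replace=False)
    return np.mean([grad_f_i(x, i) for i in batch], axis=0)

x = np.zeros(n)
print(empirical_risk(x), np.linalg.norm(minibatch_gradient(x)))
```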

As an alternative approach to line search for SGD, we propose a stochastic model building strategy inspired by the work of Öztoprak and Birbil (2018). Unlike core SGD methods, our approach aims at including curvature information that adjusts not only the step length but also the search direction. Öztoprak and Birbil (2018) consider only the deterministic setting, and they apply the model building strategy repeatedly until sufficient descent is achieved. In our stochastic setting, however, we have observed experimentally that using multiple model steps does not improve the performance much, and its runtime cost can be extremely high in large-scale (e.g., deep learning) problems. Therefore, if sufficient descent is not achieved by the stochastic gradient step, then we construct only one model to adjust the length and the direction of the step.

Conventional stochastic quasi-Newton methods adjust the gradient direction by a scaling matrix that is constructed from the information of the previous steps. Our model building approach, however, uses the most recent curvature information around the latest iterate. In popular deep learning implementations, model parameters come in groups and updates are applied to each parameter group separately. Therefore, we also propose to build a model for each parameter group separately, making the step lengths adaptive.

The proposed iterative algorithm SMB works as follows: At step k, given the iterate \(x_k\), we calculate the stochastic function value \(f_k = f(x_k, \xi _{k})\) and the mini-batch stochastic gradient \(g_k = \frac{1}{m_k}\sum _{i=1}^{m_k}g(x_k, \xi _{k,i})\) at \(x_k\), where \(m_k\) is the batch size, and \(\xi _k = (\xi _{k,1},\ldots ,\xi _{k,m_k})\) is the realization of the random vector \(\xi\). Then, we apply the SGD update to calculate the trial step \(s_k^t = - \alpha _k g_k\), where \(\{\alpha _k\}_k\) is a sequence of learning rates. With this trial step, we also calculate the function and gradient values \(f^t_k = f(x^t_k, \xi _{k})\) and \(g^t_k = g(x^t_k, \xi _{k})\) at \(x^t_k = x_k + s^t_k\). Then, we check the stochastic Armijo condition

$$\begin{aligned} f^t_k \le f_k - c \ \alpha _k \Vert g_k\Vert ^2, \end{aligned}$$
(3)

where \(c > 0\) is a hyper-parameter. If the condition is satisfied, then sufficient decrease is achieved and we set \(x_{k+1} = x^t_k\) as the next iterate. If the Armijo condition is not satisfied, then, following Öztoprak and Birbil (2018), we build a quadratic model using the linear models at the points \(x_{k,p}\) and \(x^t_{k,p}\) for each parameter group p and find the step \(s_{k,p}\) that reaches its minimum point. Here, \(x_{k,p}\) and \(x^t_{k,p}\) denote respectively the coordinates of \(x_{k}\) and \(x^t_{k}\) that correspond to the parameter group p. We then calculate the next iterate \(x_{k+1} = x_k + s_k\), where \(s_k = (s_{k,p_1}, \ldots , s_{k,p_r})\) and r is the number of parameter groups, and proceed to the next step with \(x_{k+1}\). This model step, when needed, requires extra mini-batch function and gradient evaluations (a forward and a backward pass in deep neural networks).
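A minimal sketch of one outer iteration of this scheme is given below for the case of a single parameter group. It assumes that the mini-batch function and gradient oracles `loss` and `grad` are evaluated on the same fixed batch \(\xi_k\), and it treats the model step as a black-box callable `model_step`; a transcription of the closed-form model step is sketched after (5) below. The function names and signatures are illustrative, not the interface of our released package.

```python
import numpy as np

def smb_iteration(x_k, loss, grad, model_step, alpha=0.5, c=0.1):
    """One outer iteration: SGD trial step, stochastic Armijo check, model step.

    `loss` and `grad` evaluate the mini-batch loss and gradient on the same
    fixed batch xi_k; `model_step` builds s_k from (s_k^t, g_k, g_k^t).
    """
    f_k = loss(x_k)
    g_k = grad(x_k)

    # Trial step: plain SGD step with learning rate alpha (s_k^t = -alpha * g_k).
    s_t = -alpha * g_k
    x_t = x_k + s_t
    f_t = loss(x_t)

    # Stochastic Armijo condition (3) on the same batch xi_k.
    if f_t <= f_k - c * alpha * np.dot(g_k, g_k):
        return x_t                        # sufficient decrease: accept x_k^t

    # Otherwise, adjust both length and direction with a single model step.
    g_t = grad(x_t)
    s_k = model_step(s_t, g_k, g_t)
    return x_k + s_k
```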

For each parameter group \(p \in \{p_1, \ldots , p_r\}\), the quadratic model is built by combining the linear models at \(x_{k,p}\) and \(x^t_{k,p}\), given by

$$\begin{aligned} l_{k,p}^0(s):= f_{k} + g_{k,p}^\top s \ \ \ \hbox { and } \ \ \ l_{k,p}^t(s-s^t_{k,p}):= f^t_{k} + (g^t_{k,p})^\top (s-s^t_{k,p}), \end{aligned}$$

respectively. Then, the quadratic model becomes

$$\begin{aligned} m^t_{k,p}(s) = \alpha _{k,p}\, l^0_{k,p}(s) + (1 - \alpha _{k,p})\, l^t_{k,p}(s-s^t_{k,p}), \end{aligned}$$

where

$$\begin{aligned} \alpha _{k,p} = -\frac{(s - s^t_{k,p})^\top s^t_{k,p}}{\Vert s^t_{k,p}\Vert ^2}. \end{aligned}$$

The constraint

$$\begin{aligned} \Vert s\Vert ^2 + \Vert s-s^t_{k,p}\Vert ^2 \le \Vert s^t_{k,p}\Vert ^2, \end{aligned}$$

is also imposed so that the minimum is attained in the region bounded by \(x_{k,p}\) and \(x^t_{k,p}\). This constraint acts like a trust region. Figure 1 shows the steps of this construction.

Fig. 1
figure 1

An iteration of SMB on a simple quadratic function. We assume for simplicity that there is only one parameter group, and hence, we drop the subscript p. The algorithm first computes the trial point \(x_k^t\) by taking the (stochastic) gradient step \(s_k^t\). If this point is not acceptable, then it builds a model using the information at \(x_k\) and \(x_k^t\), and computes the next iterate \(x_{k+1}=x_k+s_k\). Note that \(s_k\) not only has a smaller length than the trial step \(s_k^t\), but it also lies along a direction that decreases the function value

In this work, we solve a relaxation of this constrained model as explained in Öztoprak and Birbil (2018, Section 2.2), where the full approach for finding an approximate solution of the constrained problem can be found. The minimum value of the relaxed model is attained at the point \(x_{k,p} + s_{k,p}\) with

$$\begin{aligned} s_{k,p} = c_{g,p} (\delta ) g_{k,p} + c_{y,p} (\delta ) y_{k,p} + c_{s,p} (\delta ) s^t_{k,p}, \end{aligned}$$
(4)

where \(y_{k,p}:= g^t_{k,p} - g_{k,p}\). Here, the coefficients are given as

$$\begin{aligned} c_{g,p} (\delta )&= -\frac{\Vert s_{k,p}^t\Vert ^2}{\delta }, \\ c_{y,p} (\delta )&= -\frac{\Vert s_{k,p}^t\Vert ^2}{\delta \theta }\left[ -(y_{k,p}^\top s_{k,p}^t + \delta )(s_{k,p}^t)^\top g_{k,p} + \Vert s_{k,p}^t\Vert ^2 y_{k,p}^\top g_{k,p}\right] ,\\ c_{s,p}(\delta )&= -\frac{\Vert s_{k,p}^t\Vert ^2}{\delta \theta }\left[ -(y_{k,p}^\top s_{k,p}^t + \delta )y_{k,p}^\top g_{k,p} + \Vert y_{k,p}\Vert ^2(s_{k,p}^t)^\top g_{k,p}\right] , \end{aligned}$$

with

$$\begin{aligned} { \theta = \left( y_{k,p}^\top s_{k,p}^t + 2\delta \right) ^2-\Vert s_{k,p}^t\Vert ^2\Vert y_{k,p}\Vert ^2 \ \hbox { and } \ \delta = \Vert s_{k,p}^t\Vert \left( \Vert y_{k,p}\Vert +\frac{1}{\eta }\Vert g_{k,p}\Vert \right) - y_{k,p}^\top s_{k,p}^t,} \end{aligned}$$
(5)

where \(0< \eta < 1\) is a constant. Then, the adaptive model step becomes \(s_k = (s_{k,p_1}, \ldots , s_{k,p_r})\). We note that our construction in terms of different parameter groups lends itself to constructing a different model for each parameter subspace.
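The sketch below is a direct transcription of (4) and (5) for a single parameter group and can serve as the `model_step` callable used in the iteration sketch above; the default value of `eta` is illustrative.

```python
import numpy as np

def model_step(s_t, g, g_t, eta=0.99):
    """Model step s_{k,p} of (4) with the coefficients in (5).

    s_t : trial step s^t_{k,p}
    g   : mini-batch gradient g_{k,p} at x_{k,p}
    g_t : mini-batch gradient g^t_{k,p} at the trial point x^t_{k,p}
    """
    y = g_t - g                               # y_{k,p} = g^t_{k,p} - g_{k,p}
    ss = np.dot(s_t, s_t)                     # ||s^t||^2
    ys = np.dot(y, s_t)                       # y^T s^t
    yy = np.dot(y, y)                         # ||y||^2
    sg = np.dot(s_t, g)                       # (s^t)^T g
    yg = np.dot(y, g)                         # y^T g

    # delta and theta as defined in (5)
    delta = np.sqrt(ss) * (np.sqrt(yy) + np.linalg.norm(g) / eta) - ys
    theta = (ys + 2.0 * delta) ** 2 - ss * yy

    # coefficients c_g, c_y, c_s of (4)
    c_g = -ss / delta
    c_y = -(ss / (delta * theta)) * (-(ys + delta) * sg + ss * yg)
    c_s = -(ss / (delta * theta)) * (-(ys + delta) * yg + yy * sg)

    return c_g * g + c_y * y + c_s * s_t
```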

We summarize the steps of SMB in Algorithm 1. Line 5 shows the trial point, which is obtained with the standard stochastic gradient step. If this step satisfies the stochastic Armijo condition, then we proceed with the next iteration (line 8). Otherwise, we continue with building the models for each parameter group (lines 11–13), and move to the next iteration with the model building step in line 14.

Algorithm 1
figure a

SMB: Stochastic Model Building

An example run It is not hard to see that SGD corresponds to steps 3–5 of Algorithm 1, and the SMB step can possibly reduce to an SGD step. Moreover, the SMB steps produced by Algorithm 1 always lie in the span of the two stochastic gradients, \(g_k\) and \(g_k^t\). In particular, when a model step is computed in line 13, we have

$$\begin{aligned} s_{k}=w_1g_k+w_2g_k^t \text { with } w_1 = c_g(\delta )-c_y(\delta )-c_s(\delta )\alpha \text { and } w_2 = c_y(\delta ), \end{aligned}$$

where \(\alpha\) is a constant step length. Therefore, it is interesting to observe how the values of \(w_1\) and \(w_2\) evolve during the course of an SMB run, and how the resulting performance compares to taking SGD steps with various step lengths. For this purpose, we investigate the steps of SMB for one epoch on the MNIST dataset with a batch size of 128 (see Sect. 3 for details of the experimental setting).
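The two-term form above follows directly from (4) by substituting \(s_k^t = -\alpha g_k\) and \(y_k = g_k^t - g_k\):

$$\begin{aligned} s_k = c_g(\delta ) g_k + c_y(\delta )\left( g_k^t - g_k\right) + c_s(\delta )\left( -\alpha g_k\right) = \left( c_g(\delta ) - c_y(\delta ) - c_s(\delta )\alpha \right) g_k + c_y(\delta )\, g_k^t. \end{aligned}$$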

We provide in Fig. 2 the values of \(w_1\) and \(w_2\) for SMB with \(\alpha =0.5\) over the 468 steps taken in an epoch. Note that the computation of \(g_k^t\) in line 6 of Algorithm 1 may consume a significant portion of the evaluation budget if model steps are taken very often. Figure 2 shows that the SMB algorithm indeed takes model steps quite frequently in this run, as indicated by the frequency of positive \(w_2\) values. To account for the extra gradient evaluations in computing the model steps, we run SGD with a constant learning rate of \(\alpha\) on the same problem for two epochs rather than one. (The elapsed times of sequential runs on a PC with 8GB RAM vary between 8 and 9 s for SGD, and between 11 and 15 s for SMB.) Table 1 presents a summary of the resulting training error and testing accuracy values. We observe that the performance of SMB is significantly more stable across different \(\alpha\) values, thanks to the adaptive step length (and the modified search direction) provided by SMB. SGD can achieve performance values comparable to or even better than SMB, but only for the right values of \(\alpha\). In Fig. 2, it is interesting to see that the values of \(w_2\) are relatively small. We also observe that if we run SGD with a learning rate close to the average \((w_1+w_2)\) value, its performance is inferior. For the SMB run with \(\alpha =0.5\), for instance, the average \((w_1+w_2)\) value is close to \(-0.3\). This can be contrasted with the resulting performance of SGD with \(\alpha =0.3\) in Table 1. These observations suggest that \(g_k^t\) contributes to altering the search direction as intended, rather than acting as an additional stochastic gradient step.

Table 1 Performance on the MNIST data; SMB is run for one epoch, and SGD is run for two epochs
Fig. 2
figure 2

The coefficients of \(g_k\) and \(g_k^t\) during a single-epoch run of SMB on the MNIST data with \(\alpha =0.5\). Model steps are taken quite often, but not at all iterations. The sum of the two coefficients varies in [−0.5, −0.25]

2 Convergence analysis

The steps of SMB can be considered as a special quasi-Newton update:

$$\begin{aligned} x_{k+1} = x_k -\alpha _k H_k g_k, \end{aligned}$$
(6)

where \(H_k\) is a symmetric positive definite matrix serving as an approximation to the inverse Hessian matrix. In the Appendix, we explain this connection and give an explicit formula for the matrix \(H_k\). We also prove that there exist \(\underline{\kappa }, \overline{\kappa } > 0\) such that for all k, the matrix \(H_k\) satisfies

$$\begin{aligned} \underline{\kappa } I \preceq H_k \preceq \overline{\kappa } I, \end{aligned}$$
(7)

where for two matrices A and B, \(A \preceq B\) means \(B - A\) is positive semidefinite. It is important to note that \(H_k\) is built with the information collected around \(x_k\), particularly, \(g_k\). Therefore, unlike stochastic quasi-Newton methods, \(H_k\) is correlated with \(g_k\), and hence, \(\mathbb {E}_{\xi _k}[H_k g_k]\) is very difficult to analyze. Unfortunately, this difficulty prevents us from using the general framework given by Wang et al. (2017).

To overcome this difficulty and carry on with the convergence analysis, we modify Algorithm 1 so that \(H_k\) is calculated with a new independent mini batch, and is therefore independent of \(g_k\). By doing so, we still build a model using the information around \(x_k\). Assuming that \(g_k\) is an unbiased estimator of \(\nabla f\), we conclude that \(\mathbb {E}_{\xi _k}[H_kg_k] = H_k \nabla f\). In the rest of this section, we provide a convergence analysis for this modified algorithm, which we call SMBi (‘i’ stands for independent batch). The steps of SMBi are given in Algorithm 2. As Step 11 shows, we obtain the model building step with a new random batch.

Algorithm 2
figure b

SMBi: \(H_k\) with an independent batch

Before providing the analysis, let us make the following assumptions:

Assumption 1

Assume that \(f{:}\,\mathbb {R}^n \rightarrow \mathbb {R}\) is continuously differentiable, lower bounded by \(f^{low}\), and there exists \(L > 0\) such that for any \(x,y \in \mathbb {R}^n\), \(\Vert \nabla f(x) - \nabla f(y)\Vert \le L \Vert x-y\Vert\).

Assumption 2

Assume that \(\xi _k\), \(k \ge 1\), are independent samples and for any iteration k, \(\xi _k\) is independent of \(\{x_j\}_{j=1}^k\), \(\mathbb {E}_{\xi _k}[g(x_k, \xi _k)] = \nabla f(x_k)\) and \(\mathbb {E}_{\xi _k}[\Vert g(x_k, \xi _k) - \nabla f(x_k)\Vert ^2] \le M^2\), for some \(M > 0\).

Although Assumption 1 is standard in the stochastic unconstrained optimization literature, one can find different variants of Assumption 2 (see Khaled and Richtárik (2020) for an overview). In this paper, we follow the framework of Wang et al. (2017), which is a special case of Bottou et al. (2018).

In order to be in line with practical implementations and with our experiments, we first provide an analysis covering the constant step length case for (possibly) non-convex objective functions. Below, we denote by \(\xi _{[T]} = (\xi _1, \ldots , \xi _T)\) the random samplings in the first T iterations. Let \(\alpha _{max}\) be the maximum step length that is allowed in the implementation of SMBi with

$$\begin{aligned} \alpha _{max} \ge \frac{-1 + \sqrt{1+16\eta ^2}}{4L\eta }, \end{aligned}$$
(8)

where \(0< \eta < 1\). This maximum step length hyper-parameter is needed for the theoretical results. Observe that since \(\eta ^{-1} > 1\), assuming \(L \ge 1\) implies that it suffices to choose \(\alpha _{max} \ge 1\) to satisfy (8). This implies further that \(2/(L\eta ^{-1} + 2L^2\alpha _{max}) \le \alpha _{max}\). The proof of the next convergence result is given in the Appendix.
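To verify the first claim, note that for \(0< \eta < 1\) and \(L \ge 1\), the right-hand side of (8) is at most 1:

$$\begin{aligned} \frac{-1 + \sqrt{1+16\eta ^2}}{4L\eta } \le 1 \ \iff \ \sqrt{1+16\eta ^2} \le 1 + 4L\eta \ \iff \ 16\eta ^2 \le 8L\eta + 16L^2\eta ^2, \end{aligned}$$

and the last inequality holds since \(L \ge 1\). Hence, any \(\alpha _{max} \ge 1\) satisfies (8).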

Theorem 2.1

Suppose that Assumptions 1 and 2 hold and \(\{x_k\}\) is generated by SMBi as given in Algorithm 2. Suppose also that \(\{\alpha _k\}\) in Algorithm 2 satisfies that \(0< \alpha _k < 2/(L\eta ^{-1} + 2\,L^2\alpha _{max}) \le \alpha _{max}\) for all k. For given T, let R be a random variable with the probability mass function

$$\begin{aligned} \mathbb {P}_R(k):= \mathbb {P}\{R=k\} = \frac{\alpha _k / (\eta ^{-1} + 2L\alpha _{max}) - \alpha ^2_kL / 2}{\sum _{k=1}^T (\alpha _k / (\eta ^{-1} + 2L\alpha _{max}) - \alpha ^2_kL / 2)} \end{aligned}$$

for \(k = 1, \ldots , T\). Then, we have

$$\begin{aligned} \mathbb {E}[\Vert \nabla f(x_R)\Vert ^2] \le \frac{D_f + (M^2L/ 2) \sum _{k=1}^T (\alpha _k^2/m_k)}{\sum _{k=1}^T (\alpha _k / (\eta ^{-1} + 2L\alpha _{max}) - \alpha ^2_kL / 2)}, \end{aligned}$$

where \(D_f:= f(x_1) - f^{low}\) and the expectation is taken with respect to R and \(\xi _{[T]}\). Moreover, if we choose \(\alpha _k = 1/(L\eta ^{-1} + 2\,L^2\alpha _{max})\) and \(m_k = m\) for all \(k = 1, \ldots , T\), then this reduces to

$$\begin{aligned} \mathbb {E}[\Vert \nabla f(x_R)\Vert ^2] \le \frac{2L(\eta ^{-1} + 2L\alpha _{max})^2 D _f}{T} + \frac{M^2}{m}. \end{aligned}$$
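The second bound follows from the first by direct substitution: with \(\alpha _k = 1/(L\eta ^{-1} + 2L^2\alpha _{max})\) and \(m_k = m\), each term in the denominator simplifies to

$$\begin{aligned} \frac{\alpha _k}{\eta ^{-1} + 2L\alpha _{max}} - \frac{\alpha _k^2 L}{2} = \frac{1}{L(\eta ^{-1} + 2L\alpha _{max})^2} - \frac{1}{2L(\eta ^{-1} + 2L\alpha _{max})^2} = \frac{1}{2L(\eta ^{-1} + 2L\alpha _{max})^2}, \end{aligned}$$

so the denominator equals \(T/\big (2L(\eta ^{-1} + 2L\alpha _{max})^2\big )\) and the second term of the numerator equals \(M^2 T/\big (2mL(\eta ^{-1} + 2L\alpha _{max})^2\big )\); dividing gives the stated bound.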

Using this theorem, it is possible to deduce that the stochastic first-order oracle complexity of SMB with random output and constant step length is \(\mathcal {O}(\epsilon ^{-2})\) (Wang et al. 2017, Corollary 2.12). In Wang et al. (2017, Theorem 2.5), it is shown that under our assumptions above and the extra assumptions of \(0 < \alpha _k \le \frac{1}{L (\eta ^{-1} + 2L\alpha _{max})} \le \alpha _{max}\), \(\sum _{k=1}^{\infty } \alpha _k = \infty\) and \(\sum _{k=1}^{\infty } \alpha _k^2 < \infty\), if the point sequence \(\{x_k\}\) is generated by the SMBi method (where \(H_k\) is calculated with an independent batch in each step) with batch size \(m_k = m\) for all k, then there exists a positive constant \(M_f\) such that \(\mathbb {E}[f(x_k)] \le M_f\). Using this observation together with the proofs of Theorem 2.1 above and Theorem 2.8 in Wang et al. (2017), we can also give the following complexity result when the step length sequence is diminishing.

Theorem 2.2

Suppose that Assumption 1 and Assumption 2 hold. Let the batch size \(m_k = m\) for all \(k\) and assume that \(\alpha _k = \frac{1}{L (\eta ^{-1} + 2\,L\alpha _{max})} k^{-\phi }\) with \(\phi \in (0.5, 1)\) for all k. Then \(\{x_k\}\) generated by SMBi satisfies

$$\begin{aligned} \frac{1}{T} \sum _{k=1}^T \mathbb {E}[\Vert \nabla f(x_k)\Vert ^2] \le 2L (\eta ^{-1} + 2L\alpha _{max})(M_f - f^{low}) T^{\phi - 1} + \frac{M^2}{(1-\phi )m}(T^{-\phi } - T^{-1}) \end{aligned}$$

for some \(M_f > 0\), where T denotes the iteration number. Moreover, for a given \(\epsilon \in (0,1)\), to guarantee that \(\frac{1}{T} \sum _{k=1}^T \mathbb {E}[\Vert \nabla f(x_k)\Vert ^2] < \epsilon\), the number of required iterations T is at most \(O\left( \epsilon ^{-\frac{1}{1-\phi }}\right)\).

3 Numerical experiments

In this section, we compare SMB and SMBi against Adam (Kingma and Ba 2015) and SLS (SGD+Armijo) (Vaswani et al. 2019). We have chosen SLS since it is a recent method that uses stochastic line search with backtracking. We have conducted experiments on multi-class classification problems using neural network models. Our Python package SMB along with the scripts to conduct our experiments are available online: https://github.com/sibirbil/SMB

We start our experiments with constant stepsizes for all methods. We should point out that the SLS method adjusts the stepsize after each backtracking process and also uses a stepsize reset routine between epochs. We refer to this routine as stepsize auto-scheduling. Our numerical experiments show that even without such auto-scheduling, the performance of our methods is on par with SLS. Following the experimental setup in He et al. (2016), the default hyperparameter settings of Adam and SLS are used, and \(\alpha _0\) is set to 1 for SLS and 0.001 for Adam. For SMB and SMBi, the constant learning rate is fixed to 0.5, and the constant \(c = 0.1\) as in SLS. Due to the high computational cost of training the neural networks, we report the results of a single run of each method.

MNIST dataset On the MNIST dataset, we have used a one-hidden-layer multi-layer perceptron (MLP) of width 1,000.
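For reference, a minimal sketch of how SMB can be plugged into a standard PyTorch training loop with this model is shown below. The constructor arguments mirror the hyper-parameters above (learning rate 0.5 and \(c = 0.1\)), but the argument names and the closure convention are assumptions made for illustration; the exact interface is documented in the repository linked above.

```python
import torch
import torch.nn.functional as F
from smb import SMB  # package from https://github.com/sibirbil/SMB

# One-hidden-layer MLP of width 1,000 for MNIST (28 x 28 = 784 inputs).
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(784, 1000), torch.nn.ReLU(),
    torch.nn.Linear(1000, 10),
)

# Argument names (lr, c) are illustrative; see the repository for the actual signature.
optimizer = SMB(model.parameters(), lr=0.5, c=0.1)

def train_epoch(loader):
    model.train()
    for data, target in loader:
        # SMB needs a closure so that it can re-evaluate the mini-batch loss
        # (and gradient) at the trial point x_k^t on the same batch xi_k.
        def closure():
            optimizer.zero_grad()
            loss = F.cross_entropy(model(data), target)
            loss.backward()
            return loss
        optimizer.step(closure=closure)
```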

In Fig. 3, we see the best performances of all four methods on the MNIST dataset with respect to epochs and run time. The run time represents the total time cost of 100 epochs. Even though SMB and SMBi may calculate an extra function value (forward pass) and an extra gradient (backward pass), we see that, on this problem, SMB and SMBi achieve the best performance with respect to the run time as well as the number of epochs. More importantly, the generalization performances of SMB and SMBi are also better than those of the other two methods.

Fig. 3
figure 3

Classification on MNIST with an MLP model

It should be pointed out that, in practice, choosing a new independent batch means that the SMBi method constructs a model step over two iterations using two batches. This way, the computational cost of each iteration is reduced on average with respect to SMB, but model steps can only be taken in half of the iterations of an epoch. As seen in Fig. 3, this does not seem to affect the performance significantly on this problem.

CIFAR10 and CIFAR100 datasets For the CIFAR10 and CIFAR100 datasets, we have used the standard image-classification architectures ResNet-34 (He et al. 2016) and DenseNet-121 (Huang et al. 2017). As before, we provide performances of all four methods with respect to epochs and run time. The run times represent the total time cost of 200 epochs.

In Fig. 4, we see that on CIFAR10-ResNet34, SMB performs better than the Adam algorithm. However, its performance is only comparable to that of SLS. Even though SMB reaches a lower training loss value on CIFAR100-ResNet34, this advantage is not reflected in the test accuracy.

Fig. 4
figure 4

Classification on CIFAR10 (left column) and CIFAR100 (right column) with ResNet-34 model

In Fig. 5, we see a comparison of the performances on CIFAR10 and CIFAR100 with DenseNet-121. SMB with a constant stepsize outperforms all other optimizers in terms of training error and reaches the best test accuracy on CIFAR100, while showing accuracy similar to that of Adam on CIFAR10.

Fig. 5
figure 5

Classification on CIFAR10 (left column) and CIFAR100 (right column) with DenseNet-121 model

Our last set of experiments is devoted to demonstrating the robustness of SMB. The preliminary results in Fig. 6 show that SMB is robust to the choice of the learning rate, especially in deep neural networks. This aspect of SMB deserves more attention, both theoretically and experimentally.

Fig. 6
figure 6

Robustness of SMB under different choices of the learning rate

4 Conclusion

Stochastic model building (SMB) is a fast alternative to the stochastic gradient descent method. The algorithm provides a model building approach that replaces the one-step backtracking in stochastic line search methods. We have analyzed the convergence properties of a modification of SMB by rewriting its model building step as a quasi-Newton update and constructing the scaling matrix with a new independent batch. Our numerical results have shown that SMB converges fast and that its performance is insensitive to the selected step length.

In its current state, SMB lacks an internal learning rate adjustment mechanism that could reset the learning rate depending on the progression of the iterations. Our initial experiments show that SMB can greatly benefit from a step length auto-scheduling routine. We will consider this in future work. Our convergence rate analysis is given for the alternative algorithm SMBi, which performs competitively against other methods but consistently underperforms the original SMB method. This calls for a convergence analysis of the SMB method itself.