# Scalable estimation strategies based on stochastic approximations: classical results and new insights

## Abstract

Estimation with large amounts of data can be facilitated by stochastic gradient methods, in which model parameters are updated sequentially using small batches of data at each step. Here, we review early work and modern results that illustrate the statistical properties of these methods, including convergence rates, stability, and asymptotic bias and variance. We then overview modern applications where these methods are useful, ranging from an online version of the EM algorithm to deep learning. In light of these results, we argue that stochastic gradient methods are poised to become benchmark principled estimation procedures for large datasets, especially those in the family of stable proximal methods, such as implicit stochastic gradient descent.

### Keywords

Maximum likelihood, Recursive estimation, Implicit stochastic gradient descent methods, Optimal learning rate, Asymptotic analysis, Big data

## 1 Introduction

Parameter estimation by optimization of an objective function, such as maximum likelihood and maximum a-posteriori, is a fundamental idea in statistics and machine learning (Fisher 1922; Lehmann and Casella 2003; Hastie et al. 2011). However, widely used optimization-based estimation algorithms, such as Fisher scoring, the EM algorithm, and iteratively reweighted least squares (Fisher 1925a; Dempster et al. 1977; Green 1984), are not scalable to modern datasets with hundreds of millions of data points and hundreds of thousands of covariates (National Research Council 2013).

These algorithms are *iterative* and have a running-time complexity that ranges between \(\mathcal {O}(Np^3)\) and \(\mathcal {O}(Np)\), in the worst and best cases, respectively. Newton–Raphson methods, for instance, update an estimate \(\varvec{\theta }_{n-1}^{\mathrm {nr}}\) of the parameters through the recursion

Quasi-Newton (QN) methods are a powerful alternative and are widely used in practice. In QN methods, the Hessian is approximated by a low-rank matrix that is updated at each iteration as new values of the gradient become available, thus yielding algorithms with complexity \(\mathcal {O}(Np^2)\) or \(\mathcal {O}(Np)\) in certain favorable cases (Hennig and Kiefel 2013). Other general estimation algorithms such as EM or iteratively reweighted least squares (Green 1984) involve computations (e.g., inversions or maximizations between iterations) that are significantly more expensive than QN methods.

In this chapter we focus on methods that use *first-order* information, i.e., methods that utilize only gradient computations.

^{1}Such performance is achieved by the *stochastic gradient descent* (SGD) algorithm, which was initially proposed by Sakrison (1965) as a *recursive estimation method*, albeit not in first-order form. A typical first-order SGD is defined by the iteration

Iteration (2) is known as SGD with *explicit updates*, or *explicit SGD* for short, because the next iterate \(\varvec{\theta }_{n}^{\mathrm {sgd}}\) can be computed immediately after the new data point \(\varvec{y}_n\) is observed.

^{2}The sequence \(a_n>0\) is a carefully chosen *learning rate* sequence, typically defined such that \(n a_n\rightarrow \alpha > 0\) as \(n \rightarrow \infty \). The parameter \(\alpha >0\) is the *learning rate parameter*, and it is crucial for the convergence and stability of explicit SGD.

From a computational perspective, the SGD procedure (2) is appealing because the expensive inversion of \(p\times p\) matrices, as in Newton–Raphson, is replaced by a single sequence \(a_n >0\). Furthermore, the log-likelihood is evaluated at a single observation \(\varvec{y}_n\), not on the entire dataset \({\varvec{Y}}\). Necessarily, this incurs an information loss that is important to quantify. From a theoretical perspective, the explicit SGD updates are justified because, under typical regularity conditions, \(\mathbb {E}\left( \nabla \ell (\varvec{\theta _\star }; \varvec{y}_{n}) \right) =0 \), and thus \(\varvec{\theta }_n\rightarrow \varvec{\theta _\star }\) by the properties of the Robbins–Monro procedure (Robbins and Monro 1951). However, the explicit SGD procedure requires careful tuning of the learning rate parameter: small values of \(\alpha \) make iteration (2) very slow to converge, whereas for large values of \(\alpha \) explicit SGD will either have a large asymptotic variance or even diverge numerically. As a recursive estimation method, explicit SGD was first proposed by Sakrison (1965) and has attracted attention in the machine learning community as a fast prediction method for large-scale problems (Le et al. 2004; Zhang 2004).
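As an illustration of iteration (2), the following minimal sketch runs first-order explicit SGD with \(a_n = \alpha /n\) on a hypothetical normal-mean model (an assumption of this sketch, not an example from the text): the score of \(\mathcal {N}(\theta , 1)\) is simply \(y - \theta \), so with \(\alpha = 1\) the recursion reproduces the running sample mean exactly.

```python
import random

def explicit_sgd(ys, alpha=1.0, theta0=0.0):
    """First-order explicit SGD for the mean of N(theta, 1).

    The score is grad l(theta; y) = y - theta, and the learning rate is
    a_n = alpha / n, so that n * a_n -> alpha as in the text.
    """
    theta = theta0
    for n, y in enumerate(ys, start=1):
        a_n = alpha / n
        theta = theta + a_n * (y - theta)  # ascent step on the log-likelihood
    return theta

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(5000)]
theta_hat = explicit_sgd(data, alpha=1.0)  # coincides with the sample mean here
```

With \(\alpha = 1\) and \(\theta _0 = 0\) the iterates satisfy \(\theta _n = ((n-1)\theta _{n-1} + y_n)/n\), i.e., exactly the running average, which is also the maximum-likelihood estimator in this toy model.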

The *implicit SGD* procedure is defined through the iteration

The update is *implicit* because the next iterate \(\varvec{\theta }_{n}^{\mathrm {im}}\) appears on both sides of the equation.

^{3}This simple tweak of the explicit SGD procedure has quite remarkable statistical properties. In particular, assuming a common starting point \(\varvec{\theta }_{n-1}^{\mathrm {sgd}} = \varvec{\theta }_{n-1}^{\mathrm {im}} \triangleq \varvec{\theta }\), one can show through a simple Taylor approximation of (3) around \(\varvec{\theta }\), that the implicit update satisfies

*observed* Fisher information matrix. Thus, the implicit SGD procedure computes updates that are a *shrunken* version of the explicit ones. In contrast to explicit SGD, implicit SGD is significantly more stable in small samples, and it is also robust to misspecification of the learning rate parameter \(\alpha \). Furthermore, implicit SGD computes iterates that remain in the support of the parameter space, whereas explicit SGD would normally require an additional projection step. Arguably, the normalized least mean squares (NLMS) filter (Nagumo and Noda 1967) was the first statistical model that used an implicit update as in Eq. (3), and it was shown to be consistent and robust to input noise (Slock 1993). Theoretical justification for implicit SGD comes either from implicit variations of the Robbins–Monro procedure (Toulis et al. 2014), or through

*proximal methods* in optimization (Parikh and Boyd 2013), such as mirror descent (Nemirovski 1983; Beck and Teboulle 2003). Assuming differentiability of the log-likelihood, the implicit SGD update (3) can be expressed as a proximal method through the solution of

*proximal operator*. The update in Eq. (5) is the stochastic version of the deterministic proximal point algorithm of Rockafellar (1976), and it has recently been analyzed, in various forms, for convergence and stability (Ryu and Boyd 2014; Rosasco et al. 2014). Recent work has established the consistency of certain implicit methods similar to (3) (Kivinen and Warmuth 1995; Kivinen et al. 2006; Kulis and Bartlett 2010), and their robustness has proved useful in a range of modern machine learning problems (Nemirovski et al. 2009; Kulis and Bartlett 2010; Schuurmans and Caelli 2007).
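For intuition, the implicit update (3) solves in closed form in simple models. The sketch below uses the same hypothetical normal-mean model as before (an assumption of this sketch, not from the text): there, \(\theta _n = \theta _{n-1} + a_n(y_n - \theta _n)\) solves to \(\theta _n = (\theta _{n-1} + a_n y_n)/(1 + a_n)\), and the factor \(1/(1+a_n)\) is exactly the shrinkage of Eq. (4), since the Fisher information equals one.

```python
def implicit_sgd(ys, alpha=1.0, theta0=0.0):
    """First-order implicit SGD for the mean of N(theta, 1).

    The fixed-point equation theta_n = theta_{n-1} + a_n * (y_n - theta_n)
    is linear in theta_n, so it solves in closed form; dividing by
    (1 + a_n) is the shrinkage discussed around Eq. (4).
    """
    theta = theta0
    for n, y in enumerate(ys, start=1):
        a_n = alpha / n
        theta = (theta + a_n * y) / (1.0 + a_n)
    return theta
```

Even a very large \(\alpha \) only shrinks more aggressively toward the data; it cannot make the iterates explode, in contrast to the explicit update.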

The structure of this chapter is as follows. In Sect. 2 we give an overview of the Robbins–Monro procedure and Sakrison’s recursive estimation method, which form the theoretical basis of SGD methods; we further provide a quick overview of early results on the statistical efficiency of the aforementioned methods. In Sect. 3, we formally introduce explicit and implicit SGD, and treat those procedures as *statistical estimation methods* that provide an estimator \(\varvec{\theta }_n\) of the model parameters \(\varvec{\theta _\star }\) after \(n\) iterations. In Sect. 3.1 we give results on the frequentist statistical properties of SGD estimators i.e., their asymptotic bias and asymptotic variance across multiple realizations of the dataset \({\varvec{Y}}\). We then leverage those results to study optimal learning rate sequences \(a_n\) (Sect. 3.4), the loss of statistical efficiency in SGD and ways to fix it through reparameterization (Sect. 3.3). We briefly discuss stability in Sect. 3.2. In Sect. 3.5, we present significant extensions to first-order SGD, namely averaged SGD, variants of second-order SGD, and Monte-Carlo SGD. Finally, in Sect. 4, we review significant applications of SGD in various areas of statistics and machine learning, namely in online EM, MCMC posterior sampling, reinforcement learning, and deep learning.

## 2 Stochastic approximations

### 2.1 Robbins and Monro’s procedure

Robbins and Monro (1951) introduced the *Robbins–Monro procedure*, in which an estimate \(\theta _{n-1}\) of \(\theta _\star \) is utilized to sample one new data point \(y_n\) such that \(\mathbb {E}\left( y_n \mid \theta _{n-1}\right) = M(\theta _{n-1})\); the estimate is then updated according to the following simple rule:

- (a)
\((x-\theta _\star ) M(x)> 0\) for \(x\) in a neighborhood of \(\theta _\star \),

- (b)
\(\mathbb {E}\left( y_n^2 \mid \theta \right) < \infty \) for any \(\theta \), and

- (c)
\(\sum _{i=1}^\infty a_i= \infty \) and \(\sum _{i=1}^\infty a_i^2 < \infty \).

This work was later extended to *adaptive* stochastic approximation methods, such as the Venter process (Venter 1967), in which quantities that are important for the convergence of the stochastic process (e.g., the quantity \(M'(\theta _\star )\)) are also estimated along the way.

### 2.2 Sakrison’s recursive estimation method

Sakrison’s *recursive estimation method* was essentially one of the first *explicit* SGD methods proposed in the literature:

## 3 Estimation with stochastic gradient methods

We will refer to (10) and (11) as *second-order explicit SGD* and *second-order implicit SGD*, respectively. When \({\varvec{C}}_n = a_n {\varvec{I}}\), i.e., it is the scaled identity matrix for some sequence \(a_n >0\) satisfying the Robbins–Monro conditions, we will refer to (10) and (11) as *first-order explicit SGD* and *first-order implicit SGD*, respectively; in this case, definitions (10) and (11) are identical to definitions (2) and (3) in the introduction. In some cases, we will consider models in the exponential family under the *natural parameterization* with density

In this chapter, we will mainly consider a *frequentist* evaluation of SGD as a statistical estimation method, i.e., we will consider \(\varvec{\theta }_{n}^{\mathrm {sgd}}\) (or \(\varvec{\theta }_{n}^{\mathrm {im}}\)) to be an *estimator* of \(\varvec{\theta _\star }\), and we will focus on its bias and variance across multiple realizations of the dataset \({\varvec{Y}} = \{\varvec{y}_1, \varvec{y}_2, \ldots , \varvec{y}_n \}\), under the same model and parameter \(\varvec{\theta _\star }\).^{4}

### 3.1 Asymptotic bias and variance

The analysis of SGD typically distinguishes two phases: the *exploration phase* (or search phase) and the *convergence phase* (Amari 1998; Benveniste et al. 2012). In the exploration phase the iterates rapidly approach \(\varvec{\theta _\star }\), whereas in the convergence phase they jitter around \(\varvec{\theta _\star }\) within a ball of slowly decreasing radius. We will overview a typical analysis of SGD in the final convergence phase, in which we assume that a Taylor approximation in the neighborhood of \(\varvec{\theta _\star }\) is accurate (Murata 1998; Toulis et al. 2014). In particular, let \(\varvec{\mu }(\varvec{\theta }) = \mathbb {E}\left( \nabla \ell (\varvec{\theta }; \varvec{y}_{n}) \right) \), and assume that

For *any* specification of positive-definite \({\varvec{C}}_n\), the eigenvalues of \(({\varvec{I}} + {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1}\) are less than one, and thus implicit SGD is *unconditionally stable*; we discuss stability further in Sect. 3.2.

Asymptotic variance results similar to (18) were first studied in the stochastic approximation literature by Chung (1954) and Sacks (1958), followed by Fabian (1968) and several other authors (see also Ljung et al. 1992, Parts I, II), but not in the closed form of (18), as most analyses were not carried out in the context of recursive statistical estimation. Furthermore, Sakrison’s asymptotic efficiency result (Sakrison 1965) can be recovered by setting \({\varvec{C}}_n = (1/n) \varvec{\mathcal {I}}(\varvec{\theta }_{n-1})^{-1}\); in this case the asymptotic variance for both estimators is \((1/n) \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\), i.e., the optimal asymptotic efficiency of the maximum-likelihood estimator.

### 3.2 Stability issues

Both explicit and implicit SGD are *asymptotically stable*. However, they have significant differences in small-to-moderate samples. For simplicity, let us compare the two SGD procedures in their first-order formulation, where \({\varvec{C}}_n = a_n {\varvec{I}}\) and \(a_n = \alpha /n\) for some \(\alpha > 0\).

In explicit SGD, the eigenvalues of \({\varvec{P}}_{1}^{n}\) can be calculated as \(\lambda _i' = \prod _j (1-\alpha \lambda _i/j) = \mathcal {O}(n^{-\alpha \lambda _i})\), for \(0< \alpha \lambda _i <1\), where \(\lambda _i\) are the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). Thus, the magnitude of \({\varvec{P}}_{1}^{n}\) will be dominated by \({\lambda }_{\max }\), the maximum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\), and the rate of convergence to zero will be dominated by \(\lambda _{\min }\), the minimum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). The condition \(\alpha {\lambda }_{\max }\le 1 \Rightarrow \alpha \le 1/{\lambda }_{\max }\) is required for stability, but for fast convergence we require \(\alpha \lambda _{\min }\approx 1\). In high-dimensional settings this can cause serious problems, because \({\lambda }_{\max }\) can be of the order of \(p\), the number of model parameters. Thus, in explicit SGD the requirements for stability and for speed of convergence are in conflict: a conservative learning rate sequence guarantees stability, but at a cost in convergence, which will be of order \(\mathcal {O}(n^{-\alpha \lambda _{\min }})\), and vice versa. In stark contrast, the implicit procedure is *unconditionally stable*. The eigenvalues of \({\varvec{Q}}_{1}^{n}\) are \(\lambda _i' = \prod _{j=1}^n 1 / (1+\alpha \lambda _i / j) =\mathcal {O}(n^{-\alpha \lambda _i})\), and thus are guaranteed to be less than one for any choice of the learning rate parameter \(\alpha \). The critical difference with explicit SGD is that a small \(\alpha \) is no longer required for stability, because the eigenvalues of \({\varvec{Q}}_{1}^{n}\) are always less than one.

Based on this analysis, the magnitude of \({\varvec{P}}_{1}^{n}\) can become arbitrarily large, and thus explicit SGD is liable to diverge numerically. In contrast, \({\varvec{Q}}_{1}^{n}\) is guaranteed to be bounded, so under any misspecification of the learning rate parameter the implicit SGD procedure remains stable. The instability of explicit SGD is well known and requires careful work to avoid in practice. For example, a typical learning rate for explicit SGD has the form \(a_n = \alpha (\alpha \beta + n)^{-1}\), where \(\beta \) is chosen so that the explicit updates do not diverge; a reasonable choice is to set \(\beta = \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta _\star }))\) and \(\alpha \) close to \(1/\lambda _{\min }\). Such *explicit* normalization of the learning rates is not necessary in implicit SGD because, as shown in Eq. (4), the implicit update performs this normalization indirectly.
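The contrast can be seen numerically on a one-dimensional noiseless quadratic (a hypothetical example, with \(\lambda \) playing the role of an eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\)): when \(\alpha \lambda \gg 1\), the early explicit factors \(1 - \alpha \lambda /n\) exceed one in magnitude and the iterates blow up before recovering, while the implicit factors \(1/(1 + \alpha \lambda /n)\) always lie in \((0,1)\).

```python
def sgd_trajectories(lam=10.5, alpha=1.0, theta0=1.0, steps=20):
    """Explicit vs. implicit SGD on a noiseless quadratic with curvature lam.

    alpha * lam = 10.5 violates the explicit stability condition
    alpha * lam <= 1, so early explicit factors (1 - alpha*lam/n) have
    magnitude > 1, while implicit factors 1/(1 + alpha*lam/n) contract.
    """
    th_ex = th_im = theta0
    ex_path, im_path = [], []
    for n in range(1, steps + 1):
        a_n = alpha / n
        th_ex *= 1.0 - a_n * lam   # explicit: can amplify the error
        th_im /= 1.0 + a_n * lam   # implicit: always shrinks the error
        ex_path.append(th_ex)
        im_path.append(th_im)
    return ex_path, im_path
```

In this run the explicit iterates transiently exceed one hundred times their starting value; with stochastic gradients, such a transient typically causes numerical divergence.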

### 3.3 Choice of parameterization and efficiency

This is known as the *mean-value parameterization*, and \(\varvec{\omega }\) as the mean-value parameters. Starting with an estimate \(\varvec{\omega }_0\) of \(\varvec{\omega }_\star = \varvec{h}(\varvec{\theta _\star })\), we can define the SGD procedures on this new parameter space as

*Example*. Consider the problem of estimating \((\mu , \sigma ^2)\) from normal observations \(y_n \sim \mathcal {N}(\mu , \sigma ^2)\), and let \(\varvec{\theta _\star }= (\mu , \sigma ^2)\) which is not the natural parameterization. Consider sufficient statistics \(\varvec{s}(\varvec{y}_n) = (y_n, y_n^2)\) such that \(\mathbb {E}\left( \varvec{s}(\varvec{y}_n) \right) = (\mu , \mu ^2 + \sigma ^2) \triangleq (\omega _1, \omega _2)\). The parameter \(\varvec{\omega } = (\omega _1, \omega _2)\) corresponds to the mean-value parameterization. The inverse transformation is \(\mu = \omega _1\) and \(\sigma ^2 = \omega _2 - \omega _1^2\), and thus its Jacobian is

### 3.4 Choice of learning rate sequence

^{5}However, this requires knowledge of the Fisher information matrix at the true parameters \(\varvec{\theta _\star }\), which is usually unknown. The Venter process (Venter 1967) was the first method to follow an adaptive approach to estimate this matrix, and it was later analyzed and extended by several other authors (Fabian 1973; Lai and Robbins 1979; Amari et al. 2000; Bottou and Le Cun 2005). Adaptive methods that approximate the matrix \({\varvec{C}}_\star \) (e.g., through a Quasi-Newton scheme) have recently been applied with considerable success (Schraudolph et al. 2007; Bordes et al. 2009); see Sect. 3.5.2 for more details.

An alternative is to *reparametrize* the problem, apply SGD on the new parameter space, and then perform the inverse transformation, as in Sect. 3.3.

#### 3.4.1 Practical considerations

There is a voluminous research literature on learning rate sequences for stochastic approximation and SGD. However, we discuss this issue at the end of this section because the choice of the learning rate sequence conflates multiple design goals that usually conflict in practice, e.g., convergence (or bias), asymptotic variance, and stability.

In general, the theory presented so far indicates that the learning rate for first-order explicit SGD should be of the form \(a_n = \alpha (\alpha \beta + n)^{-1}\). Note that \(\lim _{n \rightarrow \infty } n a_n = \alpha \), so \(\alpha \) is indeed the learning rate parameter introduced in Sect. 1. The parameter \(\alpha \) controls the asymptotic variance, and a reasonable choice is the solution of (28), which requires estimates of the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). An easier method is to simply use \(\alpha = 1/\lambda _{\min }\), where \(\lambda _{\min }\) is the minimum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\); the value \(1/\lambda _{\min }\) is an approximate solution of (28) and also has good empirical performance (Xu 2011; Toulis et al. 2014). The parameter \(\beta \) can be used to stabilize explicit SGD. In particular, one would want to control the variance of the stochastic gradient, \(\mathrm {Var}\left( \nabla \ell (\varvec{\theta }_n; \varvec{y}_n)\right) = \varvec{\mathcal {I}}(\varvec{\theta _\star }) + \mathcal {O}(a_n)\), for points near \(\varvec{\theta _\star }\); see also the stability analysis in Sect. 3.2. One reasonable value is thus \(\beta = \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta _\star }))\), which can be estimated easily by summing squared norms of the score function, i.e., \(\hat{\beta } = \sum _{i=1}^n ||\nabla \ell (\varvec{\theta }_i; \varvec{y}_i)||^2\), similar to Amari et al. (2000) and Duchi et al. (2011); see also Sect. 3.5.2.

For implicit SGD the situation is easier, because a learning rate sequence \(a_n = \alpha (\alpha + n)^{-1}\) works well in practice (Toulis et al. 2014). As before, \(\alpha \) controls the efficiency of the method, so we can set \(\alpha = 1/\lambda _{\min }\) as in explicit SGD. The additional stability term \(\beta \) of explicit SGD is unnecessary because the implicit method performs such normalization (shrinkage) indirectly; see Eq. (4).

However, tuning the learning rate sequence eventually depends on problem-specific considerations, and there is a considerable variety of sequences that have been employed in practice (George and Powell 2006). Principled design of learning rates in SGD remains an important research topic (Schaul et al. 2012).

### 3.5 Some interesting extensions

#### 3.5.1 Averaged stochastic gradient descent

Estimation with SGD can be optimized for statistical efficiency only with knowledge of the underlying model. For example, the optimal learning rate parameter \(\alpha \) in first-order SGD requires knowledge of the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). In second-order SGD, optimality is achieved when one uses a sequence of matrices \({\varvec{C}}_n\) such that \(n{\varvec{C}}_n \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\). Methods that approximate \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) make up a significant class of methods in stochastic approximation. Another important class of stochastic approximation methods relies on *averaging* of the iterates. The corresponding SGD procedure is usually referred to as *averaged SGD*, or *ASGD* for short.^{6}

Polyak and Juditsky (1992) derived further significant results for averaged SGD, showing in particular that ASGD can be as asymptotically efficient as second-order SGD under certain mild assumptions. In fact, due to the authors’ prior work in averaged stochastic approximation, ASGD is usually referred to as the *Polyak–Ruppert* averaging scheme. Adoption of averaging schemes for statistical learning has been slow but steady over the years (Zhang 2004; Nemirovski et al. 2009; Bottou 2010; Cappé 2011). One practical reason is that averaging only helps when the underlying stochastic process is slow to converge, which is hard to know in practice; in fact, averaging can have an adverse effect when the underlying SGD process is converging well. Furthermore, the selection of the learning rate sequence is also important in ASGD, and a bad sequence can cause the algorithm to converge very slowly (Xu 2011), or even diverge. Research on ASGD is still ongoing, as several directions remain unexplored, such as the combination of stable methods (e.g., stochastic proximal methods, implicit SGD) with averaging schemes. Furthermore, in a similar line of work, several methods have been developed that use averaging to reduce the variance of stochastic gradients (Johnson and Zhang 2013; Wang et al. 2013).

#### 3.5.2 Second-order stochastic gradient descent

Sakrison’s recursive estimation method (9) is the archetype of second-order SGD, but it requires an expensive matrix inversion at every iteration. Several methods have been developed that approximate such a matrix across iterations in stochastic approximation, and these are generally termed *adaptive*. Early adaptive methods in stochastic approximation were given by Nevelson and Khasminskiĭ (1973) and Wei (1987); translated into an SGD procedure, such methods would recursively estimate \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) by computing finite differences \(\varvec{y}_{n, +}^j - \varvec{y}_{n, -}^j\) sampled at \(\varvec{\theta }_n+c_n \varvec{e}_j\) and \(\varvec{\theta }_n - c_n \varvec{e}_j\), respectively, where \(\varvec{e}_j\) is the \(j\)th unit basis vector and \(c_n\) is an appropriate sequence of positive numbers. While such methods are very useful in sequential experiment design, where one has control over the data generation process, they are impractical for modern online learning problems.

A notable adaptive method, *SGD-QN* (Bordes et al. 2009), approximates the Fisher information matrix through a secant condition, as in the original BFGS algorithm (Broyden 1965). The secant condition in SGD-QN is

Another notable adaptive method is *AdaGrad* (Duchi et al. 2011), which adapts multiple learning rates using gradient information. In one popular variant of the method, AdaGrad keeps a diagonal \((p \times p)\) matrix \({\varvec{A}}_n\) of learning rates that are updated at every iteration. Upon observing data \(\varvec{y}_n\), AdaGrad updates \({\varvec{A}}_n\) as follows:

The elements of \({\varvec{A}}_n\) can be interpreted as the *statistical information* that has been gathered so far for the parameter of interest \(\varvec{\theta _\star }\). The intuition behind a rate of the form \(a_n = \alpha / n\) is that the information after \(n\) iterations is proportional to \(n\), under the i.i.d. data assumption. In high dimensions, where some parameter components affect outcomes less frequently than others, AdaGrad replaces the term \(n\) with an *estimate* of the information that has *actually* been received for each component. A (biased) estimate of this information is provided by the elements of \({\varvec{A}}_n\) in (36), and is justified since \(\mathbb {E}\left( \nabla \ell (\varvec{\theta }; \varvec{y}_n) \nabla \ell (\varvec{\theta }; \varvec{y}_n)^\intercal \right) = \varvec{\mathcal {I}}(\varvec{\theta })\). Interestingly, implicit SGD and AdaGrad share the common property of shrinking explicit SGD estimates according to the Fisher information matrix. Second-order implicit SGD methods are yet to be explored, but further connections are possible.
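A sketch of the diagonal variant follows (written as gradient ascent on the log-likelihood, to match the chapter's conventions; the inputs are hypothetical): each coordinate accumulates its own squared-gradient sum \(A_j\), an estimate of the information actually received, and takes steps of size \(\alpha /\sqrt{A_j}\).

```python
def adagrad(grads, alpha=0.5, p=2, eps=1e-8):
    """Diagonal AdaGrad sketch: coordinate j accumulates A_j = sum of
    squared gradients and is updated with step alpha * g_j / sqrt(A_j),
    so rarely-updated coordinates keep larger effective learning rates.
    """
    theta = [0.0] * p
    A = [0.0] * p
    for g in grads:
        for j in range(p):
            A[j] += g[j] ** 2
            if A[j] > 0.0:
                theta[j] += alpha * g[j] / (A[j] ** 0.5 + eps)
    return theta
```

Feeding a constant gradient to one coordinate and zero to the other shows the effect: the active coordinate's step sizes decay like \(1/\sqrt{n}\), while the idle coordinate's accumulated information stays at zero.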

#### 3.5.3 Monte-Carlo stochastic gradient descent

A key requirement for the application of SGD procedures is that the likelihood is easy to evaluate. However, in many situations that are important in practice, this is not possible, for example when the likelihood is only known up to a normalizing constant. In such cases, definitions (10) and (11) cannot be applied directly since \(\nabla \ell (\varvec{\theta }; \varvec{y}_n)\) cannot be computed. However, if unbiased samples of the log-likelihood gradients are available, then explicit SGD can be readily applied. This is possible if sampling from the model is relatively easy.

Such a procedure, termed *Monte-Carlo* SGD (Toulis and Airoldi 2014), can be constructed as follows. Starting from some estimate \(\varvec{\theta }^{\mathrm {mc}}_0\), iterate the following steps for each \(n\)th data point \(\varvec{y}_n\), where \(n=1, 2, \ldots ,N\):

- 1.
Get \(m\) samples from the model \(\widetilde{\varvec{y}_i} \sim f(\cdot ; \varvec{\theta }^{\mathrm {mc}}_{n-1})\), \(i = 1, 2, \ldots , m\).

- 2.
Compute average sufficient statistic \(\widetilde{\varvec{s}_n}=(1/m) \sum _{i=1}^m \varvec{s}(\widetilde{\varvec{y}_i})\).

- 3.
Update the estimate through
$$\begin{aligned} \varvec{\theta }^{\mathrm {mc}}_n = \varvec{\theta }^{\mathrm {mc}}_{n-1} + {\varvec{C}}_n (\varvec{s}(\varvec{y}_{n})- \widetilde{\varvec{s}_n}). \end{aligned}$$(37)
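The three steps can be sketched as follows; the normal-mean model with \(s(y) = y\) and \({\varvec{C}}_n = (\alpha /n){\varvec{I}}\) are illustrative assumptions of this sketch. In this toy model \(\mathbb {E}(s(y); \theta ) = \theta \) is available in closed form, and the sampling stands in for models where it is not.

```python
import random

def monte_carlo_sgd(ys, m=10, alpha=1.0, theta0=0.0, seed=0):
    """Monte-Carlo SGD for the mean of N(theta, 1), with sufficient
    statistic s(y) = y.  Each iteration imputes the model expectation of
    s by the average of m draws from the current model fit.
    """
    rng = random.Random(seed)
    theta = theta0
    for n, y in enumerate(ys, start=1):
        draws = [rng.gauss(theta, 1.0) for _ in range(m)]  # step 1: sample the model
        s_tilde = sum(draws) / m                           # step 2: average statistic
        theta += (alpha / n) * (y - s_tilde)               # step 3: C_n = (alpha/n) I
    return theta
```

In this toy model the Monte-Carlo averaging inflates the asymptotic variance by roughly a factor \(1 + 1/m\), reflecting the extra sampling noise.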

Steps 1 and 2 effectively *impute* the expected value of the sufficient statistic that would otherwise be available if the likelihood were easy to evaluate. Furthermore, assuming \(n {\varvec{C}}_n \rightarrow {\varvec{C}}\), the asymptotic variance of the estimate satisfies

Theoretically, Monte-Carlo SGD is based on *sampling-controlled* stochastic approximation methods, in which the usual regression function of the Robbins–Monro procedure (6) is only accessible through sampling (Dupuis and Simha 1991), e.g., through MCMC. Convergence in such settings is subtle because it also depends on the ergodicity of the underlying Markov chain (Younes 1999). In practice, approximate variants of the aforementioned Monte-Carlo SGD procedure have been applied with considerable success to fit large neural network models, notably through the contrastive divergence algorithm, which we briefly discuss in Sect. 4.4.

## 4 Selected applications

SGD has found several important applications over the years. In this section we will review some of them, giving a preference to breadth over depth.

### 4.1 Online EM algorithm

### 4.2 MCMC sampling

- 1.
Sample \(\varvec{p}^* \sim \mathcal {N}(\varvec{0}, {\varvec{M}}^{-1})\).

- 2.
Using *Hamiltonian dynamics*, compute \((\varvec{\theta }_n, \varvec{p}_n) = \mathrm {ODE}(\varvec{\theta }_{n-1}, \varvec{p}^*)\).

- 3.
Perform a typical Metropolis–Hastings step for the proposed transition \((\varvec{\theta }_{n-1}, \varvec{p}^*) \rightarrow (\varvec{\theta }_n, \varvec{p}_n)\), with acceptance probability \(\min \left[ 1, \exp \left( -H(\varvec{\theta }_n, \varvec{p}_n) + H(\varvec{\theta }_{n-1}, \varvec{p}^*)\right) \right] \).

Here, \(\varvec{\theta }\) represents the *position* of the system, and \(\varvec{p}\) is the *momentum*. The Hamiltonian dynamics refer to a set of ordinary differential equations (ODE) that govern the movement of the system, and thus calculate the future values of \((\varvec{\theta }, \varvec{p})\) given a pair of current values. Because the system is closed, its *Hamiltonian* is constant; hence, in Step 3 of HMC, \(-H(\varvec{\theta }_n, \varvec{p}_n) + H(\varvec{\theta }_{n-1}, \varvec{p}^*) = 0\), and the acceptance probability is one.

A variant of HMC based on *Langevin dynamics* defines the sampling iterations as follows (Girolami and Calderhead 2011):

In practice, the Hamiltonian differential equations are solved numerically by the *leapfrog method*. The parameter \(\epsilon > 0\) determines the size of the leapfrog step in this numerical solution.

The gradient is computed on a *mini-batch* of \(b\) samples, as usually employed in SGD to reduce noise in the stochastic gradients. Notably, Sato and Nakagawa (2014) proved that procedure (46) converges to the true posterior \(f(\varvec{\theta }| {\varvec{Y}})\) through an elegant use of stochastic calculus. Sampling through stochastic gradient Langevin dynamics has since generated significant work in MCMC sampling for very large datasets, and it remains a rapidly expanding research area with contributions from various disciplines (Hoffman et al. 2013; Pillai and Smith 2014; Korattikara et al. 2014).
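A minimal sketch of stochastic gradient Langevin dynamics follows; the conjugate normal-mean model, the flat prior, and all tuning constants are illustrative assumptions of this sketch. Each step adds injected Gaussian noise of variance \(\epsilon \) to a mini-batch gradient step on the log-posterior.

```python
import math
import random

def sgld_normal_mean(data, n_iter=2000, eps=1e-3, batch=32, seed=0):
    """Stochastic gradient Langevin dynamics for the mean of N(theta, 1)
    under a flat prior.  The mini-batch score is rescaled by N / batch
    so it is an unbiased estimate of the full-data gradient.
    """
    rng = random.Random(seed)
    N = len(data)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        mb = [data[rng.randrange(N)] for _ in range(batch)]
        grad = (N / batch) * sum(y - theta for y in mb)  # unbiased score estimate
        theta += 0.5 * eps * grad + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples
```

For this model the posterior is approximately \(\mathcal {N}(\bar{y}, 1/N)\), and with the fixed, small step size above the chain's long-run mean and spread match it closely; a decreasing step-size schedule is what the convergence theory cited above actually requires.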

### 4.3 Reinforcement learning

Reinforcement learning is the multidisciplinary study of how autonomous agents perceive, learn, and interact with their environment (Bertsekas and Tsitsiklis 1995). Typically, it is assumed that time \(t\) proceeds in discrete steps and that at every step an *agent* is at state \(\varvec{x}_t \in \mathcal {X}\), where \(\mathcal {X}\) is some state space. Upon entering a state \(\varvec{x}_t\), two things happen. First, the agent receives a probabilistic *reward* \(R(\varvec{x}_t) \in \mathbb {R}\), and then it takes an *action* \(a \in \mathcal {A}\), where \(\mathcal {A}\) denotes the action space. This action is determined by the agent’s *policy*, a function \(\pi : \mathcal {X} \rightarrow \mathcal {A}\) mapping a state to an action. Nature then decides a *transition* to state \(\varvec{x}_{t+1}\) through a density \(p(\varvec{x}_{t+1} | \varvec{x}_t)\) that is unknown to the agent.

Central to reinforcement learning is the *value function* \(V^\pi (\varvec{x})\), which quantifies the expected value of a specific state \(\varvec{x} \in \mathcal {X}\) for an agent. This is defined as

A popular approach is the *linear value approximation* \(V(\varvec{x}) = \varvec{\theta _\star }^\intercal \phi (\varvec{x})\), where \(\phi (\varvec{x})\) maps a state to *features* in a space with fewer dimensions, and \(\varvec{\theta _\star }\) is a vector of fixed parameters. If an agent is at state \(\varvec{x}_t\), then the recursive equation (48) can be rewritten as

This leads to the *temporal differences* (TD) learning algorithm (Sutton 1988). Implicit versions of this algorithm have recently emerged to address some of the known stability issues of the classical TD algorithm (Schapire and Warmuth 1996; Li 2008; Wang and Bertsekas 2013; Tamar et al. 2014). For example, Tamar et al. (2014) consider computing the term \(\varvec{\theta }_t^\intercal \varvec{\phi }_t\) at the future iterate, and thus the resulting *implicit* TD algorithm is
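A sketch of an implicit TD(0) update with linear features follows; the closed-form solve along the current feature direction is a reconstruction of the idea, under the assumption that only the prediction \(\varvec{\theta }_t^\intercal \varvec{\phi }_t\) is evaluated at the future iterate, and is not taken verbatim from Tamar et al. (2014).

```python
def implicit_td0(transitions, gamma=0.9, alpha=1.0, p=2):
    """Implicit TD(0) with linear value approximation V(x) = theta^T phi(x).

    Each transition is (phi, reward, phi_next).  The TD error uses the
    *next* iterate's prediction theta_t^T phi, which is solvable in
    closed form because the update only moves theta along phi.
    """
    theta = [0.0] * p
    for t, (phi, r, phi_next) in enumerate(transitions, start=1):
        a_t = alpha / t
        target = r + gamma * sum(th * f for th, f in zip(theta, phi_next))
        norm2 = sum(f * f for f in phi)
        pred_old = sum(th * f for th, f in zip(theta, phi))
        # Solve pred_new = pred_old + a_t * (target - pred_new) * norm2:
        pred_new = (pred_old + a_t * target * norm2) / (1.0 + a_t * norm2)
        theta = [th + a_t * (target - pred_new) * f for th, f in zip(theta, phi)]
    return theta
```

On a single self-transitioning state with reward one and \(\gamma = 0.9\), the fixed point is \(V = 1/(1-\gamma ) = 10\), and the recursion tolerates a large \(\alpha \) without diverging, mirroring the stability of implicit SGD.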

### 4.4 Deep learning

Deep learning is the task of estimating parameters of statistical models that can be represented by multiple layers of nonlinear operations, such as neural networks (Bengio 2009). Such models, also referred to as *deep architectures*, consist of *units* that can perform a basic prediction task, and are grouped in layers such that the output of one layer forms the input of another layer that sits directly on top. Furthermore, in most situations the models are augmented with *latent units* that are defined to represent structured quantities of interest, such as edges or shapes in an image.

An algorithm known as *contrastive divergence* (Hinton 2002; Carreira-Perpinan and Hinton 2005) has been applied to training such models with considerable success. The algorithm proceeds as follows for steps \(i=1, 2,\ldots \):

- 1.
Sample one state \(\varvec{y}^{(i)}\) from the empirical distribution of observed states.

- 2.
Sample \(\varvec{x}^{(i)} | \varvec{y}^{(i)}\), i.e., the hidden state.

- 3.
Sample \(\varvec{y}^{(i, \mathrm {new})} | \varvec{x}^{(i)}\).

- 4.
Sample \(\varvec{x}^{(i, \mathrm {new})} | \varvec{y}^{(i, \mathrm {new})}\).

- 5.
Evaluate the gradient (53) using \((\varvec{x}^{(i)}, \varvec{y}^{(i)})\) for the second term, and the sample \((\varvec{x}^{(i, \mathrm {new})}, \varvec{y}^{(i, \mathrm {new})})\) for the first term.

- 6.
Update the parameters \(\varvec{\theta }\) using constant-step-size SGD and the estimated gradient from Step 5.
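The steps above can be sketched for the special case of a restricted Boltzmann machine with binary units and no bias terms, where \(\varvec{\theta }\) is a weight matrix \(W\) and the conditionals are componentwise logistic. This is a minimal illustration, not the general algorithm: the names are hypothetical, Step 1 (drawing an observed state) happens outside the function, and the sign of the gradient follows the ascent convention (data term minus reconstruction term), which may differ from the convention of Eq. (53).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, y, lr=0.01):
    """One contrastive-divergence (CD-1) update for a bias-free binary RBM.

    W : (hidden, visible) weight matrix; y : one observed visible state
    (Step 1 of the algorithm, sampled from the data by the caller).
    """
    # Step 2: sample the hidden state x | y
    x = (rng.random(W.shape[0]) < sigmoid(W @ y)).astype(float)
    # Step 3: sample the reconstructed visible state y_new | x
    y_new = (rng.random(W.shape[1]) < sigmoid(W.T @ x)).astype(float)
    # Step 4: sample the hidden state x_new | y_new
    x_new = (rng.random(W.shape[0]) < sigmoid(W @ y_new)).astype(float)
    # Step 5: gradient estimate = data term - reconstruction term
    grad = np.outer(x, y) - np.outer(x_new, y_new)
    # Step 6: constant-step-size SGD (ascent on the log-likelihood)
    return W + lr * grad
```

Running `cd1_step` repeatedly over observed states implements the constant-step-size SGD scheme of Step 6; the single Gibbs sweep (Steps 3–4) is what makes the gradient cheap relative to a full MCMC estimate of the model term.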

## Footnotes

- 1.
Second-order methods typically use the Hessian matrix of second-order derivatives of the log-likelihood and are discussed in detail in Sect. 3.

- 2.
Procedure (2) is actually an ascent algorithm because it aims to maximize the log-likelihood, and thus a more appropriate name would be stochastic gradient ascent. However, we will use the term “descent” in order to keep in line with the relevant optimization literature, which traditionally considers minimization problems through descent algorithms.

- 3.
The solution of the fixed-point equation (3) requires additional computations per iteration. However, Toulis et al. (2014) derive a computationally efficient implicit algorithm in the context of generalized linear models. Furthermore, approximate solutions of implicit updates are possible for any statistical model (see Eq. (4)).

- 4.
This is an important distinction because, traditionally, the focus in optimization has been to obtain fast convergence to some point \(\widehat{\varvec{\theta }}\) that minimizes the empirical loss, e.g., the maximum-likelihood estimator. From a statistical viewpoint, under variability of the data, there is a trade-off between convergence to an estimator and its asymptotic variance (Le et al. 2004).

- 5.
Similarly, a sequence of matrices \({\varvec{C}}_n\) can be designed such that \({\varvec{C}}_n \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\) (Sakrison 1965).

- 6.
The acronym ASGD is also used in machine learning to denote *asynchronous* SGD, i.e., a variant of SGD that can be parallelized on multiple machines. We will not consider this variant here.
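The fixed-point computation mentioned in footnote 3 can be made concrete. For a generalized linear model the implicit update moves along the data vector \(\varvec{x}_n\), so it reduces to a one-dimensional root-finding problem. The sketch below is illustrative only: logistic regression stands in for the general GLM, the names are hypothetical, and plain bisection stands in for the specialized search derived by Toulis et al. (2014).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def implicit_sgd_step(theta, x, y, gamma):
    """One implicit SGD update for logistic regression.

    The implicit update theta_n = theta_{n-1} + gamma*(y - h(x' theta_n))*x
    lies on the line theta_{n-1} + lam*x, so it suffices to solve the scalar
    fixed point  lam = gamma * (y - h(x' theta_{n-1} + lam * ||x||^2)).
    """
    norm2 = x @ x
    pred = x @ theta
    g = lambda lam: lam - gamma * (y - sigmoid(pred + lam * norm2))
    # g is increasing with a root in [gamma*(y - 1), gamma*y]
    lo, hi = gamma * (y - 1.0), gamma * y
    for _ in range(60):  # plain bisection on the bracketing interval
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return theta + lam * x
```

Because the scalar search is over an interval of width \(\gamma _n\), the extra cost per iteration is a handful of inner products, which is what keeps the implicit update practical.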

## Notes

### Acknowledgments

The authors wish to thank Leon Bottou, Bob Carpenter, David Dunson, Andrew Gelman, Brian Kulis, Xiao-Li Meng, Natesh Pillai, Neil Shephard, Daniel Sussman and Alexander Volfovsky for useful comments and discussion. This research was sponsored, in part, by NSF CAREER award IIS-1149662, ARO MURI award W911NF-11-1-0036, and ONR YIP award N00014-14-1-0485. PT is a Google Fellow in Statistics. EMA is an Alfred P. Sloan Research Fellow.

### References

- Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. **10**(2), 251–276 (1998)
- Amari, S.-I., Park, H., Kenji, F.: Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Comput. **12**(6), 1399–1409 (2000)
- Bather, J.A.: Stochastic Approximation: A Generalisation of the Robbins–Monro Procedure, vol. 89. Cornell University, Mathematical Sciences Institute, New York (1989)
- Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. **31**(3), 167–175 (2003)
- Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. **2**, 1–127 (2009)
- Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. **21**(6), 1601–1621 (2009)
- Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, New York (2012)
- Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-dynamic programming: an overview. In: Proceedings of the 34th IEEE Conference on Decision and Control, vol. 1, pp. 560–564 (1995)
- Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. **10**, 1737–1754 (2009)
- Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010, pp. 177–186. Springer, New York (2010)
- Bottou, L., Le Cun, Y.: On-line learning for very large data sets. Appl. Stoch. Models Bus. Ind. **21**(2), 137–151 (2005)
- Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. Adv. Neural Inf. Process. Syst. **20**, 161–168 (2008)
- Broyden, C.G.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. **19**, 577–593 (1965)
- Cappé, O.: Online EM algorithm for hidden Markov models. J. Comput. Graph. Stat. **20**(3), 728–749 (2011)
- Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B **71**(3), 593–613 (2009)
- Carreira-Perpinan, M.A., Hinton, G.E.: On contrastive divergence learning. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 33–40 (2005)
- Cheng, L., Vishwanathan, S.V.N., Schuurmans, D., Wang, S., Caelli, T.: Implicit online learning with kernels. In: Advances in Neural Information Processing Systems, vol. 19, p. 249. MIT Press, Cambridge (2007)
- Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. **25**, 463–483 (1954)
- Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B **39**, 1–38 (1977)
- Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. **12**, 2121–2159 (2011)
- Dupuis, P., Simha, R.: On sampling controlled stochastic approximation. IEEE Trans. Autom. Control **36**(8), 915–924 (1991)
- El Karoui, N.: Spectrum estimation for large dimensional covariance matrices using random matrix theory. Ann. Stat. **36**, 2757–2790 (2008)
- Fabian, V.: On asymptotic normality in stochastic approximation. Ann. Math. Stat. **39**, 1327–1332 (1968)
- Fabian, V.: Asymptotically efficient stochastic approximation; the RM case. Ann. Stat. **1**, 486–495 (1973)
- Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A **222**, 309–368 (1922)
- Fisher, R.A.: Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh (1925a)
- Fisher, R.A.: Theory of statistical estimation. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 22, pp. 700–725. Cambridge University Press, Cambridge (1925b)
- Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. **6**, 721–741 (1984)
- George, A.P., Powell, W.B.: Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Mach. Learn. **65**(1), 167–198 (2006)
- Girolami, M.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B **73**(2), 123–214 (2011)
- Gosavi, A.: Reinforcement learning: a tutorial survey and recent advances. INFORMS J. Comput. **21**(2), 178–192 (2009)
- Green, P.J.: Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B **46**, 149–192 (1984)
- Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2011)
- Hennig, P., Kiefel, M.: Quasi-Newton methods: a new direction. J. Mach. Learn. Res. **14**(1), 843–865 (2013)
- Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. **14**(8), 1771–1800 (2002)
- Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. **14**(1), 1303–1347 (2013)
- Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. **35**(1), 73–101 (1964)
- Huber, P.J.: Robust Statistics. Springer, New York (2011)
- Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. **26**, 315–323 (2013)
- Kivinen, J., Warmuth, M.K.: Additive versus exponentiated gradient updates for linear prediction. In: Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pp. 209–218 (1995)
- Kivinen, J., Warmuth, M.K., Hassibi, B.: The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. Signal Process. **54**(5), 1782–1793 (2006)
- Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis–Hastings budget. In: Proceedings of the 31st International Conference on Machine Learning, pp. 181–189 (2014)
- Kulis, B., Bartlett, P.L.: Implicit online learning. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 575–582 (2010)
- Lai, T.L., Robbins, H.: Adaptive design and stochastic approximation. Ann. Stat. **7**, 1196–1221 (1979)
- Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B **57**, 425–437 (1995)
- Lange, K.: Numerical Analysis for Statisticians. Springer, New York (2010)
- Le Cun, Y., Bottou, L.: Large scale online learning. Adv. Neural Inf. Process. Syst. **16**, 217 (2004)
- Lehmann, E.L., Casella, G.: Theory of Point Estimation, 2nd edn. Springer, New York (2003)
- Li, L.: A worst-case comparison between temporal difference and residual gradient with linear function approximation. In: Proceedings of the 25th International Conference on Machine Learning, pp. 560–567. ACM (2008)
- Liu, Z., Almhana, J., Choulakian, V., McGorman, R.: Online EM algorithm for mixture with application to internet traffic modeling. Comput. Stat. Data Anal. **50**(4), 1052–1071 (2006)
- Ljung, L., Pflug, G., Walk, H.: Stochastic Approximation and Optimization of Random Systems, vol. 17. Springer, New York (1992)
- Martin, R.D., Masreliez, C.: Robust estimation via stochastic approximation. IEEE Trans. Inf. Theory **21**(3), 263–271 (1975)
- Murata, N.: A Statistical Study of On-line Learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998)
- Nagumo, J.-I., Noda, A.: A learning method for system identification. IEEE Trans. Autom. Control **12**(3), 282–287 (1967)
- National Research Council: Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC (2013)
- Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368. Springer, New York (1998)
- Neal, R.M.: MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo (2011)
- Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)
- Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. **19**(4), 1574–1609 (2009)
- Nevelson, M.B., Khasminskiĭ, R.Z.: Stochastic Approximation and Recursive Estimation, vol. 47. American Mathematical Society, Providence (1973)
- Nowlan, S.J.: Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. Carnegie Mellon University, Pittsburgh (1991)
- Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. **1**(3), 123–231 (2013)
- Pillai, N.S., Smith, A.: Ergodicity of approximate MCMC chains with applications to large data sets. arXiv preprint http://arxiv.org/abs/1405.0182 (2014)
- Polyak, B.T., Tsypkin, Y.Z.: Adaptive algorithms of estimation (convergence, optimality, stability). Autom. Remote Control **3**, 74–84 (1979)
- Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. **30**(4), 838–855 (1992)
- Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. **22**, 400–407 (1951)
- Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. **14**(5), 877–898 (1976)
- Rosasco, L., Villa, S., Công Vũ, B.: Convergence of stochastic proximal gradient algorithm. arXiv preprint http://arxiv.org/abs/1403.5074 (2014)
- Ruppert, D.: Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering (1988)
- Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Working paper. http://web.stanford.edu/~eryu/papers/spi.pdf (2014)
- Sacks, J.: Asymptotic distribution of stochastic approximation procedures. Ann. Math. Stat. **29**(2), 373–405 (1958)
- Sakrison, D.J.: Efficient recursive estimation; application to estimating the parameters of a covariance function. Int. J. Eng. Sci. **3**(4), 461–483 (1965)
- Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning, pp. 791–798. ACM (2007)
- Sato, M.-A., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Comput. **12**(2), 407–432 (2000)
- Sato, I., Nakagawa, H.: Approximation analysis of stochastic gradient Langevin dynamics by using Fokker–Planck equation and Itô process. JMLR W&CP **32**(1), 982–990 (2014)
- Schapire, R.E., Warmuth, M.K.: On the worst-case analysis of temporal-difference learning algorithms. Mach. Learn. **22**(1–3), 95–121 (1996)
- Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. arXiv preprint http://arxiv.org/abs/1206.1106 (2012)
- Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 2, pp. 436–443. San Juan, Puerto Rico (2007)
- Slock, D.T.M.: On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Trans. Signal Process. **41**(9), 2811–2825 (1993)
- Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. **3**(1), 9–44 (1988)
- Tamar, A., Toulis, P., Mannor, S., Airoldi, E.: Implicit temporal differences. In: Neural Information Processing Systems, Workshop on Large-Scale Reinforcement Learning (2014)
- Taylor, G.W., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary latent variables. Adv. Neural Inf. Process. Syst. **19**, 1345–1352 (2006)
- Titterington, D.M.: Recursive parameter estimation using incomplete data. J. R. Stat. Soc. Ser. B **46**, 257–267 (1984)
- Toulis, P., Airoldi, E.M.: Implicit stochastic gradient descent for principled estimation with large datasets. arXiv preprint http://arxiv.org/abs/1408.2923 (2014)
- Toulis, P., Airoldi, E., Rennie, J.: Statistical analysis of stochastic gradient methods for generalized linear models. JMLR W&CP **32**(1), 667–675 (2014)
- Venter, J.H.: An extension of the Robbins–Monro procedure. Ann. Math. Stat. **38**, 181–190 (1967)
- Wang, C., Chen, X., Smola, A., Xing, E.: Variance reduction for stochastic gradient optimization. Adv. Neural Inf. Process. Syst. **26**, 181–189 (2013)
- Wang, M., Bertsekas, D.P.: Stabilization of stochastic iterative methods for singular and nearly singular linear systems. Math. Oper. Res. **39**(1), 1–30 (2013)
- Wei, C.Z.: Multivariate adaptive stochastic approximation. Ann. Stat. **3**, 1115–1130 (1987)
- Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)
- Xu, W.: Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint http://arxiv.org/abs/1107.2490 (2011)
- Younes, L.: On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics **65**(3–4), 177–228 (1999)
- Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. ACM (2004)