Statistics and Computing, Volume 25, Issue 4, pp 781–795

Scalable estimation strategies based on stochastic approximations: classical results and new insights


Abstract

Estimation with large amounts of data can be facilitated by stochastic gradient methods, in which model parameters are updated sequentially using small batches of data at each step. Here, we review early work and modern results that illustrate the statistical properties of these methods, including convergence rates, stability, and asymptotic bias and variance. We then overview modern applications where these methods are useful, ranging from an online version of the EM algorithm to deep learning. In light of these results, we argue that stochastic gradient methods are poised to become benchmark principled estimation procedures for large datasets, especially those in the family of stable proximal methods, such as implicit stochastic gradient descent.

Keywords

Maximum likelihood · Recursive estimation · Implicit stochastic gradient descent methods · Optimal learning rate · Asymptotic analysis · Big data

1 Introduction

Parameter estimation by optimization of an objective function, such as maximum likelihood and maximum a-posteriori, is a fundamental idea in statistics and machine learning (Fisher 1922; Lehmann and Casella 2003; Hastie et al. 2011). However, widely used optimization-based estimation algorithms, such as Fisher scoring, the EM algorithm, and iteratively reweighted least squares (Fisher 1925a; Dempster et al. 1977; Green 1984), are not scalable to modern datasets with hundreds of millions of data points and hundreds of thousands of covariates (National Research Council 2013).

To illustrate, let us consider the problem of estimating the true vector of parameters \(\varvec{\theta _\star }\in \mathbb {R}^{p}\) from an i.i.d. sample \({\varvec{Y}} = \{\varvec{y}_n\}\), for \(n=1,2,\ldots ,N\), where a data point \(\varvec{y}_n \in \mathbb {R}^{d}\) is distributed according to a density \(f(\varvec{y}_n; \varvec{\theta _\star })\) with log-likelihood function \(\ell (\varvec{\theta }; {\varvec{Y}}) = \sum _{n=1}^N \log f(\varvec{y}_n; \varvec{\theta })\). Traditional estimation methods are typically iterative and have a running-time complexity that ranges between \(\mathcal {O}(Np^3)\) and \(\mathcal {O}(Np)\), in worst cases and best cases, respectively. Newton–Raphson methods, for instance, update an estimate \(\varvec{\theta }_{n-1}^{\mathrm {nr}}\) of the parameters through the recursion
$$\begin{aligned} \varvec{\theta }_n^{\mathrm {nr}} = \varvec{\theta }_{n-1}^{\mathrm {nr}} - {\varvec{H}}^{-1}_{n-1}\nabla \ell (\varvec{\theta }_{n-1}^\mathrm {nr}; {\varvec{Y}}), \end{aligned}$$
(1)
where \({\varvec{H}}_n = \nabla \nabla \ell (\varvec{\theta }_n^{\mathrm {nr}}; {\varvec{Y}})\) is the \(p \times p\) Hessian matrix of the log-likelihood. The matrix inversion and the likelihood computation yield an algorithm with roughly \(\mathcal {O}(N p^{2+\epsilon })\) complexity, which makes it unsuitable for large datasets. Fisher scoring replaces the Hessian matrix with its expected value, i.e., it uses the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta }) = -\mathbb {E}\left( \nabla \nabla \ell (\varvec{\theta }; \varvec{y}_n) \right) \), where the expectation is over the random sample \(\varvec{y}_n\). The advantage of this method is that a steady increase in the likelihood is possible, as in the EM algorithm, since \(\varvec{\mathcal {I}}(\varvec{\theta })\) is positive-definite, and thus the difference
$$\begin{aligned} \ell (\varvec{\theta }+ \epsilon \Delta \varvec{\theta }; {\varvec{Y}}) -\ell (\varvec{\theta }; {\varvec{Y}}) \approx \epsilon \nabla \ell (\varvec{\theta }; {\varvec{Y}})^\intercal \varvec{\mathcal {I}}(\varvec{\theta })^{-1} \nabla \ell (\varvec{\theta }; {\varvec{Y}}) + \mathcal {O}(\epsilon ^2) \end{aligned}$$
can be made positive, with \(\Delta \varvec{\theta }= \varvec{\mathcal {I}}(\varvec{\theta })^{-1} \nabla \ell (\varvec{\theta }; {\varvec{Y}})\) denoting the Fisher scoring direction, for an appropriately small value \(\epsilon >0\). However, Fisher scoring performs very similarly to Newton–Raphson in practice, and the two algorithms are actually identical in the exponential family (Lange 2010). Furthermore, Fisher scoring is computationally comparable to Newton–Raphson and thus unsuited for problems with large datasets.

Quasi-Newton (QN) methods are a powerful alternative and are widely used in practice. In QN methods, the Hessian is approximated by a low-rank matrix that is updated at each iteration as new values of the gradient become available, thus yielding algorithms with complexity \(\mathcal {O}(Np^2)\) or \(\mathcal {O}(Np)\) in certain favorable cases (Hennig and Kiefel 2013). Other general estimation algorithms such as EM or iteratively reweighted least squares (Green 1984) involve computations (e.g., inversions or maximizations between iterations) that are significantly more expensive than QN methods.

However, estimation with massive datasets requires a running-time complexity that is roughly \(\mathcal {O}(N p^{1-\epsilon })\) i.e., that is linear in \(N\) but sublinear in the parameter dimension \(p\). The first requirement on \(N\) seems hard to overcome since an iteration over all data points needs to be performed, at least when data are i.i.d.; thus, sublinearity in \(p\) is crucial (Bousquet and Bottou 2008). Such computational requirements have recently sparked interest in algorithms that utilize only first-order information i.e., methods that utilize only gradient computations.1 Such performance is achieved by the stochastic gradient descent (SGD) algorithm, which was initially proposed by Sakrison (1965) as a recursive estimation method, albeit not in first-order form. A typical first-order SGD is defined by the iteration
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {sgd}} = \varvec{\theta }_{n-1}^{\mathrm {sgd}} + a_n \nabla \ell (\varvec{\theta }_{n-1}^{\mathrm {sgd}}; \varvec{y}_{n}). \end{aligned}$$
(2)
We will refer to Eq. (2) as SGD with explicit updates, or explicit SGD for short, because the next iterate \(\varvec{\theta }_{n}^{\mathrm {sgd}}\) can be computed immediately after the new data point \(\varvec{y}_n\) is observed.2 The sequence \(a_n>0\) is a carefully chosen learning rate sequence which is typically defined such that \(n a_n\rightarrow \alpha > 0\) as \(n \rightarrow \infty \). The parameter \(\alpha >0\) is the learning rate parameter, and it is crucial for the convergence and stability of explicit SGD.

From a computational perspective, the SGD procedure (2) is appealing because the expensive inversion of \(p\times p\) matrices, as in Newton–Raphson, is replaced by a single sequence \(a_n >0\). Furthermore, the log-likelihood is evaluated at a single observation \(\varvec{y}_n\), and not on the entire dataset \({\varvec{Y}}\). Necessarily this incurs information loss which is important to quantify. From a theoretical perspective the explicit SGD updates are justified because, under typical regularity conditions, \(\mathbb {E}\left( \nabla \ell (\varvec{\theta _\star }; \varvec{y}_{n}) \right) =0 \) and thus \(\varvec{\theta }_n\rightarrow \varvec{\theta _\star }\) by the properties of the Robbins–Monro procedure (Robbins and Monro 1951). However, the explicit SGD procedure requires careful tuning of the learning rate parameter; small values of \(\alpha \) will make the iteration (2) very slow to converge, whereas for large values of \(\alpha \) explicit SGD will either have a large asymptotic variance, or even diverge numerically. As a recursive estimation method, explicit SGD was first proposed by Sakrison (1965) and has attracted attention in the machine learning community as a fast prediction method for large-scale problems (Le et al. 2004; Zhang 2004).
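
To make the update (2) concrete, the following is a minimal Python sketch of first-order explicit SGD for a linear-normal model with unit noise variance, where \(\nabla \ell (\varvec{\theta }; (\varvec{x}_n, y_n)) = (y_n - \varvec{x}_n^\intercal \varvec{\theta }) \varvec{x}_n\); the simulated data, dimensions, and the choice \(\alpha = 1\) are illustrative assumptions rather than recommendations.

```python
# Minimal sketch of explicit SGD (Eq. 2) for a linear-normal model,
# where the log-likelihood gradient is (y_n - x_n' theta) * x_n.
# The data-generating setup and constants are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 5
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

alpha = 1.0                      # learning rate parameter
theta = np.zeros(p)              # starting point
for n in range(1, N + 1):
    a_n = alpha / n              # a_n chosen so that n * a_n -> alpha
    x_n, y_n = X[n - 1], y[n - 1]
    grad = (y_n - x_n @ theta) * x_n   # score of the n-th observation
    theta = theta + a_n * grad         # explicit update (2)

print("explicit SGD estimate:", np.round(theta, 3))
print("true parameters:     ", np.round(theta_star, 3))
```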

In order to stabilize explicit SGD without sacrificing computational efficiency, Toulis et al. (2014) defined the implicit SGD procedure through the iteration
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {im}} = \varvec{\theta }_{n-1}^{\mathrm {im}} + a_n \nabla \ell (\varvec{\theta }_{n}^{\mathrm {im}}; \varvec{y}_{n}). \end{aligned}$$
(3)
Note that Eq. (3) is implicit because the next iterate \(\varvec{\theta }_{n}^{\mathrm {im}}\) appears on both sides of the equation.3 This simple tweak of the explicit SGD procedure has quite remarkable statistical properties. In particular, assuming a common starting point \(\varvec{\theta }_{n-1}^{\mathrm {sgd}} = \varvec{\theta }_{n-1}^{\mathrm {im}} \triangleq \varvec{\theta }\), one can show, through a simple Taylor approximation of (3) around \(\varvec{\theta }\), that the implicit update satisfies
$$\begin{aligned} \Delta \varvec{\theta }_{n}^{\mathrm {im}} = ({\varvec{I}} + a_n {\varvec{\hat{\mathcal {I}}}} (\varvec{\theta }; \varvec{y}_n))^{-1} \Delta \varvec{\theta }_{n}^{\mathrm {sgd}} + \mathcal {O}(a_n^2), \end{aligned}$$
(4)
where \(\Delta \varvec{\theta }_n = \varvec{\theta }_n- \varvec{\theta }_{n-1}\) for both methods, and \({\varvec{\hat{\mathcal {I}}}} (\varvec{\theta }; \varvec{y}_n) = -\nabla \nabla \ell (\varvec{\theta }; \varvec{y}_n)\) is the observed Fisher information matrix. Thus, the implicit SGD procedure calculates updates that are a shrunken version of the explicit ones. In contrast to explicit SGD, implicit SGD is significantly more stable in small samples, and it is also robust to misspecifications of the learning rate parameter \(\alpha \). Furthermore, implicit SGD computes iterates that lie in the support of the parameter space, whereas explicit SGD would normally require an additional projection step. Arguably, the normalized least mean squares (NLMS) filter (Nagumo and Noda 1967) was the first statistical model that used an implicit update as in Eq. (3), and it was shown to be consistent and robust to input noise (Slock 1993). Theoretical justification for implicit SGD comes either from implicit variations of the Robbins–Monro procedure (Toulis et al. 2014), or through proximal methods in optimization (Parikh and Boyd 2013), such as mirror-descent (Nemirovski 1983; Beck and Teboulle 2003). Assuming differentiability of the log-likelihood, the implicit SGD update (3) can be expressed as a proximal method through the solution of
$$\begin{aligned} \varvec{\theta }_{n}^{\mathrm {im}} = \arg \max _{\varvec{\theta }} \left\{ - \frac{1}{2} ||\varvec{\theta }- \varvec{\theta }_{n-1}^{\mathrm {im}}||^2 + a_n \ell (\varvec{\theta }; \varvec{y}_n) \right\} , \end{aligned}$$
(5)
where the right-hand side is the proximal operator. The update in Eq. (5) is the stochastic version of the deterministic proximal point algorithm by Rockafellar (1976), and has been analyzed recently, in various forms, for convergence and stability (Ryu and Boyd 2014; Rosasco et al. 2014). Recent work has established the consistency of certain implicit methods similar to (3) (Kivinen and Warmuth 1995; Kivinen et al. 2006; Kulis and Bartlett 2010) and their robustness has been useful in a range of modern machine learning problems (Nemirovski et al. 2009; Kulis and Bartlett 2010; Schuurmans and Caelli 2007).
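
For comparison, the implicit update (3) has a closed form for the same linear-normal model used in the previous sketch: solving the fixed-point equation shows that the residual is shrunk by the factor \(1/(1 + a_n ||\varvec{x}_n||^2)\), in line with Eq. (4). The sketch below uses a deliberately large learning rate parameter to illustrate the stability of the method; all numerical choices are again illustrative.

```python
# Minimal sketch of implicit SGD (Eq. 3) for a linear-normal model.
# Because the score is (y_n - x_n' theta) * x_n, the implicit equation
# can be solved exactly: the residual is shrunk by 1 / (1 + a_n ||x_n||^2).
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 5
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

alpha = 10.0                     # deliberately large; implicit SGD remains stable
theta = np.zeros(p)
for n in range(1, N + 1):
    a_n = alpha / n
    x_n, y_n = X[n - 1], y[n - 1]
    residual = (y_n - x_n @ theta) / (1.0 + a_n * x_n @ x_n)  # solves the implicit equation
    theta = theta + a_n * residual * x_n                      # implicit update (3)

print("implicit SGD estimate:", np.round(theta, 3))
print("true parameters:     ", np.round(theta_star, 3))
```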

The structure of this paper is as follows. In Sect. 2 we give an overview of the Robbins–Monro procedure and Sakrison’s recursive estimation method, which form the theoretical basis of SGD methods; we further provide a quick overview of early results on the statistical efficiency of the aforementioned methods. In Sect. 3, we formally introduce explicit and implicit SGD, and treat those procedures as statistical estimation methods that provide an estimator \(\varvec{\theta }_n\) of the model parameters \(\varvec{\theta _\star }\) after \(n\) iterations. In Sect. 3.1 we give results on the frequentist statistical properties of SGD estimators, i.e., their asymptotic bias and asymptotic variance across multiple realizations of the dataset \({\varvec{Y}}\). We then briefly discuss stability (Sect. 3.2), the loss of statistical efficiency in SGD and ways to fix it through reparameterization (Sect. 3.3), and optimal learning rate sequences \(a_n\) (Sect. 3.4). In Sect. 3.5, we present significant extensions to first-order SGD, namely averaged SGD, variants of second-order SGD, and Monte-Carlo SGD. Finally, in Sect. 4, we review significant applications of SGD in various areas of statistics and machine learning, namely in online EM, MCMC posterior sampling, reinforcement learning, and deep learning.

2 Stochastic approximations

2.1 Robbins and Monro’s procedure

Consider the one-dimensional setting where one data point is denoted by \(y_n \in \mathbb {R}\) and is controlled by a parameter \(\theta \) through a regression function \(M(\theta ) = \mathbb {E}(y \mid \theta )\) that is nondecreasing, and whose analytic form might be unknown. Robbins and Monro (1951) considered the problem of finding the unique point \(\theta _\star \) for which \(M(\theta _\star )=0\). They devised a procedure, known as the Robbins–Monro procedure, in which an estimate \(\theta _{n-1}\) of \(\theta _\star \) is utilized to sample one new data point \(y_n\) such that \(\mathbb {E}(y_n \mid \theta _{n-1}) = M(\theta _{n-1})\); the estimate is then updated according to the following simple rule:
$$\begin{aligned} \theta _n = \theta _{n-1} - a_n y_n. \end{aligned}$$
(6)
The scalar \(a_n > 0\) is the learning rate and should decay to zero, but not too fast in order to guarantee convergence. Robbins and Monro (1951) proved that \(\mathbb {E}\left( (\theta _n - \theta _\star )^2 \right) \rightarrow 0\) when
(a) \((x-\theta _\star ) M(x)> 0\) for \(x\) in a neighborhood of \(\theta _\star \),
(b) \(\mathbb {E}(y_n^2 \mid \theta ) < \infty \) for any \(\theta \), and
(c) \(\sum _{i=1}^\infty a_i= \infty \) and \(\sum _{i=1}^\infty a_i^2 < \infty \).
The original proof is technical, but the main idea is straightforward. Let \(b_n \triangleq \mathbb {E}\left( (\theta _n-\theta _\star )^2 \right) \) denote the squared error; then, through iteration (6), one can obtain
$$\begin{aligned} b_n = b_{n-1} -2a_n \mathbb {E}\left( (\theta _{n-1}-\theta _\star ) M(\theta _{n-1}) \right) + a_n^2 \mathbb {E}\left( y_n^2 \right) . \end{aligned}$$
(7)
In the neighborhood of \(\theta _\star \) we have \(M(\theta _{n-1}) \approx M'(\theta _\star ) (\theta _{n-1} - \theta _\star )\), and thus
$$\begin{aligned} b_n = (1-2 a_n M'(\theta _\star )) b_{n-1} + a_n^2 \mathbb {E}\left( y_n^2 \right) . \end{aligned}$$
(8)
For a learning rate sequence of the form \(a_n = \alpha /n\), typical proof techniques in stochastic approximation (Chung 1954) can establish that \(b_n \rightarrow 0\). Furthermore, it holds that \(n b_n \rightarrow \alpha ^2 \sigma ^2 (2\alpha M'(\theta _\star ) - 1)^{-1}\), where \(\sigma ^2 \triangleq \mathbb {E}(y_n^2 \mid \theta _\star )\), when this limit exists; this result was not given in the original paper by Robbins and Monro (1951), but it was soon derived by several other authors (Chung 1954; Sacks 1958; Fabian 1968). Thus, the learning rate parameter \(\alpha \) is critical for the performance of the Robbins–Monro procedure. Its optimal value is \(\alpha _\star = 1/M'(\theta _\star )\), which requires knowledge of the slope of \(M(\cdot )\) at the true parameter values. In the multidimensional case, the efficiency of stochastic approximations, including stochastic gradient descent, depends on the Jacobian of the mean-value function of the statistic used in the iterations (see Sect. 3.1). This early result spawned an important line of research on adaptive stochastic approximation methods, such as the Venter process (Venter 1967), in which quantities that are important for the convergence of the stochastic process (e.g., the quantity \(M'(\theta _\star )\)) are also being estimated along the way.
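
The following toy simulation of the Robbins–Monro iteration (6) may help fix ideas; the regression function \(M(\theta ) = \theta - \theta _\star \), the Gaussian noise, and the choice \(\alpha = 1/M'(\theta _\star ) = 1\) are illustrative assumptions for the sketch.

```python
# Toy simulation of the Robbins-Monro procedure (6), assuming a simple
# regression function M(theta) = theta - theta_star observed with Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
theta_star, alpha, n_iter = 2.0, 1.0, 50_000   # alpha = 1 / M'(theta_star)

theta = 0.0
for n in range(1, n_iter + 1):
    y_n = (theta - theta_star) + rng.normal()  # noisy observation with mean M(theta)
    theta = theta - (alpha / n) * y_n          # Robbins-Monro update (6)

print("estimate:", round(theta, 4), " truth:", theta_star)
```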

2.2 Sakrison’s recursive estimation method

Although initially applied in sequential experiment design, the Robbins–Monro procedure was soon adapted for estimation. Sakrison (1965) was interested in estimating the parameters \(\varvec{\theta _\star }\) of a model that generated i.i.d. observations \(\varvec{y} _n\) in a way that is computationally and statistically efficient, similar to our setup in the introduction. He recognized that the statistical identity \(\mathbb {E}\left( \nabla \ell (\varvec{\theta _\star }; \varvec{y}_n) \right) =0\), where the expectation is over the observed data \(\varvec{y}_n\), provides the theoretical basis for a general estimation method using the Robbins–Monro procedure. Sakrison’s recursive estimation method was essentially one of the first explicit SGD methods proposed in the literature:
$$\begin{aligned} \varvec{\theta }_n^{\mathrm {sak}} \approx \varvec{\theta }_{n-1}^{\mathrm {sak}} + (1/n) \varvec{\mathcal {I}}(\varvec{\theta }_{n-1}^{\mathrm {sak}})^{-1} \nabla \ell (\varvec{\theta }_{n-1}^{\mathrm {sak}}; \varvec{y}_n). \end{aligned}$$
(9)
The SGD procedure (9) is second-order since it is using a matrix to condition the gradient of the log-likelihood. Under typical regularity conditions \(\varvec{\theta }_n^{\mathrm {sak}} \rightarrow \varvec{\theta _\star }\), and thus \( \varvec{\mathcal {I}}(\varvec{\theta }_n^{\mathrm {sak}}) \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })\). Sakrison (1965) also proved that \(n \mathbb {E}\left( ||\varvec{\theta }_n^{\mathrm {sak}} - \varvec{\theta _\star }||^2 \right) \rightarrow \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1})\), and so the estimation of \(\varvec{\theta _\star }\) is asymptotically efficient under this norm objective. It is interesting to note that updates of the form (9) appeared very early in the statistical literature. For example, Fisher (1925b) suggested that an inefficient estimator \(\varvec{\theta }_N\) using \(N\) data points can be made asymptotically efficient by considering a new estimator \(\varvec{\theta }_N^+ = \varvec{\theta }_N + (1/N) \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1} \sum _{i=1}^N \nabla \ell (\varvec{\theta }_N; \varvec{y}_i)\). The surprising result in Sakrison’s work was that asymptotically optimal estimation is also possible by using only gradients of the log-likelihood on single data points \(\varvec{y}_i\) in the iterated algorithm (9).

3 Estimation with stochastic gradient methods

For the rest of this paper we will consider a simple generalization of explicit and implicit SGD, similar to Sakrison’s method, as follows:
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {sgd}} = \varvec{\theta }_{n-1}^{\mathrm {sgd}} + {\varvec{C}}_n \nabla \ell (\varvec{\theta }_{n-1}^{\mathrm {sgd}}; \varvec{y}_{n}), \end{aligned}$$
(10)
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {im}} = \varvec{\theta }_{n-1}^{\mathrm {im}} + {\varvec{C}}_n \nabla \ell (\varvec{\theta }_{n}^{\mathrm {im}}; \varvec{y}_{n}). \end{aligned}$$
(11)
In general all \({\varvec{C}}_n\) are symmetric and positive-definite matrices, and serve to stabilize and optimize stochastic iterations as in (10) and (11). In the limit, \(n {\varvec{C}}_n \rightarrow {\varvec{C}}\), where \({\varvec{C}}\) is a symmetric and positive-definite matrix. If \({\varvec{C}}_n\) is not trivial (e.g., not simply a scaled identity), we will refer to (10) and (11) as second-order explicit SGD and second-order implicit SGD, respectively. When \({\varvec{C}}_n = a_n {\varvec{I}}\), i.e., a scaled identity matrix for some sequence \(a_n >0\) satisfying the Robbins–Monro conditions, we will refer to (10) and (11) as first-order explicit SGD and first-order implicit SGD, respectively; in this case, definitions (10) and (11) are identical to definitions (2) and (3) in the introduction. In some cases, we will consider models in the exponential family under the natural parameterization with density
$$\begin{aligned} f(\varvec{y}_n; \varvec{\theta _\star }) = \exp \{ \varvec{\theta _\star }^\intercal \varvec{s}(\varvec{y}_{n})- A(\varvec{\theta _\star }) +B(\varvec{y}_n) \}, \end{aligned}$$
(12)
where \(\varvec{s}(\varvec{y}_{n})\) is the vector of \(p\) sufficient statistics, and \(A(\cdot ), B(\cdot )\) are appropriate real-valued functions. The SGD procedures simplify to
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {sgd}} = \varvec{\theta }_{n-1}^{\mathrm {sgd}} + {\varvec{C}}_n (\varvec{s}(\varvec{y}_{n})- \nabla A(\varvec{\theta }_{n-1}^{\mathrm {sgd}})), \end{aligned}$$
(13)
$$\begin{aligned}&\varvec{\theta }_{n}^{\mathrm {im}} = \varvec{\theta }_{n-1}^{\mathrm {im}} + {\varvec{C}}_n (\varvec{s}(\varvec{y}_{n})- \nabla A(\varvec{\theta }_{n}^{\mathrm {im}})). \end{aligned}$$
(14)
In what follows, we will consider a frequentist evaluation of SGD as a statistical estimation method i.e., we will consider \(\varvec{\theta }_{n}^{\mathrm {sgd}}\) (or \(\varvec{\theta }_{n}^{\mathrm {im}}\)) to be an estimator of \(\varvec{\theta _\star }\), and we will focus on its bias and variance across multiple realizations of the dataset \({\varvec{Y}} = \{\varvec{y}_1, \varvec{y}_2, \ldots , \varvec{y}_n \}\), under the same model and parameter \(\varvec{\theta _\star }\).4

3.1 Asymptotic bias and variance

Typically, online procedures such as SGD have two phases, namely the exploration phase (or search phase) and the convergence phase (Amari 1998; Benveniste et al. 2012). In the exploration phase the iterates rapidly approach \(\varvec{\theta _\star }\), whereas in the convergence phase they jitter around \(\varvec{\theta _\star }\) within a ball of slowly decreasing radius. We will overview a typical analysis of SGD in the final convergence phase in which we assume that a Taylor approximation in the neighborhood of \(\varvec{\theta _\star }\) is accurate (Murata 1998; Toulis et al. 2014). In particular let \(\varvec{\mu }(\varvec{\theta }) = \mathbb {E}\left( \nabla \ell (\varvec{\theta }; \varvec{y}_{n}) \right) \), and assume that
$$\begin{aligned} \varvec{\mu }(\varvec{\theta }_n)= \varvec{\mu }(\varvec{\theta _\star }) + {\varvec{J}}_{\mu }(\varvec{\theta _\star }) (\varvec{\theta }_n-\varvec{\theta _\star }) + o(a_n), \end{aligned}$$
(15)
where \({\varvec{J}}_\mu \) is the Jacobian of the function \(\varvec{\mu }(\cdot )\), and \(o(a_n)\) denotes a vector sequence \(\varvec{v}_n\) for which \(||\varvec{v}_n|| / a_n \rightarrow 0\). Under typical regularity conditions \(\varvec{\mu }(\varvec{\theta _\star }) = \varvec{0}\) and \({\varvec{J}}_\mu (\varvec{\theta _\star }) = -\varvec{\mathcal {I}}(\varvec{\theta _\star })\). Thus, if we denote the biases of the two SGD methods as \(\mathbb {E}(\varvec{\theta }_{n}^{\mathrm {sgd}} - \varvec{\theta _\star }) \triangleq \varvec{b}_{n}^{\mathrm {sgd}}\) and \(\mathbb {E}(\varvec{\theta }_{n}^{\mathrm {im}} - \varvec{\theta _\star }) \triangleq \varvec{b}_{n}^{\mathrm {im}}\), by taking expectations in Eqs. (10) and (11) we obtain
$$\begin{aligned}&\varvec{b}_{n}^{\mathrm {sgd}} = ({\varvec{I}} - {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star })) \varvec{b}_{n-1}^{\mathrm {sgd}} + o(a_n), \end{aligned}$$
(16)
$$\begin{aligned}&\varvec{b}_{n}^{\mathrm {im}} = ({\varvec{I}} + {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1} \varvec{b}_{n-1}^{\mathrm {im}} + o(a_n). \end{aligned}$$
(17)
We observe that the rate at which the two methods become unbiased in the limit differs in two significant ways. First, the explicit SGD method converges faster than the implicit one because \(||({\varvec{I}} - {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star }))|| < ||({\varvec{I}}+ {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1}||\), for sufficiently large \(n\); the rates become equal in the limit as \(a_n \rightarrow 0\). However, the implicit method compensates by being more stable with respect to the specification of the condition matrices \({\varvec{C}}_n\). For example, explicit SGD requires that the sequence \({\varvec{I}} - {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star })\) consists of matrices with eigenvalues less than one in absolute value, in order to guarantee stability; this is a significant source of trouble when applying explicit SGD in practice. In contrast, for any specification of positive-definite \({\varvec{C}}_n\), the eigenvalues of \(({\varvec{I}} + {\varvec{C}}_n \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1}\) are less than one, and thus implicit SGD is unconditionally stable; we discuss stability further in Sect. 3.2.
In regard to statistical efficiency, Taylor approximation can also be used to establish recursive equations for the asymptotic variance of \(\varvec{\theta }_{n}^{\mathrm {sgd}}\) and \(\varvec{\theta }_{n}^{\mathrm {im}}\). For example, Toulis et al. (2014) show that if \({\varvec{C}}\) is a symmetric matrix that commutes with \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) such that \((2 {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})\) is positive-definite and \(n {\varvec{C}}_n \rightarrow {\varvec{C}}\), it holds
$$\begin{aligned} n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {sgd}}) \rightarrow (2 {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1} {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) {\varvec{C}}^\intercal , \nonumber \\ n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {im}}) \rightarrow (2 {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1} {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) {\varvec{C}}^\intercal ; \end{aligned}$$
(18)
i.e., both SGD methods have the same asymptotic variance. Thus, for first-order SGD procedures where \({\varvec{C}}_n = a_n {\varvec{I}}\) with \(n a_n \rightarrow \alpha > 0\) we obtain
$$\begin{aligned} n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {sgd}}) \rightarrow \alpha ^2 (2 \alpha \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1} \varvec{\mathcal {I}}(\varvec{\theta _\star }), \nonumber \\ n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {im}}) \rightarrow \alpha ^2 (2 \alpha \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1} \varvec{\mathcal {I}}(\varvec{\theta _\star }). \end{aligned}$$
(19)
The matrix term \((2 {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1}\) represents the information that is lost by SGD, and it needs to be the identity for optimal statistical efficiency (see Sect. 3.4). In fact, in more generality, this term is equal to \((2 {\varvec{C}} {\varvec{J}}_\mu (\varvec{\theta _\star })-{\varvec{I}})^{-1}\), where \(\varvec{\mu }(\varvec{\theta })\) is the mean-value function of the statistic used in SGD (see also Equation (15)), and \({\varvec{J}}_\mu (\varvec{\theta _\star })\) is its Jacobian at the true parameter values. Therefore, the asymptotic efficiency of SGD methods depends crucially on the Jacobian of the mean-value function of the statistic used in the SGD iterations.

Asymptotic variance results similar to (18) were first studied in the stochastic approximation literature by Chung (1954) and Sacks (1958), followed by Fabian (1968) and several other authors (see also Ljung et al. 1992, Parts I, II), but not in the closed form of (18), as most analyses were not carried out in the context of recursive statistical estimation. Furthermore, Sakrison’s asymptotic efficiency result (Sakrison 1965) can be recovered by setting \({\varvec{C}}_n = (1/n) \varvec{\mathcal {I}}(\varvec{\theta }_{n-1})^{-1}\); in this case the asymptotic variance for both estimators is \((1/n) \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\), i.e., it is the optimal asymptotic efficiency of the maximum-likelihood estimator.
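
The closed form (18) is easy to evaluate numerically. The sketch below compares the asymptotic variance implied by a first-order choice \({\varvec{C}} = \alpha {\varvec{I}}\) with the optimal second-order choice \({\varvec{C}} = \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\) and with the maximum-likelihood bound \(\varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\); the Fisher information matrix used here is an arbitrary illustrative example.

```python
# Numerical illustration of the asymptotic variance formula (18):
# for positive-definite C commuting with the Fisher information, the limit
# (2 C I - I)^{-1} C I C' dominates I^{-1}, with equality when C = I^{-1}.
import numpy as np

fisher = np.diag([4.0, 1.0, 0.25])           # illustrative Fisher information I(theta_star)
ident = np.eye(3)

def asymptotic_variance(C):
    return np.linalg.inv(2 * C @ fisher - ident) @ C @ fisher @ C.T

C_first_order = 2.1 * ident                  # first-order SGD with alpha = 2.1 > 1/(2*lambda_min)
C_optimal = np.linalg.inv(fisher)            # second-order choice C = I^{-1}

print("first-order SGD:\n", np.round(asymptotic_variance(C_first_order), 3))
print("optimal C:\n", np.round(asymptotic_variance(C_optimal), 3))
print("MLE bound I^{-1}:\n", np.round(np.linalg.inv(fisher), 3))
```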

3.2 Stability issues

Stability has been a well-known issue for explicit SGD. The main problem in practice is that the learning rate sequence needs to agree with the eigenvalues of the Fisher information matrix. To see this, let us simplify (16) and (17) by dropping the remainder terms \(o(a_n)\). Then we obtain
$$\begin{aligned}&\varvec{b}_{n}^{\mathrm {sgd}} = ({\varvec{I}}- {\varvec{C}}_{n} \varvec{\mathcal {I}}(\varvec{\theta _\star })) \varvec{b}_{n-1}^{\mathrm {sgd}} = {\varvec{P}}_{1}^{n} \varvec{b}_{0}, \end{aligned}$$
(20)
$$\begin{aligned}&\varvec{b}_{n}^{\mathrm {im}} = ({\varvec{I}}+ {\varvec{C}}_{n} \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1} \varvec{b}_{n-1}^{\mathrm {im}} = {\varvec{Q}}_{1}^{n} \varvec{b}_{0}, \end{aligned}$$
(21)
where \({\varvec{P}}_{1}^{n} = \prod _{i=1}^n ({\varvec{I}}- {\varvec{C}}_{i} \varvec{\mathcal {I}}(\varvec{\theta _\star }))\), \({\varvec{Q}}_{1}^{n} = \prod _{i=1}^n ({\varvec{I}}+ {\varvec{C}}_{i} \varvec{\mathcal {I}}(\varvec{\theta _\star }))^{-1}\), and \(\varvec{b}_{0}\) denotes the initial bias of the two procedures from some common starting point \(\varvec{\theta }_0\). Thus, the matrices \({\varvec{P}}_{1}^{n}\) and \({\varvec{Q}}_{1}^{n}\) describe how fast the initial bias decays for explicit and implicit SGD, respectively. Assuming convergence, \({\varvec{P}}_{1}^{n} \rightarrow {\varvec{0}}\) and \({\varvec{Q}}_{1}^{n} \rightarrow {\varvec{0}}\), and thus we say that both methods are asymptotically stable. However, they have significant differences in small-to-moderate samples. For simplicity, let us compare the two SGD procedures in their first-order formulation where \({\varvec{C}}_n = a_n {\varvec{I}}\) and \(a_n = \alpha /n\) for some \(\alpha > 0\).

In explicit SGD, the eigenvalues of \({\varvec{P}}_{1}^{n}\) can be calculated as \(\lambda _i' = \prod _j (1-\alpha \lambda _i/j) = \mathcal {O}(n^{-\alpha \lambda _i})\), for \(0 < \alpha \lambda _i <1 \), where \(\lambda _i\) are the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). Thus, the magnitude of \({\varvec{P}}_{1}^{n}\) will be dominated by \({\lambda }_{\max }\), the maximum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\), and the rate of convergence to zero will be dominated by \(\lambda _{\min }\), the minimum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). The condition \(\alpha {\lambda }_{\max }\le 1 \Rightarrow \alpha \le 1/{\lambda }_{\max }\) is required for stability, but for fast convergence we require \(\alpha \lambda _{\min }\approx 1\). In high-dimensional settings, this can be the source of serious problems because \({\lambda }_{\max }\) can be on the order of \(p\), i.e., the number of model parameters. Thus, in explicit SGD the requirements for stability and speed of convergence are in conflict. A conservative learning rate sequence can guarantee stability, but this comes at a price in convergence, which will be on the order of \(\mathcal {O}(n^{-\alpha \lambda _{\min }})\), and vice versa. In stark contrast, the implicit procedure is unconditionally stable. The eigenvalues of \({\varvec{Q}}_{1}^{n}\) are \(\lambda _i' = \prod _{j=1}^n 1 / (1+\alpha \lambda _i / j) =\mathcal {O}(n^{-\alpha \lambda _i})\), and thus are guaranteed to be less than one for any choice of the learning rate parameter \(\alpha \). The critical difference with explicit SGD is that a small \(\alpha \) is no longer required for stability because the eigenvalues of \({\varvec{Q}}_{1}^{n}\) will always be less than one.

Based on this analysis, the magnitude of \({\varvec{P}}_{1}^{n}\) can become arbitrarily large, and thus explicit SGD is likely to diverge numerically. In contrast, \({\varvec{Q}}_{1}^{n}\) is guaranteed to be bounded, and so under any misspecification of the learning rate parameter the implicit SGD procedure is guaranteed to remain stable. The instability of explicit SGD is well known, and requires careful work to avoid in practice. For example, a typical learning rate for explicit SGD is of the form \(a_n = \alpha (\alpha \beta + n)^{-1}\), where \(\beta \) is chosen so that the explicit updates will not diverge; a reasonable choice is to set \(\beta = \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta _\star }))\) and to set \(\alpha \) close to \(1/\lambda _{\min }\). Such explicit normalization of the learning rates is not necessary in implicit SGD because, as shown in Equation (4), the implicit update performs such normalization indirectly.
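
A short numerical check of this comparison follows: for a single eigenvalue \(\lambda \) and a misspecified (large) \(\alpha \), the explicit factor \(\prod _j (1-\alpha \lambda /j)\) explodes in magnitude, while the implicit factor \(\prod _j (1+\alpha \lambda /j)^{-1}\) remains in \((0, 1)\). The values of \(\lambda \), \(\alpha \), and \(n\) below are arbitrary illustrations.

```python
# Sketch of the stability comparison: bias-decay factors for explicit SGD,
# P_1^n = prod_j (1 - alpha * lambda / j), versus implicit SGD,
# Q_1^n = prod_j 1 / (1 + alpha * lambda / j), with a misspecified alpha.
import numpy as np

lambda_max, alpha, n = 10.0, 5.0, 40     # illustrative values
j = np.arange(1, n + 1)

P = np.prod(1.0 - alpha * lambda_max / j)          # explicit: blows up when alpha*lambda >> 1
Q = np.prod(1.0 / (1.0 + alpha * lambda_max / j))  # implicit: always in (0, 1)

print(f"explicit factor P_1^n = {P:.3e}")   # astronomically large in magnitude
print(f"implicit factor Q_1^n = {Q:.3e}")   # small and bounded by one
```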

Finally, an important line of work in the stability of stochastic approximations has been inspired by Huber’s work in robust statistics (Huber et al. 1964; Huber 2011). In our notation, robust stochastic approximation considers iterations of the following form
$$\begin{aligned} \varvec{\theta }_n= \varvec{\theta }_{n-1}+ {\varvec{C}}_n \varvec{\psi } (\varvec{s}(\varvec{y}_{n})- \varvec{h}(\varvec{\theta }_{n-1})), \end{aligned}$$
(22)
where an appropriate function \(\varvec{\psi }\) is sought for robust estimation; in this problem we assume \(\mathbb {E}\left( \varvec{s}(\varvec{y}_{n}) \right) = \varvec{h}(\varvec{\theta _\star })\), but the distribution of \(\varvec{s}(\varvec{y}_{n})- \varvec{h}(\varvec{\theta _\star })\), denoted by \(f(\cdot )\), is unknown. In a typical setup, \(f(\cdot )\) is considered to belong to a family of distributions \(\mathcal {P}\), and \(\varvec{\psi }\) is selected as
$$\begin{aligned} \varvec{\psi }_\star = \arg \min _{\varvec{\psi }} \max _{f \in \mathcal {P}} \lim _{n \rightarrow \infty } n \mathrm {Var}(\varvec{\theta }_n) \end{aligned}$$
i.e., such that the maximum possible variance over the family \(\mathcal {P}\) is minimized. Several important results have been achieved by Martin and Masreliez (1975) and Polyak and Tsypkin (1979). For example, in linear models where \(\varvec{\mu }(\cdot )\) is linear in \(\varvec{\theta }\) and \(\varvec{s}(\varvec{y}_{n})\) is one-dimensional, consider the general family \(\mathcal {P} = \{f : f(0) \ge \epsilon \}\) as the set of all symmetric densities that are positive at 0. Then the optimal choice is \(\varvec{\psi }_\star = \mathrm {sign}(\cdot )\) i.e., the sign function, because it can be shown that the Laplace distribution is the member density of \(\mathcal {P}\) that gives the least information about the parameters \(\varvec{\theta _\star }\).
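
As an illustration of (22) with \(\varvec{\psi }_\star = \mathrm {sign}(\cdot )\), the sketch below estimates a one-dimensional location parameter under heavy-tailed noise; the Cauchy noise model and the learning rate are assumptions made for this example only.

```python
# Minimal sketch of a robust stochastic approximation step as in (22),
# using psi = sign on a one-dimensional location model with heavy-tailed noise.
import numpy as np

rng = np.random.default_rng(2)
theta_star, alpha, n_iter = 3.0, 2.0, 100_000   # illustrative choices

theta = 0.0
for n in range(1, n_iter + 1):
    s_n = theta_star + rng.standard_cauchy()            # statistic with symmetric heavy-tailed noise
    theta = theta + (alpha / n) * np.sign(s_n - theta)  # robust update with psi = sign

print("robust estimate:", round(theta, 3), " truth:", theta_star)
```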

3.3 Choice of parameterization and efficiency

First-order SGD methods are attractive for their computational performance, but the variance result (19) shows that they may suffer a significant loss in statistical efficiency. However, a reparameterization of the problem could yield a first-order SGD method that is optimal. The method can be described as follows. First, assume the exponential family (12) such that \(\nabla \ell (\varvec{\theta }; \varvec{y}_{n}) = \varvec{s}(\varvec{y}_{n})- \varvec{h}(\varvec{\theta })\), where \(\varvec{h}(\varvec{\theta }) = \nabla A(\varvec{\theta }) = \mathbb {E}(\varvec{s}(\varvec{y}_{n}) \mid \varvec{\theta _\star }=\varvec{\theta })\), and consider the reparameterization
$$\begin{aligned} \varvec{\omega }\triangleq \varvec{h}(\varvec{\theta }), \end{aligned}$$
(23)
which we assume exists, is one-to-one, and is easy to compute; these are critical assumptions that are hard, but not impossible, to satisfy in practice. We will refer to (23) as the mean-value parameterization and \(\varvec{\omega }\) as the mean-value parameters. Starting with an estimate \(\varvec{\omega }_0\) of \(\varvec{\omega }_\star = \varvec{h}(\varvec{\theta _\star })\), we can define the SGD procedures on this new parameter space as
$$\begin{aligned}&\varvec{\omega }_n^\mathrm {sgd}= \varvec{\omega }_{n-1}^\mathrm {sgd}+ (1/n) (\varvec{s}(\varvec{y}_{n})- \varvec{\omega }_{n-1}^\mathrm {sgd}), \end{aligned}$$
(24)
$$\begin{aligned}&\varvec{\omega }_n^\mathrm {im}= \varvec{\omega }_{n-1}^\mathrm {im}+ (1/n) (\varvec{s}(\varvec{y}_{n})- \varvec{\omega }_n^\mathrm {im}), \end{aligned}$$
(25)
where we also set \({\varvec{C}}_n=(1/n) {\varvec{I}}\) so that \({\varvec{C}}= {\varvec{I}}\). In this case, the explicit SGD simply calculates the running average of the complete sufficient statistic i.e., \(\varvec{\omega }_n^\mathrm {sgd}= n^{-1} \sum _{i=1}^n \varvec{s}(\varvec{y}_{i})\), and thus it is identical to the MLE estimator; similarly the implicit SGD satisfies \(\varvec{\omega }_n^\mathrm {im}= (n+1)^{-1} \sum _{i=1}^n \varvec{s}(\varvec{y}_{i})\) i.e., it is a slightly biased version of the MLE. It is thus straightforward to show (see for example Toulis and Airoldi 2014) that the mean-value parameterization is optimal i.e.,
$$\begin{aligned} \mathrm {Var}\left( \varvec{h}^{-1}(\varvec{\omega }_n^\mathrm {sgd})\right) \rightarrow (1/n) \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}, \nonumber \\ \mathrm {Var}\left( \varvec{h}^{-1}(\varvec{\omega }_n^\mathrm {im})\right) \rightarrow (1/n) \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}. \end{aligned}$$
(26)
Intuitively, the mean-value parameterization transforms all parameters into location parameters. The Jacobian of the regression function of the statistic is \({\varvec{J}}_\mu (\varvec{\omega }_\star )= \nabla _\omega \mathbb {E}(\varvec{s}(\varvec{y}_{n}) \mid \varvec{\omega }=\varvec{\omega }_\star ) = {\varvec{I}}\), and thus the information loss described in Equation (18) is avoided since \((2 {\varvec{C}} {\varvec{J}}_\mu (\varvec{\omega }_\star ) - {\varvec{I}})^{-1} = {\varvec{I}}\). Transforming back to the original parameter space incurs no information loss either, and so estimation of \(\varvec{\theta _\star }\) is efficient. This method is illustrated in the following example.
Example. Consider the problem of estimating \((\mu , \sigma ^2)\) from normal observations \(y_n \sim \mathcal {N}(\mu , \sigma ^2)\), and let \(\varvec{\theta _\star }= (\mu , \sigma ^2)\) which is not the natural parameterization. Consider sufficient statistics \(\varvec{s}(\varvec{y}_n) = (y_n, y_n^2)\) such that \(\mathbb {E}\left( \varvec{s}(\varvec{y}_n) \right) = (\mu , \mu ^2 + \sigma ^2) \triangleq (\omega _1, \omega _2)\). The parameter \(\varvec{\omega } = (\omega _1, \omega _2)\) corresponds to the mean-value parameterization. The inverse transformation is \(\mu = \omega _1\) and \(\sigma ^2 = \omega _2 - \omega _1^2\), and thus its Jacobian is
$$\begin{aligned} {\varvec{J}}_h^{-1} = \left( \begin{array}{cc} 1 & 0 \\ -2\omega _1 & 1 \end{array} \right) . \end{aligned}$$
The variance of \(\varvec{s}(\varvec{y}_n)\) is given by
$$\begin{aligned} {\varvec{V}}(\varvec{\theta _\star }) = \left( \begin{array}{cc} Q & 2 \omega _1 Q \\ 2 \omega _1 Q & 4 \omega _1^2 Q + 2Q^2 \end{array} \right) , \end{aligned}$$
where \(Q = \omega _2 - \omega _1^2 = \sigma ^2\). Thus the variance of \((\widehat{\omega _1}, \widehat{\omega _2})\) is \((1/n) {\varvec{V}}(\varvec{\theta _\star })\) and the variance of \((\widehat{\mu }, \widehat{\sigma ^2})\) is given by
$$\begin{aligned} \mathrm {Var}\left( (\widehat{\mu }, \widehat{\sigma ^2})\right) = (1/n) {\varvec{J}}_h^{-1} {\varvec{V}} ({\varvec{J}}_h^{-1})^\intercal = (1/n) \left( \begin{array}{cc} Q & 0 \\ 0 & 2 Q^2 \end{array} \right) = (1/n) \left( \begin{array}{cc} \sigma ^2 & 0 \\ 0 & 2 \sigma ^4 \end{array} \right) , \end{aligned}$$
which is exactly the asymptotic variance of the maximum-likelihood estimate. In practice, however, the mean-value transformation is rarely possible. Still, the intuition of transforming the model parameters into location parameters can be very useful in many situations, even when such a transformation is only approximate.
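
A small sketch of the procedure in this example follows: run the explicit update (24) on \(\varvec{\omega } = (\mathbb {E}(y), \mathbb {E}(y^2))\), which reduces to a running average of the sufficient statistics, and then map back to \((\mu , \sigma ^2)\). The simulated values of \(\mu \) and \(\sigma ^2\) are illustrative assumptions.

```python
# Sketch of first-order SGD in the mean-value parameterization (24) for the
# normal example: omega = (E(y), E(y^2)) is updated as a running average,
# then mapped back to (mu, sigma^2) through the inverse transformation.
import numpy as np

rng = np.random.default_rng(3)
mu_star, sigma_star, N = 1.5, 2.0, 100_000
y = rng.normal(mu_star, sigma_star, size=N)

omega = np.zeros(2)                                # starting point for (omega_1, omega_2)
for n in range(1, N + 1):
    s_n = np.array([y[n - 1], y[n - 1] ** 2])      # sufficient statistic s(y_n)
    omega = omega + (1.0 / n) * (s_n - omega)      # explicit update (24), i.e., running mean

mu_hat = omega[0]
sigma2_hat = omega[1] - omega[0] ** 2              # inverse of the mean-value map
print("mu_hat:", round(mu_hat, 3), " sigma2_hat:", round(sigma2_hat, 3))
```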

3.4 Choice of learning rate sequence

An interesting observation on the asymptotic variance results (18) is that for any choice of the symmetric positive-definite matrix \({\varvec{C}}\),
$$\begin{aligned} (2 {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) - {\varvec{I}})^{-1} {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) {\varvec{C}}^\intercal \ge \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}, \end{aligned}$$
(27)
where \({\varvec{A}} \ge {\varvec{B}}\) for two matrices \({\varvec{A}}, {\varvec{B}}\) indicates that \({\varvec{A}} - {\varvec{B}}\) is nonnegative-definite. Hence, even in second-order form, both methods generally incur an efficiency loss when compared to the maximum-likelihood estimator, a loss that can be quantified exactly through (18). There are two ways to achieve asymptotic efficiency. First, one can design the condition matrix such that \(n {\varvec{C}}_n \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1} \triangleq {\varvec{C}}_\star \).5 However, this requires knowledge of the Fisher information matrix at the true parameters \(\varvec{\theta _\star }\), which is usually unknown. The Venter process (Venter 1967) was the first method to follow an adaptive approach to estimate this matrix, and was later analyzed and extended by several other authors (Fabian 1973; Lai and Robbins 1979; Amari et al. 2000; Bottou and Le Cun 2005). Adaptive methods that perform an approximation of the matrix \({\varvec{C}}_\star \) (e.g., through a Quasi-Newton scheme) have recently been applied with considerable success (Schraudolph et al. 2007; Bordes et al. 2009); see Sect. 3.5.2 for more details.
In contrast, an efficiency loss is generally unavoidable in first-order SGD i.e., when \({\varvec{C}}_n = a_n{\varvec{I}}\) with \(n a_n \rightarrow \alpha \). Asymptotic efficiency can occur only when \(\lambda _i = 1/\alpha \) i.e., when all eigenvalues \(\lambda _i\) of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) are identical. When \(\lambda _i\)’s are distinct the eigenvalues of the asymptotic variance matrix \(n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {sgd}})\) (or \(n \mathrm {Var}(\varvec{\theta }_{n}^{\mathrm {im}})\)) are \(\alpha ^2 \lambda _i / (2\alpha \lambda _i-1)\) which is at least \(1/\lambda _i\) for any \(\alpha \). In this case, one reasonable way to set the parameter \(\alpha \) would be to minimize the trace of the asymptotic variance matrix i.e., solve
$$\begin{aligned} \hat{\alpha } = \arg \min _{\alpha } \sum _i \alpha ^2 \lambda _i / (2\alpha \lambda _i-1), \end{aligned}$$
(28)
under the constraint that \(\alpha > 1 / (2 \lambda _{\min })\), thus making an undesirable but necessary compromise for convergence in all parameter components. However, the eigenvalues \(\{ \lambda _i\}\) are unknown in practice and need to be estimated from the data. This problem has received significant attention recently and several methods exist (see Karoui 2008, and references therein). A powerful alternative is to reparametrize the problem, apply SGD on the new parameter space, and then perform the inverse transformation, as in Sect. 3.3.
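
When eigenvalue estimates are available, the criterion (28) can be minimized numerically, e.g., by a simple grid search over the admissible range \(\alpha > 1/(2\lambda _{\min })\), as in the sketch below; the eigenvalues used are illustrative placeholders for estimates obtained from data.

```python
# Sketch of choosing alpha by minimizing the trace criterion (28),
# given (estimates of) the eigenvalues of the Fisher information.
import numpy as np

eigvals = np.array([4.0, 1.0, 0.25])          # illustrative eigenvalue estimates
lam_min = eigvals.min()

def trace_avar(alpha):
    # trace of the asymptotic variance matrix in (19) as a function of alpha
    return np.sum(alpha**2 * eigvals / (2 * alpha * eigvals - 1))

grid = np.linspace(1.0 / (2 * lam_min) + 1e-3, 20.0, 10_000)  # respect alpha > 1/(2 lambda_min)
values = [trace_avar(a) for a in grid]
alpha_hat = grid[int(np.argmin(values))]
print("alpha_hat:", round(float(alpha_hat), 3))
```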

3.4.1 Practical considerations

There is a voluminous research literature on learning rate sequences for stochastic approximation and SGD. However, we decided to discuss this issue at the end of this section because the choice of the learning rate sequence conflates multiple design goals that usually conflict in practice, e.g., convergence (or bias), asymptotic variance, and stability.

In general, the theory presented so far indicates that the learning rate for first-order explicit SGD should be of the form \(a_n = \alpha (\alpha \beta + n)^{-1}\). Note that \(\lim _{n \rightarrow \infty } n a_n = \alpha \), so \(\alpha \) is indeed the learning rate parameter introduced in Sect. 1. The parameter \(\alpha \) controls the asymptotic variance, and a reasonable choice would be the solution of (28), which requires estimates of the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). An easier method is to simply use \(\alpha = 1/\lambda _{\min }\), where \(\lambda _{\min }\) is the minimum eigenvalue of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\); the value \(1/\lambda _{\min }\) is an approximate solution of (28), and also has good empirical performance (Xu 2011; Toulis et al. 2014). The parameter \(\beta \) can be used to stabilize explicit SGD. In particular, one would want to control the variance of the stochastic gradient \(\mathrm {Var}\left( \nabla \ell (\varvec{\theta }_n; \varvec{y}_n)\right) = \varvec{\mathcal {I}}(\varvec{\theta _\star }) + \mathcal {O}(a_n)\), for points near \(\varvec{\theta _\star }\); see also the stability analysis in Sect. 3.2. One reasonable value would thus be \(\beta = \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta _\star }))\), which can be estimated easily by summing squared norms of the score function, i.e., \(\hat{\beta } = \sum _{i=1}^n ||\nabla \ell (\varvec{\theta }_i; \varvec{y}_i)||^2\), similar to Amari et al. (2000) and Duchi et al. (2011); see also Sect. 3.5.2.

For implicit SGD, the situation is a bit easier because a learning rate sequence \(a_n = \alpha (\alpha + n)^{-1}\) works well in practice (Toulis et al. 2014). As before, \(\alpha \) controls the efficiency of the method and so we can set \(\alpha = 1/\lambda _{min}\) as in explicit SGD. The additional stability term (\(\beta \)) in explicit SGD is unnecessary because the implicit method performs such normalization (shrinkage) indirectly—see Eq. (4).

However, tuning the learning rate sequence eventually depends on problem-specific considerations, and there is a considerable variety of sequences that have been employed in practice (George and Powell 2006). Principled design of learning rates in SGD remains an important research topic (Schaul et al. 2012).

3.5 Some interesting extensions

3.5.1 Averaged stochastic gradient descent

Estimation with SGD can be optimized for statistical efficiency only with knowledge of the underlying model. For example, the optimal learning rate parameter \(\alpha \) in first-order SGD requires knowledge of the eigenvalues of the Fisher information matrix \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\). In second-order SGD, optimality is achieved when one uses a sequence of matrices \({\varvec{C}}_n\) such that \(n{\varvec{C}}_n \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\). Methods that approximate \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) make up a significant class of methods in stochastic approximation. Another important class of stochastic approximation methods relies on averaging of the iterates. The corresponding SGD procedure is usually referred to as averaged SGD, or ASGD for short.6

Averaging of iterates in the Robbins–Monro procedures was studied independently by Ruppert (1988) and Bather (1989), and both proposed similar averaging schemes. If we use the notation of Sect. 2 (see also iteration (6)), Ruppert (1988) considered the following stochastic approximation procedure
$$\begin{aligned}&\theta _n = \theta _{n-1} - a_n y_n, \nonumber \\&\bar{\theta }_n = \frac{1}{n} \sum _{i=1}^n \theta _i, \end{aligned}$$
(29)
where \(a_n = \alpha n^{-c}\) for \(1/2< c< 1\), and \(\bar{\theta }_n\) are the estimates of the zero of the regression function \(M(\theta )\). Under certain conditions, Ruppert (1988) showed that \(n \mathrm {Var}(\bar{\theta }_n) \rightarrow \sigma ^2 / M'(\theta _\star )^2\), where \(\sigma ^2 = \mathrm {Var}(y_n \mid \theta _\star )\). Recall that the typical Robbins–Monro procedure gives estimates with asymptotic variance \(\alpha ^2 \sigma ^2 / (2\alpha M'(\theta _\star )-1)\), which is at least equal to the variance of the averaged iterate. Ruppert (1988) provides a nice statistical intuition on why averaging achieves such efficiency with larger learning rates. First, write \(y_n = M(\theta _n) - \varepsilon _n\), where \(\varepsilon _n\) are zero-mean independent random variables with finite variance. The typical analysis in stochastic approximation starts by solving the recursion (6) to get an expression like the following
$$\begin{aligned} \theta _n - \theta _\star = \sum _{i=1}^n c(i, n) a_i \varepsilon _i + o(1), \end{aligned}$$
(30)
where \(c(i, n) = \exp \{-A(n) + A(i)\}\), \(A(m) = K \sum _{j=1}^m a_j\) is the function of partial sums, and \(K\) is some constant. Ruppert (1988) shows that Eq. (30) can be rewritten as
$$\begin{aligned} \theta _n - \theta _\star = a_n \sum _{i=b(n)}^n c(i, n) \varepsilon _i + o(1), \end{aligned}$$
(31)
where \(b(n) = \lfloor n - R n^c \log n\rfloor \) with \(R\) a positive constant, and \(\lfloor \cdot \rfloor \) the positive integer floor function. Ruppert (1988) argues that when \(a_n = \alpha /n\) then \(b(n) =\mathcal {O}(1)\), and \(\theta _n-\theta _\star \) is the weighted average of all noise variables \(\varepsilon _n\). When \(a_n = \alpha n^{-c}\) for \(1/2<c<1\), then \(\theta _n-\theta _\star \) is a weighted average of only \(\mathcal {O}(n^c \log n)\) noise variables. Thus, in the former case there is significant autocorrelation in the series \(\theta _n\). In the latter case, for \(0 < p_1 < p_2 < 1\) the variables \(\theta _{\lfloor p_1 n\rfloor }\) and \(\theta _{\lfloor p_2 n\rfloor }\) are asymptotically uncorrelated, and thus averaging improves the estimation efficiency.

Polyak and Juditsky (1992) derive further significant results for averaged SGD, showing in particular that ASGD can be as asymptotically efficient as second-order SGD under certain mild assumptions. In fact, due to the authors’ prior work in averaged stochastic approximation, ASGD is usually referred to as the Polyak–Ruppert averaging scheme. Adoption of averaging schemes for statistical learning has been slow but steady over the years (Zhang 2004; Nemirovski et al. 2009; Bottou 2010; Cappé 2011). One practical reason is that averaging only helps when the underlying stochastic process is slow to converge, which is hard to know in practice; in fact, averaging can have an adverse effect when the underlying SGD process is converging well. Furthermore, the selection of the learning rate sequence is also important in ASGD, and a bad sequence can cause the algorithm to converge very slowly (Xu 2011), or even diverge. Research on ASGD is still ongoing, as several directions, such as the combination of stable methods (e.g., stochastic proximal methods, implicit SGD) with averaging schemes, remain unexplored. Furthermore, in a similar line of work, several methods have been developed that use averaging in order to reduce the variance of stochastic gradients (Johnson and Zhang 2013; Wang et al. 2013).
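
A minimal sketch of the averaging idea in (29), using explicit SGD with the slower rate \(a_n = \alpha n^{-2/3}\) on the linear-normal model of the earlier sketches, is given below; the learning rate constants and data-generating setup are illustrative assumptions.

```python
# Sketch of averaged SGD in the spirit of (29): run explicit SGD with the
# slower rate a_n = alpha * n^{-2/3} and report the running average of iterates.
import numpy as np

rng = np.random.default_rng(4)
N, p = 50_000, 5
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

alpha, c = 0.5, 2.0 / 3.0
theta = np.zeros(p)
theta_bar = np.zeros(p)
for n in range(1, N + 1):
    a_n = alpha * n ** (-c)                            # slower-decaying learning rate
    x_n, y_n = X[n - 1], y[n - 1]
    theta = theta + a_n * (y_n - x_n @ theta) * x_n    # explicit SGD step
    theta_bar = theta_bar + (theta - theta_bar) / n    # running average of iterates

print("averaged estimate:", np.round(theta_bar, 3))
print("truth:            ", np.round(theta_star, 3))
```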

3.5.2 Second-order stochastic gradient descent

Sakrison’s recursive estimation method (9) is the archetype of second-order SGD, but it requires an expensive matrix inversion at every iteration. Several methods have been developed in stochastic approximation that approximate such a matrix across iterations, and these are generally termed adaptive. Early adaptive methods in stochastic approximation were given by Nevelson and Khasminskiĭ (1973) and Wei (1987); translated into an SGD procedure, such methods would recursively estimate \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) by computing finite differences \(\varvec{y}_{n, +}^j - \varvec{y}_{n, -}^j\) sampled at \(\varvec{\theta }_n+c_n \varvec{e}_j\) and \(\varvec{\theta }_n - c_n \varvec{e}_j\), respectively, where \(\varvec{e}_j\) is the \(j\)th unit basis vector and \(c_n\) is an appropriate sequence of positive numbers. While such methods are very useful in sequential experiment design, where one has control over the data generation process, they are impractical for modern online learning problems.

A simple and effective approach was proposed by Amari et al. (2000). The idea is to keep an estimate \(\hat{{\varvec{\mathcal {I}}}}_{n}\) of \(\varvec{\mathcal {I}}(\varvec{\theta _\star })\) and use an explicit SGD scheme as follows:
$$\begin{aligned}&\hat{{\varvec{\mathcal {I}}}}_{n} = (1-c_n) \hat{{\varvec{\mathcal {I}}}}_{n-1} + c_n \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n) \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n)^\intercal , \nonumber \\&\varvec{\theta }_n = \varvec{\theta }_{n-1} + \hat{{\varvec{\mathcal {I}}}}_{n}^{-1} \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n). \end{aligned}$$
(32)
Inversion of the estimate \(\hat{{\varvec{\mathcal {I}}}}_{n}\) is (relatively) cheap by using the Sherman-Morrison formula. This scheme, however, introduces the additional problem of determining the sequence \(c_n\) in (32). In their work, Amari et al. (2000) advocated for a small constant \(c_n = c>0\) that can be determined through computer simulations.
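
A sketch of the scheme (32) on the linear-normal model is given below, maintaining the inverse \(\hat{{\varvec{\mathcal {I}}}}_{n}^{-1}\) directly through the Sherman-Morrison formula. Note that a decaying step size \(1/n\) is added here so that the sketch settles down; this step size is an assumption of the illustration rather than part of the display (32), while the constant \(c_n = c\) follows the recommendation above.

```python
# Sketch of the adaptive scheme (32) on a linear-normal model, keeping the
# inverse of the Fisher-information estimate via the Sherman-Morrison formula.
# A 1/n step size is added for this sketch; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(5)
N, p = 20_000, 5
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

c = 0.01                                   # small constant c_n = c
theta = np.zeros(p)
inv_fisher = np.eye(p)                     # running estimate of I(theta_star)^{-1}
for n in range(1, N + 1):
    x_n, y_n = X[n - 1], y[n - 1]
    g = (y_n - x_n @ theta) * x_n          # score of the n-th observation
    # Sherman-Morrison update of the inverse of (1-c) * I_hat + c * g g'
    Bg = inv_fisher @ g
    inv_fisher = (inv_fisher - c * np.outer(Bg, Bg) / ((1 - c) + c * g @ Bg)) / (1 - c)
    theta = theta + (1.0 / n) * inv_fisher @ g   # conditioned (natural-gradient) step

print("adaptive SGD estimate:", np.round(theta, 3))
print("true parameters:     ", np.round(theta_star, 3))
```
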
Another notable approach based on Quasi-Newton methods (see Sect. 1) was developed by Bordes et al. (2009). Their method, termed SGD-QN, approximates the Fisher information matrix through a secant condition as in the original BFGS algorithm (Broyden 1965). The secant condition in SGD-QN is
$$\begin{aligned} \varvec{\theta }_n - \varvec{\theta }_{n-1} \approx \hat{{\varvec{\mathcal {I}}}}_{n-1}^{-1} \left[ \nabla \ell (\varvec{\theta }_n; \varvec{y}_n) - \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n) \right] \triangleq \hat{{\varvec{\mathcal {I}}}}_{n-1}^{-1} \varvec{\delta }_n, \end{aligned}$$
(33)
where \(\hat{{\varvec{\mathcal {I}}}}_{n}\) is kept diagonal. If we let \(\varvec{D}_n\) denote the diagonal matrix with \(i\)th diagonal element \(d_{ii} = (\theta _{n,i} - \theta _{n-1, i}) / \delta _{n, i}\), then the update of the approximation matrix in SGD-QN is given by
$$\begin{aligned} \hat{{\varvec{\mathcal {I}}}}_{n} \leftarrow \hat{{\varvec{\mathcal {I}}}}_{n-1} + \frac{2}{r} (\varvec{D}_n - \hat{{\varvec{\mathcal {I}}}}_{n-1}), \end{aligned}$$
(34)
and the update of \(\varvec{\theta }_n\) is similar to (32). The parameter \(r\) is controlled internally in the algorithm, and counts the number of times the update (34) has been performed.
Another notable second-order method is AdaGrad (Duchi et al. 2011), which adapts multiple learning rates using gradient information. In one popular variant of the method, AdaGrad keeps a diagonal \((p \times p)\) matrix \({\varvec{A}}_n\) of learning rates that are updated at every iteration. Upon observing data \(\varvec{y}_n\), AdaGrad updates \({\varvec{A}}_n\) as follows:
$$\begin{aligned} {\varvec{A}}_n = {\varvec{A}}_{n-1} + \mathrm {diag}(\nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n) \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n)^\intercal ), \end{aligned}$$
(35)
where \(\mathrm {diag}({\varvec{A}})\) is the diagonal matrix with the same diagonal as its matrix argument \({\varvec{A}}\). Learning in AdaGrad proceeds through the iteration
$$\begin{aligned} \varvec{\theta }_n = \varvec{\theta }_{n-1} + \alpha {\varvec{A}}_n^{-1/2} \circ \nabla \ell (\varvec{\theta }_{n-1}; \varvec{y}_n), \end{aligned}$$
(36)
where \(\alpha > 0\) is a learning rate parameter that is shared among all parameter components, and the symbol \(\circ \) denotes elementwise multiplication. The original motivation for AdaGrad stems from proximal methods in optimization, but there is a statistical intuition why the update (36) is reasonable. In general, from an information perspective, a learning rate sequence \(a_n\) discounts an observation \(\varvec{y}_n\) according to the reciprocal of the statistical information that has been gathered so far for the parameter of interest \(\varvec{\theta _\star }\). The intuition behind a rate of the form \(a_n = \alpha / n\) is that the information after \(n\) iterations is proportional to \(n\), under the i.i.d. data assumption. In high dimensions, where some parameter components affect outcomes less frequently than others, AdaGrad replaces the term \(n\) with an estimate of the information that has actually been received for each component. A (biased) estimate of this information is provided by the elements of \({\varvec{A}}_n\) in (35), and is justified since \(\mathbb {E}\left( \nabla \ell (\varvec{\theta }; \varvec{y}_n) \nabla \ell (\varvec{\theta }; \varvec{y}_n)^\intercal \right) = \varvec{\mathcal {I}}(\varvec{\theta })\). Interestingly, implicit SGD and AdaGrad share the common property of shrinking explicit SGD estimates according to the Fisher information matrix. Second-order implicit SGD methods are yet to be explored, but further connections are possible.
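
A minimal sketch of the variant in (35)–(36) on the linear-normal model is given below; the small constant added inside the denominator is a common implementation detail assumed here to avoid division by zero, and all other numerical choices are illustrative.

```python
# Sketch of the AdaGrad variant in (35)-(36): a per-coordinate accumulator of
# squared gradients rescales a shared learning rate alpha.
import numpy as np

rng = np.random.default_rng(6)
N, p = 20_000, 5
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

alpha, eps = 1.0, 1e-8
theta = np.zeros(p)
accum = np.zeros(p)                       # diagonal of A_n
for n in range(N):
    x_n, y_n = X[n], y[n]
    g = (y_n - x_n @ theta) * x_n         # gradient of the n-th log-likelihood term
    accum += g ** 2                       # update (35), diagonal only
    theta = theta + alpha * g / (np.sqrt(accum) + eps)   # update (36)

print("AdaGrad estimate:", np.round(theta, 3))
print("true parameters: ", np.round(theta_star, 3))
```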

3.5.3 Monte-Carlo stochastic gradient descent

A key requirement for the application of SGD procedures is that the likelihood is easy to evaluate. However, in many situations that are important in practice, this is not possible, for example when the likelihood is only known up to a normalizing constant. In such cases, definitions (10) and (11) cannot be applied directly since \(\nabla \ell (\varvec{\theta }; \varvec{y}_n)\) cannot be computed. However, if unbiased samples of the log-likelihood gradients are available, then explicit SGD can be readily applied. This is possible if sampling from the model is relatively easy.

In particular, assume an exponential family model (12) that is easy to sample from, e.g., through Metropolis-Hastings. A variant of explicit SGD, termed Monte-Carlo SGD (Toulis and Airoldi 2014), can be constructed as follows. Starting from some estimate \(\varvec{\theta }^{\mathrm {mc}}_0\), iterate the following steps for each \(n\)th data point \(\varvec{y}_n\), where \(n=1, 2, \ldots ,N\):
  1. Get \(m\) samples from the model \(\widetilde{\varvec{y}_i} \sim f(\cdot ; \varvec{\theta }^{\mathrm {mc}}_{n-1})\), \(i = 1, 2, \ldots , m\).
  2. Compute average sufficient statistic \(\widetilde{\varvec{s}_n}=(1/m) \sum _{i=1}^m \varvec{s}(\widetilde{\varvec{y}_i})\).
  3. Update the estimate through
     $$\begin{aligned} \varvec{\theta }^{\mathrm {mc}}_n = \varvec{\theta }^{\mathrm {mc}}_{n-1} + {\varvec{C}}_n (\varvec{s}(\varvec{y}_{n})- \widetilde{\varvec{s}_n}). \end{aligned}$$
     (37)
The main idea of the Monte-Carlo SGD algorithm (37) is to use the current parameter estimate to impute the expected value of the sufficient statistic that would otherwise be available if the likelihood were easy to evaluate. Furthermore, assuming \(n {\varvec{C}}_n \rightarrow {\varvec{C}}\), the asymptotic variance of the estimate satisfies
$$\begin{aligned} n \mathrm {Var}\left( \varvec{\theta }^{\mathrm {mc}}_n\right) \rightarrow (1+1/m) \cdot (2{\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star })-{\varvec{I}})^{-1} {\varvec{C}} \varvec{\mathcal {I}}(\varvec{\theta _\star }) {\varvec{C}}^\intercal , \end{aligned}$$
(38)
which exceeds the asymptotic variance of the typical explicit SGD estimator by a factor of \((1+1/m)\). However, in its current form Monte-Carlo SGD (37) is only explicit; an implicit version would require sampling data from the next iterate, which is technically challenging but an interesting open problem. Still, an approximate implicit implementation of Monte-Carlo SGD is possible using the intuition in Eq. (4). For example, one could run an explicit update as in (37), but then shrink it according to \(({\varvec{I}} + a_n \varvec{\mathcal {I}}(\varvec{\theta }^{\mathrm {mc}}_n))^{-1}\), or more efficiently using the one-dimensional shrinkage factor \((1 + a_n \mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta }^{\mathrm {mc}}_n)))^{-1}\), for some decreasing sequence \(a_n>0\).
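As a minimal sketch of the procedure above, assume an exponential-family model with a user-supplied sufficient statistic suffstat(y) and sampler sample_model(theta, m), e.g., a Metropolis-Hastings routine; the condition matrix is taken to be the scalar sequence \({\varvec{C}}_n = (\alpha /n) {\varvec{I}}\) for illustration, and the optional argument trace_fisher(theta) is a hypothetical estimate of \(\mathrm {trace}(\varvec{\mathcal {I}}(\varvec{\theta }))\) used to apply the one-dimensional shrinkage to the update increment, in the spirit of the approximate implicit correction discussed above.

import numpy as np

def monte_carlo_sgd(data, suffstat, sample_model, theta0,
                    alpha=0.1, m=10, trace_fisher=None):
    # Monte-Carlo SGD, Eq. (37), with C_n = (alpha / n) * I for simplicity.
    # sample_model(theta, m) must return m draws from the model at theta;
    # suffstat(y) returns the sufficient statistic of one observation.
    theta = np.asarray(theta0, dtype=float).copy()
    for n, y in enumerate(data, start=1):
        a_n = alpha / n
        s_tilde = np.mean([suffstat(y_sim) for y_sim in sample_model(theta, m)], axis=0)
        delta = a_n * (suffstat(y) - s_tilde)      # explicit increment of Eq. (37)
        if trace_fisher is not None:               # optional approximate implicit shrinkage
            delta = delta / (1.0 + a_n * trace_fisher(theta))
        theta = theta + delta
    return theta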

Theoretically, Monte-Carlo SGD is based on sampling-controlled stochastic approximation methods, in which the usual regression function of the Robbins–Monro procedure (6) is only accessible through sampling (Dupuis and Simha 1991), e.g., through MCMC. Convergence in such settings is subtle because it also depends on the ergodicity of the underlying Markov chain (Younes 1999). In practice, approximate variants of the aforementioned Monte-Carlo SGD procedure have been applied with considerable success to fit large models of neural networks, notably through the contrastive divergence algorithm, as we briefly discuss in Sect. 4.4.

4 Selected applications

SGD has found several important applications over the years. In this section we will review some of them, giving a preference to breadth over depth.

4.1 Online EM algorithm

The Expectation–Maximization algorithm (Dempster et al. 1977) is a numerically stable procedure to compute the maximum-likelihood estimator in latent variable models. Extending our notation, let \(\varvec{x}_n\) denote a latent variable at observed-data point \(\varvec{y}_n\), and let \(f_{\mathrm {com}}(\varvec{x}_n, \varvec{y}_n; \varvec{\theta })\) and \(f_{\mathrm {obs}}( \varvec{y}_n; \varvec{\theta })\) denote the complete-data and observed-data density, respectively; similarly, \(\ell _{\mathrm {com}}\) and \(\ell _{\mathrm {obs}}\) will denote the respective log-likelihoods. For simplicity, we will assume that \(f_{\mathrm {com}}\) is an exponential family model in the natural parameterization, as in (12), such that
$$\begin{aligned} f_{\mathrm {com}}(\varvec{x}_n, \varvec{y}_n;\varvec{\theta }) \!=\! \exp \left\{ \varvec{s}(\varvec{x}_n, \varvec{y}_n)^\intercal \varvec{\theta }- A(\varvec{\theta }) \!+\! B(\varvec{x}_n, \varvec{y}_n) \right\} . \end{aligned}$$
(39)
We will denote the corresponding Fisher information matrices as \(\mathcal {I}_{\mathrm {com}}(\varvec{\theta }) = -\mathbb {E}\left( \nabla \nabla \ell _{\mathrm {com}}(\varvec{x}_n, \varvec{y}_n; \varvec{\theta }) \right) \) and \(\mathcal {I}_{\mathrm {obs}}(\varvec{\theta }) = -\mathbb {E}\left( \nabla \nabla \ell _{\mathrm {obs}}(\varvec{y}_n; \varvec{\theta }) \right) \), where the expectations are considered with model parameters fixed at \(\varvec{\theta }\). Furthermore, let \({\varvec{Y}} = (\varvec{y}_1, \ldots , \varvec{y}_N)\) denote the entire observed dataset as in Sect. 1, and \({\varvec{X}} = (\varvec{x}_1, \ldots , \varvec{x}_N)\) be the corresponding latent variables. The traditional EM algorithm proceeds by iterating the following steps.
$$\begin{aligned}&Q(\varvec{\theta }, \varvec{\theta }_n; {\varvec{Y}}) = \mathbb {E}\left( \ell _{\mathrm {com}}({\varvec{X}}, {\varvec{Y}}; \varvec{\theta }) \right| \varvec{\theta }_n, {\varvec{Y}}), \quad&\mathbf{E-step} \end{aligned}$$
(40)
$$\begin{aligned}&\varvec{\theta }_{n+1} = \arg \max _{\varvec{\theta }} Q(\varvec{\theta }, \varvec{\theta }_n; {\varvec{Y}}). \quad&\mathbf{M-step} \end{aligned}$$
(41)
Dempster et al. (1977) showed that the EM algorithm converges to the maximum-likelihood estimator \(\hat{\varvec{\theta }} = \arg \max _{\varvec{\theta }} \ell _{\mathrm {obs}}({\varvec{Y}}; \varvec{\theta })\); furthermore, they showed that EM is an ascent algorithm, i.e., the likelihood is non-decreasing at each iteration, which gives EM a desirable numerical stability. However, the EM algorithm is impractical for the analysis of large datasets because both the expectation and the maximization steps involve expensive operations that need to be performed on the entire dataset. Therefore, online schemes are necessary for the analysis of large models with latent variables.
Titterington (1984) considered a procedure defined through the iteration
$$\begin{aligned} \varvec{\theta }_n = \varvec{\theta }_{n-1} + a_n \mathcal {I}_{\mathrm {com}}(\varvec{\theta }_{n-1})^{-1} \nabla \ell _{\mathrm {obs}}(\varvec{y}_n; \varvec{\theta }_{n-1}). \end{aligned}$$
(42)
This procedure differs only marginally from Sakrison’s recursive estimation method (see Sect. 2.2) by using the complete-data information matrix. In the univariate case where the true model parameter is \(\theta _\star \), Titterington (1984) applied Fabian’s theorem (Fabian 1968) to show that the estimate in (42) satisfies \(\sqrt{n} (\theta _n - \theta _\star ) \sim \mathcal {N}\left(0, \mathcal {I}_{\mathrm {com}}(\theta _\star )^{-2} \mathcal {I}_{\mathrm {obs}}(\theta _\star ) / (2 \mathcal {I}_{\mathrm {obs}}(\theta _\star ) \mathcal {I}_{\mathrm {com}}(\theta _\star )^{-1} - 1)\right)\). Thus, as in the traditional full-data EM algorithm, the efficiency of the online method (42) depends on the amount of missing information. Notably, Lange (1995) considered Newton–Raphson iterations for the M-step of the EM algorithm, and derived an online procedure that is similar to (42).
However, procedure (42) is essentially an explicit stochastic gradient method, and thus it may have serious stability and convergence problems, contrary to the desirable numerical stability of EM. In the exponential family model (39), Nowlan (1991) considered one of the first “true” online EM algorithms as follows:
$$\begin{aligned} \varvec{s}_{n+1}&= \varvec{s}_n + \alpha \mathbb {E}\left( \varvec{s}(\varvec{x}_n, \varvec{y}_n; \varvec{\theta }_n) \right| \varvec{\theta }_n, \varvec{y}_n), \quad&\mathbf{E-step} \nonumber \\ \varvec{\theta }_{n+1}&= \arg \max _{\varvec{\theta }} \ell _{\mathrm {com}}(\varvec{s}_{n+1}; \varvec{\theta }), \quad&\mathbf{M-step} \end{aligned}$$
(43)
where \(\alpha \in (0, 1)\). In words, the algorithm starts from some initial sufficient statistic \(\varvec{s}_0\) and then updates it through a stochastic approximation scheme with a constant step size \(\alpha \). The maximization step is identical to that of traditional EM. Online EM with decreasing step sizes was later developed by Sato and Ishii (2000) as follows:
$$\begin{aligned} \varvec{s}_{n+1}&= \varvec{s}_n + a_n \left[ \mathbb {E}\left( \varvec{s}(\varvec{x}_n, \varvec{y}_n; \varvec{\theta }_n) \right| \varvec{\theta }_n, \varvec{y}_n) - \varvec{s}_n \right] , \,&\mathbf{E-step} \nonumber \\ \varvec{\theta }_{n+1}&= \arg \max _{\varvec{\theta }} \ell _{\mathrm {com}}(\varvec{s}_{n+1}; \varvec{\theta }). \,&\mathbf{M-step} \end{aligned}$$
(44)
By the theory of stochastic approximation, procedure (44) will converge to the observed-data maximum-likelihood estimate \(\hat{\varvec{\theta }}\). In contrast, procedure (43) will not converge with a constant \(\alpha \), but it will reach a point in the vicinity of \(\hat{\varvec{\theta }}\) more rapidly than (44). Further extensions of the aforementioned online EM algorithms have been developed by several authors (Neal and Hinton 1998; Cappé and Moulines 2009). Examples of a growing body of applications of such methods can be found in (Neal and Hinton 1998; Sato and Ishii 2000; Liu et al. 2006; Cappé 2011).
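As an illustration, the following sketch implements the decreasing-step recursion (44) for a generic exponential-family complete-data model; expected_suffstat(theta, y) and m_step(s) are model-specific placeholders for the conditional expectation \(\mathbb {E}(\varvec{s}(\varvec{x}_n, \varvec{y}_n) \,|\, \varvec{\theta }, \varvec{y}_n)\) and for the complete-data maximizer, respectively.

import numpy as np

def online_em(data, expected_suffstat, m_step, s0, alpha=1.0):
    # Online EM with decreasing step sizes a_n = alpha / n (cf. Eq. 44).
    # expected_suffstat(theta, y) computes E[s(x, y) | theta, y]  (stochastic E-step);
    # m_step(s) maximizes the complete-data likelihood at the running statistic s.
    s = np.asarray(s0, dtype=float).copy()
    theta = m_step(s)
    for n, y in enumerate(data, start=1):
        a_n = alpha / n
        s = s + a_n * (expected_suffstat(theta, y) - s)   # stochastic E-step
        theta = m_step(s)                                 # exact M-step
    return theta

For a Gaussian mixture, for example, expected_suffstat would return the responsibility-weighted moments of \(\varvec{y}_n\) and m_step the corresponding mixture weights and means.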

4.2 MCMC sampling

Let \(\varvec{\theta }\) be the model parameters of observations \({\varvec{Y}} = (\varvec{y}_1, \ldots , \varvec{y}_N)\), with an assumed prior distribution denoted by \(\pi (\varvec{\theta })\). A common task in Bayesian statistics is to sample from the posterior distribution \(f(\varvec{\theta }| {\varvec{Y}}) \propto \pi (\varvec{\theta }) f({\varvec{Y}}|\varvec{\theta })\). Hamiltonian Monte-Carlo (HMC) (Neal 2011) is a method in which auxiliary variables \(\varvec{p}\) are introduced alongside the original variables \(\varvec{\theta }\) to improve sampling from \(f(\varvec{\theta }| {\varvec{Y}})\). In the augmented parameter space, we consider a function \(H(\varvec{\theta }, \varvec{p})= U(\varvec{\theta }) + K(\varvec{p}) \in \mathbb {R}^{+}\), where \(U(\varvec{\theta }) = -\log f(\varvec{\theta }| {\varvec{Y}})\) and \(K(\varvec{p}) = (1/2) \varvec{p}^\intercal {\varvec{M}} \varvec{p}\) with a symmetric positive-definite matrix \({\varvec{M}}\). Next, we consider the density
$$\begin{aligned} h(\varvec{\theta }, \varvec{p}| {\varvec{Y}})&= \mathrm {exp} \{-H(\varvec{\theta }, \varvec{p})\} = \mathrm {exp} \{-U(\varvec{\theta }) -K(\varvec{p}) \}\\&= f(\varvec{\theta }|{\varvec{Y}}) \times \mathcal {N}(\varvec{p}, {\varvec{M}}^{-1}). \end{aligned}$$
In this parameterization, the variables \(\varvec{p}\) are independent of \(\varvec{\theta }\). Assuming some initial state \((\varvec{\theta }_0, \varvec{p}_0)\), HMC sampling proceeds in iterations indexed by \(n=1, \cdots \), as follows:
  1. Sample \(\varvec{p}^* \sim \mathcal {N}(\varvec{0}, {\varvec{M}}^{-1})\).
  2. Using Hamiltonian dynamics, compute \((\varvec{\theta }_n, \varvec{p}_n) = \mathrm {ODE}(\varvec{\theta }_{n-1}, \varvec{p}^*)\).
  3. Perform a typical Metropolis-Hastings step for the proposed transition \((\varvec{\theta }_{n-1}, \varvec{p}^*) \rightarrow (\varvec{\theta }_n, \varvec{p}_n)\) with acceptance probability equal to \(\min [1, \mathrm {exp}(-H(\varvec{\theta }_n, \varvec{p}_n) + H(\varvec{\theta }_{n-1}, \varvec{p}^*))]\).
Step 2 is the key idea in HMC. The variables \((\varvec{\theta }, \varvec{p})\) can be mapped to a physical system in which \(\varvec{\theta }\) is the position and \(\varvec{p}\) is the momentum. The Hamiltonian dynamics are a set of ordinary differential equations (ODE) that govern the movement of the system, and thus determine the future values of \((\varvec{\theta }, \varvec{p})\) given a pair of current values. Because the system is closed, its Hamiltonian is constant along the dynamics. Thus, if the ODE in Step 2 is solved exactly, \(-H(\varvec{\theta }_n, \varvec{p}_n) + H(\varvec{\theta }_{n-1}, \varvec{p}^*) =0\) in Step 3, and the acceptance probability is one.
A special case of HMC, called Langevin dynamics, defines the sampling iterations as follows (Girolami and Calderhead 2011):
$$\begin{aligned}&\varvec{\eta }_n \sim \mathcal {N}(\varvec{0}, \epsilon {\varvec{I}}), \nonumber \\&\varvec{\theta }_n= \varvec{\theta }_{n-1}+ \frac{\epsilon }{2} \left( \nabla \log \pi (\varvec{\theta }_{n-1}) + \nabla \log f({\varvec{Y}} | \varvec{\theta }_{n-1})\right) + \varvec{\eta }_n. \end{aligned}$$
(45)
The sampling procedure (45) follows from HMC by solving the ODE in Step 2 of the algorithm numerically with the leapfrog method. The parameter \(\epsilon > 0\) determines the step size of the leapfrog integrator in the numerical solution of the Hamiltonian differential equations.
Welling and Teh (2011) studied a simple modification of Langevin dynamics (45) using a stochastic gradient as follows:
$$\begin{aligned} \varvec{\eta }_n&\sim \mathcal {N}(\varvec{0}, \epsilon _n {\varvec{I}}), \nonumber \\ \varvec{\theta }_n&= \varvec{\theta }_{n-1}+ \frac{\epsilon _n}{2} \left( \nabla \log \pi (\varvec{\theta }_{n-1}) + \frac{N}{b} \sum _{i \in \mathrm {batch}} \nabla \log f(\varvec{y}_i | \varvec{\theta }_{n-1})\right) + \varvec{\eta }_n. \end{aligned}$$
(46)
The step sizes \(\epsilon _n\) satisfy the typical requirements in stochastic approximation, i.e., \(\sum \epsilon _i = \infty \) and \(\sum \epsilon _i^2 < \infty \). Procedure (46) uses stochastic gradients averaged over a mini-batch of \(b\) samples, a device commonly employed in SGD to reduce the noise of the stochastic gradients. Notably, Sato and Nakagawa (2014) proved that procedure (46) converges to the true posterior \(f(\varvec{\theta }| {\varvec{Y}})\) with an elegant use of stochastic calculus. Sampling through stochastic gradient Langevin dynamics has since generated significant work in MCMC sampling for very large datasets, and it is still a rapidly expanding research area with contributions from various disciplines (Hoffman et al. 2013; Pillai and Smith 2014; Korattikara et al. 2014).
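A minimal sketch of procedure (46) follows, assuming user-supplied gradients grad_log_prior and grad_loglik for \(\nabla \log \pi (\varvec{\theta })\) and \(\nabla \log f(\varvec{y}_i | \varvec{\theta })\); the polynomially decaying schedule \(\epsilon _n = \epsilon _0 n^{-0.55}\) satisfies the two conditions above, and burn-in and thinning are left to the caller.

import numpy as np

def sgld(Y, grad_log_prior, grad_loglik, theta0,
         batch_size=32, eps0=1e-3, decay=0.55, n_iter=10_000, rng=None):
    # Stochastic gradient Langevin dynamics (cf. Eq. 46); returns all iterates.
    rng = np.random.default_rng() if rng is None else rng
    N = len(Y)
    theta = np.asarray(theta0, dtype=float).copy()
    samples = []
    for n in range(1, n_iter + 1):
        eps_n = eps0 * n ** (-decay)                 # sum eps = inf, sum eps^2 < inf
        batch = rng.choice(N, size=batch_size, replace=False)
        grad = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_loglik(Y[i], theta) for i in batch)
        noise = rng.normal(scale=np.sqrt(eps_n), size=theta.shape)
        theta = theta + 0.5 * eps_n * grad + noise   # Langevin step, no MH correction
        samples.append(theta.copy())
    return np.array(samples)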

4.3 Reinforcement learning

Reinforcement learning is the multidisciplinary study of how autonomous agents perceive, learn, and interact with their environment (Bertsekas and Tsitsiklis 1995). Typically, it is assumed that time \(t\) proceeds in discrete steps and at every step an agent is at state \(\varvec{x}_t \in \mathcal {X}\), where \(\mathcal {X}\) is some state space. Upon entering a state \(\varvec{x}_t\) two things happen. First, the agent receives a probabilistic reward \(R(\varvec{x}_t) \in \mathbb {R}\), and then takes an action \(a \in \mathcal {A}\), where \(\mathcal {A}\) denotes the action space. This action is determined by the agent’s policy, which is a function \(\pi : \mathcal {X} \rightarrow \mathcal {A}\), thus mapping a state to an action. Nature then decides a transition to state \(\varvec{x}_{t+1}\) through a density \(p(\varvec{x}_{t+1} | \varvec{x}_t)\) that is unknown to the agent.

One important task in reinforcement learning is to estimate the value function \(V^\pi (\varvec{x})\), which quantifies the expected value of a specific state \(\varvec{x} \in \mathcal {X}\) for an agent. This is defined as
$$\begin{aligned} V^{\pi }(\varvec{x}) = \mathbb {E}\left( R(\varvec{x}) \right) + \gamma \mathbb {E}\left( R(\varvec{x}_1) \right) + \gamma ^2 \mathbb {E}\left( R(\varvec{x}_2) \right) + \cdots , \end{aligned}$$
(47)
where \(\varvec{x}_t\) denotes the state that is reached starting at \(\varvec{x}\) after \(t\) transitions, and \(\gamma \in (0, 1)\) is a parameter that discounts future rewards. Note that the variation of \(R(\varvec{x}_t)\) includes both the uncertainty in the state \(\varvec{x}_t\), due to the stochasticity of transitions, and the uncertainty from the reward distribution. Thus, \(V^\pi (\varvec{x})\) admits a recursive definition as follows:
$$\begin{aligned} V^{\pi }(\varvec{x}) = \mathbb {E}\left( R(\varvec{x}) \right) + \gamma \mathbb {E}\left( V^{\pi }(\varvec{x}_1) \right) . \end{aligned}$$
(48)
When the state is a high-dimensional vector, one popular approach is to use the linear value approximation \(V(\varvec{x}) = \varvec{\theta _\star }^\intercal \phi (\varvec{x})\), where \(\phi (\varvec{x})\) maps a state to features in a space with fewer dimensions, and \(\varvec{\theta _\star }\) is a vector of fixed parameters. If an agent is at state \(\varvec{x}_t\), then the recursive equation (48) can be rewritten as
$$\begin{aligned} \mathbb {E}\left( R(\varvec{x}_t) - (\varvec{\theta _\star }^\intercal \varvec{\phi }_{t} - \gamma \varvec{\theta _\star }^\intercal \varvec{\phi }_{t+1}) \right| \varvec{\phi }_{t}) = 0, \end{aligned}$$
(49)
where we set \(\varvec{\phi }_t = \phi (\varvec{x}_t)\) for notational convenience. Similar to SGD, this suggests a stochastic approximation method to estimate \(\varvec{\theta _\star }\) through the following iteration:
$$\begin{aligned} \varvec{\theta }_{t+1} = \varvec{\theta }_t + a_t \left[ R(\varvec{x}_{t}) - \left( \varvec{\theta }_{t}^\intercal \varvec{\phi }_{t} - \gamma \varvec{\theta }_{t}^\intercal \varvec{\phi }_{t+1} \right) \right] \varvec{\phi }_{t}, \end{aligned}$$
(50)
where \(a_t\) is a learning rate sequence that satisfies the Robbins–Monro conditions (see Sect. 2.1). Equation (50) is known as the temporal differences (TD) learning algorithm (Sutton 1988). Implicit versions of this algorithm have recently emerged in order to solve some of the known stability issues of the classical TD algorithm (Schapire and Warmuth 1996; Li 2008; Wang and Bertsekas 2013; Tamar et al. 2014). For example, Tamar et al. (2014) consider computing the term \(\varvec{\theta }_t^\intercal \varvec{\phi }_t\) at the future iterate, and thus the resulting implicit TD algorithm is
$$\begin{aligned} \varvec{\theta }_{t+1} = ({\varvec{I}} + a_t \varvec{\phi }_t \varvec{\phi }_t^\intercal )^{-1} \left[ \varvec{\theta }_t + a_t ( R(\varvec{x}_{t}) + \gamma \varvec{\theta }_{t}^\intercal \varvec{\phi }_{t+1}) \varvec{\phi }_{t} \right] . \end{aligned}$$
(51)
Similar to implicit SGD, iteration (51) stabilizes the TD iterations. With the advent of online multiagent markets, methods and applications in reinforcement learning have been receiving a renewed stream of research effort (Gosavi 2009).
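The following sketch contrasts one step of the explicit TD(0) update (50) with the implicit update (51); since \({\varvec{I}} + a_t \varvec{\phi }_t \varvec{\phi }_t^\intercal \) is a rank-one perturbation of the identity, the inverse in (51) can be applied through the Sherman–Morrison identity without forming a matrix. The feature vectors and rewards are supplied by the caller, and the discount factor value is illustrative.

import numpy as np

def td_step(theta, phi_t, phi_next, reward, a_t, gamma=0.99):
    # One explicit TD(0) update (cf. Eq. 50).
    td_error = reward + gamma * (theta @ phi_next) - theta @ phi_t
    return theta + a_t * td_error * phi_t

def implicit_td_step(theta, phi_t, phi_next, reward, a_t, gamma=0.99):
    # One implicit TD(0) update (cf. Eq. 51); the Sherman-Morrison identity
    # applies (I + a_t phi_t phi_t')^{-1} to the bracketed vector directly.
    v = theta + a_t * (reward + gamma * (theta @ phi_next)) * phi_t
    shrink = a_t * (phi_t @ v) / (1.0 + a_t * (phi_t @ phi_t))
    return v - shrink * phi_t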

4.4 Deep learning

Deep learning is the task of estimating parameters of statistical models that can be represented by multiple layers of nonlinear operations, such as neural networks (Bengio 2009). Such models, also referred to as deep architectures, consist of units that can perform a basic prediction task, and are grouped in layers such that the output of one layer forms the input of another layer that sits directly on top. Furthermore, in most situations the models are augmented with latent units that are defined to represent structured quantities of interest, such as edges or shapes in an image.

One basic building block of deep architectures is the Restricted Boltzmann Machine (RBM). The complete-data density for an observation \((\varvec{x}, \varvec{y})\) of the states of hidden and observed input units, respectively, is given by
$$\begin{aligned} P(\varvec{x}, \varvec{y}; \varvec{\theta }) = \frac{\exp \{ -\varvec{b}' \varvec{y}- \varvec{c}'\varvec{x} - \varvec{x}' \varvec{W} \varvec{y}\}}{Z(\varvec{\theta })}, \end{aligned}$$
(52)
where \(\varvec{\theta }= (\varvec{b}, \varvec{c}, \varvec{W})\) are the model parameters, and the normalizing constant is \(Z(\varvec{\theta }) = \sum _{\varvec{x}, \varvec{y}} \exp \{ -\varvec{b}' \varvec{y}- \varvec{c}'\varvec{x} - \varvec{x}' \varvec{W} \varvec{y}\}\) (also known as the partition function). Furthermore, the sample spaces for \(\varvec{x}\) and \(\varvec{y}\) are discrete (e.g., binary) and finite. The observed-data density is thus \(P(\varvec{y}; \varvec{\theta }) = \sum _{\varvec{x}} P(\varvec{x}, \varvec{y};\varvec{\theta })\). Let \(H(\varvec{x}, \varvec{y}; \varvec{\theta }) = \varvec{b}' \varvec{y}+ \varvec{c}'\varvec{x} + \varvec{x}' \varvec{W} \varvec{y}\), such that \(P(\varvec{x}, \varvec{y}; \varvec{\theta }) = \frac{e^{-H(\varvec{x}, \varvec{y}; \varvec{\theta })}}{Z(\varvec{\theta })}\). Through simple algebra one can obtain the gradient of the log-likelihood of an observed sample \(\varvec{y}^{\mathrm {obs}}\) in the following convenient form:
$$\begin{aligned} \nabla \ell (\varvec{\theta }; \varvec{y}^{\mathrm {obs}}) = \mathbb {E}\left( \nabla H(\varvec{x}, \varvec{y}; \varvec{\theta }) \right) - \mathbb {E}\left( \nabla H(\varvec{x}, \varvec{y}; \varvec{\theta }) \right| \varvec{y}^{\mathrm {obs}}). \end{aligned}$$
(53)
In practical situations, the data \(\varvec{x}, \varvec{y}\) are binary. Therefore, the conditional distribution of the missing data \(\varvec{x}| \varvec{y}\) is readily available through the usual logistic regression GLM model, and thus the second term of (53) is easy to sample from. Similarly, \(\varvec{y}|\varvec{x}\) is easy to sample from. The first term in (53), however, requires sampling from the joint distribution of the complete data \((\varvec{x}, \varvec{y})\), which is conceptually easy using the aforementioned conditionals and a Gibbs sampling scheme (Geman and Geman 1984). In practice, the state space for both \(\varvec{x}\) and \(\varvec{y}\) is usually very large, e.g., comprised of thousands or millions of units, and thus full Gibbs sampling on the joint distribution is usually infeasible.
The method of contrastive divergence (Hinton 2002; Carreira-Perpinan and Hinton 2005) has been applied for training such models with considerable success. The algorithm proceeds as follows for steps \(i=1, 2,\ldots \):
  1. Sample one state \(\varvec{y}^{\mathrm {( i)}}\) from the empirical distribution of observed states.
  2. Sample the hidden state \(\varvec{x}^{\mathrm {( i)}} | \varvec{y}^{\mathrm {( i)}}\).
  3. Sample \(\varvec{y}^{\mathrm {({ i}, new)}} | \varvec{x}^{\mathrm {( i)}}\).
  4. Sample \(\varvec{x}^{\mathrm {({ i}, new)}} | \varvec{y}^{\mathrm {({ i}, new)}}\).
  5. Evaluate the gradient (53) using \((\varvec{x}^{\mathrm {( i)}}, \varvec{y}^{\mathrm {( i)}})\) for the second term, and the sample \((\varvec{x}^{\mathrm {({ i}, new)}}, \varvec{y}^{\mathrm {({ i}, new)}})\) for the first term.
  6. Update the parameters in \(\varvec{\theta }\) using constant step-size SGD and the estimated gradient from Step 5.
In other words, contrastive divergence estimates both terms of (53). This estimation is biased because \((\varvec{x}^{\mathrm {({ i}, new)}}, \varvec{y}^{\mathrm {({ i}, new)}})\) is treated as a draw from the full joint distribution of \((\varvec{x}, \varvec{y})\), even though it is produced by only a single Gibbs transition. In fact, contrastive divergence can operate with \(k\) steps, in which Steps 3–4 are repeated \(k\) times in an effort to approximate the joint distribution better by letting the chain run longer. Although in theory larger \(k\) should approximate the full joint better, it has been observed that \(k=1\) is enough for good performance in many learning tasks (Hinton 2002; Taylor et al. 2006; Salakhutdinov et al. 2007; Bengio 2009; Bengio and Delalleau 2009).
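As an illustration, the sketch below performs one CD-1 update for a binary RBM. For readability it uses the more common parameterization \(P(\varvec{x}, \varvec{y}) \propto \exp \{\varvec{b}'\varvec{y} + \varvec{c}'\varvec{x} + \varvec{x}'{\varvec{W}}\varvec{y}\}\) (positive signs in the exponent, i.e., the parameters of (52) with flipped sign), so the update takes the familiar data-phase-minus-reconstruction-phase form; the learning rate and the use of hidden-unit probabilities rather than samples in the gradient are conventional implementation choices, not part of (53).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, y_obs, lr=0.01, rng=None):
    # One CD-1 update for a binary RBM with P(x, y) proportional to
    # exp(b'y + c'x + x'Wy); y_obs has shape (p,), hidden x has shape (q,), W is (q, p).
    rng = np.random.default_rng() if rng is None else rng
    # data phase: sample hidden units given the observed visible state
    p_x = sigmoid(c + W @ y_obs)
    x = (rng.random(p_x.shape) < p_x).astype(float)
    # reconstruction phase: one Gibbs sweep y_new | x, then probabilities of x_new | y_new
    p_y_new = sigmoid(b + W.T @ x)
    y_new = (rng.random(p_y_new.shape) < p_y_new).astype(float)
    p_x_new = sigmoid(c + W @ y_new)
    # approximate gradient: data-phase statistics minus reconstruction-phase statistics
    W = W + lr * (np.outer(p_x, y_obs) - np.outer(p_x_new, y_new))
    b = b + lr * (y_obs - y_new)
    c = c + lr * (p_x - p_x_new)
    return W, b, c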

Footnotes

  1. Second-order methods typically use the Hessian matrix of second-order derivatives of the log-likelihood and are discussed in detail in Sect. 3.
  2. Procedure (2) is actually an ascent algorithm because it aims to maximize the log-likelihood, and thus a more appropriate name would be stochastic gradient ascent. However, we will use the term “descent” in order to keep in line with the relevant optimization literature, which traditionally considers minimization problems through descent algorithms.
  3. The solution of the fixed-point equation (3) requires additional computations per iteration. However, Toulis et al. (2014) derive a computationally efficient implicit algorithm in the context of generalized linear models. Furthermore, approximate solutions of implicit updates are possible for any statistical model (see Eq. (4)).
  4. This is an important distinction because, traditionally, the focus in optimization has been to obtain fast convergence to some point \(\widehat{\varvec{\theta }}\) that minimizes the empirical loss, e.g., the maximum-likelihood estimator. From a statistical viewpoint, under variability of the data, there is a trade-off between convergence to an estimator and its asymptotic variance (Le et al. 2004).
  5. Similarly, a sequence of matrices \({\varvec{C}}_n\) can be designed such that \({\varvec{C}}_n \rightarrow \varvec{\mathcal {I}}(\varvec{\theta _\star })^{-1}\) (Sakrison 1965).
  6. The acronym ASGD is also used in machine learning to denote asynchronous SGD, i.e., a variant of SGD that can be parallelized on multiple machines. We will not consider this variant here.


Acknowledgments

The authors wish to thank Leon Bottou, Bob Carpenter, David Dunson, Andrew Gelman, Brian Kulis, Xiao-Li Meng, Natesh Pillai, Neil Shephard, Daniel Sussman and Alexander Volfovsky for useful comments and discussion. This research was sponsored, in part, by NSF CAREER award IIS-1149662, ARO MURI award W911NF-11-1-0036, and ONR YIP award N00014-14-1-0485. PT is a Google Fellow in Statistics. EMA is an Alfred P. Sloan Research Fellow.

References

  1. Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
  2. Amari, S.-I., Park, H., Fukumizu, K.: Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Comput. 12(6), 1399–1409 (2000)
  3. Bather, J.A.: Stochastic Approximation: A Generalisation of the Robbins–Monro Procedure, vol. 89. Cornell University, Mathematical Sciences Institute, New York (1989)
  4. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
  5. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)
  6. Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. 21(6), 1601–1621 (2009)
  7. Benveniste, A., Métivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, New York (2012)
  8. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-dynamic programming: an overview. In: Proceedings of the 34th IEEE Conference on Decision and Control, vol. 1, pp. 560–564 (1995)
  9. Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 10, 1737–1754 (2009)
  10. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer, New York (2010)
  11. Bottou, L., Le Cun, Y.: On-line learning for very large data sets. Appl. Stoch. Models Bus. Ind. 21(2), 137–151 (2005)
  12. Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. Adv. Neural Inf. Process. Syst. 20, 161–168 (2008)
  13. Broyden, C.G.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. 19, 577–593 (1965)
  14. Cappé, O.: Online EM algorithm for hidden Markov models. J. Comput. Graph. Stat. 20(3), 728–749 (2011)
  15. Cappé, O., Moulines, E.: On-line expectation–maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B 71(3), 593–613 (2009)
  16. Carreira-Perpinan, M.A., Hinton, G.E.: On contrastive divergence learning. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 33–40 (2005)
  17. Cheng, L., Vishwanathan, S.V.N., Schuurmans, D., Wang, S., Caelli, T.: Implicit online learning with kernels. In: Advances in Neural Information Processing Systems 19, p. 249. MIT Press, Cambridge (2007)
  18. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25, 463–483 (1954)
  19. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38 (1977)
  20. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
  21. Dupuis, P., Simha, R.: On sampling controlled stochastic approximation. IEEE Trans. Autom. Control 36(8), 915–924 (1991)
  22. El Karoui, N.: Spectrum estimation for large dimensional covariance matrices using random matrix theory. Ann. Stat. 36, 2757–2790 (2008)
  23. Fabian, V.: On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39, 1327–1332 (1968)
  24. Fabian, V.: Asymptotically efficient stochastic approximation; the RM case. Ann. Stat. 1, 486–495 (1973)
  25. Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A 222, 309–368 (1922)
  26. Fisher, R.A.: Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh (1925a)
  27. Fisher, R.A.: Theory of statistical estimation. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 22, pp. 700–725. Cambridge University Press, Cambridge (1925b)
  28. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
  29. George, A.P., Powell, W.B.: Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Mach. Learn. 65(1), 167–198 (2006)
  30. Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B 73(2), 123–214 (2011)
  31. Gosavi, A.: Reinforcement learning: a tutorial survey and recent advances. INFORMS J. Comput. 21(2), 178–192 (2009)
  32. Green, P.J.: Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. R. Stat. Soc. Ser. B 46, 149–192 (1984)
  33. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2011)
  34. Hennig, P., Kiefel, M.: Quasi-Newton methods: a new direction. J. Mach. Learn. Res. 14(1), 843–865 (2013)
  35. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
  36. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)
  37. Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
  38. Huber, P.J.: Robust Statistics. Springer, New York (2011)
  39. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)
  40. Kivinen, J., Warmuth, M.K.: Additive versus exponentiated gradient updates for linear prediction. In: Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pp. 209–218 (1995)
  41. Kivinen, J., Warmuth, M.K., Hassibi, B.: The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. Signal Process. 54(5), 1782–1793 (2006)
  42. Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis–Hastings budget. In: Proceedings of the 31st International Conference on Machine Learning, pp. 181–189 (2014)
  43. Kulis, B., Bartlett, P.L.: Implicit online learning. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 575–582 (2010)
  44. Lai, T.L., Robbins, H.: Adaptive design and stochastic approximation. Ann. Stat. 7, 1196–1221 (1979)
  45. Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B 57, 425–437 (1995)
  46. Lange, K.: Numerical Analysis for Statisticians. Springer, New York (2010)
  47. Le Cun, Y., Bottou, L.: Large scale online learning. Adv. Neural Inf. Process. Syst. 16, 217 (2004)
  48. Lehmann, E.L., Casella, G.: Theory of Point Estimation, 2nd edn. Springer, New York (2003)
  49. Li, L.: A worst-case comparison between temporal difference and residual gradient with linear function approximation. In: Proceedings of the 25th International Conference on Machine Learning, pp. 560–567 (2008)
  50. Liu, Z., Almhana, J., Choulakian, V., McGorman, R.: Online EM algorithm for mixture with application to internet traffic modeling. Comput. Stat. Data Anal. 50(4), 1052–1071 (2006)
  51. Ljung, L., Pflug, G., Walk, H.: Stochastic Approximation and Optimization of Random Systems, vol. 17. Springer, New York (1992)
  52. Martin, R.D., Masreliez, C.: Robust estimation via stochastic approximation. IEEE Trans. Inf. Theory 21(3), 263–271 (1975)
  53. Murata, N.: A Statistical Study of On-line Learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998)
  54. Nagumo, J.-I., Noda, A.: A learning method for system identification. IEEE Trans. Autom. Control 12(3), 282–287 (1967)
  55. National Research Council: Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC (2013)
  56. Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368. Springer, New York (1998)
  57. Neal, R.M.: MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, vol. 2 (2011)
  58. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)
  59. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
  60. Nevelson, M.B., Khasminskiĭ, R.Z.: Stochastic Approximation and Recursive Estimation, vol. 47. American Mathematical Society, Providence (1973)
  61. Nowlan, S.J.: Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. Carnegie Mellon University, Pittsburgh (1991)
  62. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
  63. Pillai, N.S., Smith, A.: Ergodicity of approximate MCMC chains with applications to large data sets. arXiv preprint http://arxiv.org/abs/1405.0182 (2014)
  64. Polyak, B.T., Tsypkin, Y.Z.: Adaptive algorithms of estimation (convergence, optimality, stability). Autom. Remote Control 3, 74–84 (1979)
  65. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)
  66. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
  67. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
  68. Rosasco, L., Villa, S., Công Vũ, B.: Convergence of stochastic proximal gradient algorithm. arXiv preprint http://arxiv.org/abs/1403.5074 (2014)
  69. Ruppert, D.: Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering (1988)
  70. Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Working paper. http://web.stanford.edu/~eryu/papers/spi.pdf (2014)
  71. Sacks, J.: Asymptotic distribution of stochastic approximation procedures. Ann. Math. Stat. 29(2), 373–405 (1958)
  72. Sakrison, D.J.: Efficient recursive estimation; application to estimating the parameters of a covariance function. Int. J. Eng. Sci. 3(4), 461–483 (1965)
  73. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning, pp. 791–798 (2007)
  74. Sato, M.-A., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)
  75. Sato, I., Nakagawa, H.: Approximation analysis of stochastic gradient Langevin dynamics by using Fokker–Planck equation and Itô process. JMLR W&CP 32(1), 982–990 (2014)
  76. Schapire, R.E., Warmuth, M.K.: On the worst-case analysis of temporal-difference learning algorithms. Mach. Learn. 22(1–3), 95–121 (1996)
  77. Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. arXiv preprint http://arxiv.org/abs/1206.1106 (2012)
  78. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Meila, M., Shen, X. (eds.) Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 2, pp. 436–443. San Juan, Puerto Rico (2007)
  79. Slock, D.T.M.: On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Trans. Signal Process. 41(9), 2811–2825 (1993)
  80. Sutton, R.S.: Learning to predict by the methods of temporal differences. Mach. Learn. 3(1), 9–44 (1988)
  81. Tamar, A., Toulis, P., Mannor, S., Airoldi, E.: Implicit temporal differences. In: Neural Information Processing Systems, Workshop on Large-Scale Reinforcement Learning (2014)
  82. Taylor, G.W., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary latent variables. Adv. Neural Inf. Process. Syst. 19, 1345–1352 (2006)
  83. Titterington, D.M.: Recursive parameter estimation using incomplete data. J. R. Stat. Soc. Ser. B 46, 257–267 (1984)
  84. Toulis, P., Airoldi, E.M.: Implicit stochastic gradient descent for principled estimation with large datasets. arXiv preprint http://arxiv.org/abs/1408.2923 (2014)
  85. Toulis, P., Airoldi, E., Rennie, J.: Statistical analysis of stochastic gradient methods for generalized linear models. JMLR W&CP 32(1), 667–675 (2014)
  86. Venter, J.H.: An extension of the Robbins–Monro procedure. Ann. Math. Stat. 38, 181–190 (1967)
  87. Wang, C., Chen, X., Smola, A., Xing, E.: Variance reduction for stochastic gradient optimization. Adv. Neural Inf. Process. Syst. 26, 181–189 (2013)
  88. Wang, M., Bertsekas, D.P.: Stabilization of stochastic iterative methods for singular and nearly singular linear systems. Math. Oper. Res. 39(1), 1–30 (2013)
  89. Wei, C.Z.: Multivariate adaptive stochastic approximation. Ann. Stat. 3, 1115–1130 (1987)
  90. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)
  91. Xu, W.: Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint http://arxiv.org/abs/1107.2490 (2011)
  92. Younes, L.: On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics 65(3–4), 177–228 (1999)
  93. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116 (2004)

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Statistics, Harvard University, Cambridge, USA
