1 Optimization scheme

In machine learning and data science, we often encounter problems of the form:

$$\begin{aligned} \min _{\theta \in {\mathbb {R}}^d} \ell (\theta ) \quad \text {for} \quad \ell (\theta ) = \tfrac{1}{n}\textstyle \sum _{j=1}^n \log g_j(\theta ) \end{aligned}$$
(1)

where each log-convex function \(g_j \in C^2({\mathbb {R}}^d)\) corresponds to the loss accrued by an observation or sample at parameter value \(\theta\) for \(1\le j \le n\). Examples include multiple linear regression and maximum likelihood estimation for the exponential family (see Sect. 6 for details). In this paper, we examine an online learning regime where data arrives asynchronously in a stream, or where n is large enough that samples must be processed in batches.

We consider a sub-sampled Newton method that leverages stochastic estimates for both the gradient and Hessian [19]. At each step \(t\ge 1\), we begin with the previous parameter estimate \(\theta _{t-1}\), obtain a uniform random sample \({\mathcal S_t \subset \{1,\dotsc , n\}}\), and calculate

$$\begin{aligned} f_t&= \textstyle \tfrac{1}{\left|\mathcal S_t\right|} \sum _{j\in {\mathcal S_t}} \nabla \log g_j( \theta _{t-1} ), \end{aligned}$$
(2a)
$$\begin{aligned} Q_t&= \textstyle \tfrac{1}{\left|\mathcal S_t\right|} \sum _{j \in {\mathcal S_t}} \nabla ^2 \log g_j( \theta _{t-1}), \end{aligned}$$
(2b)

where \(\nabla \log g_j = \nabla g_j / g_j\) and \(\nabla ^2 \log g_j = (g_j \nabla ^2 g_j - \nabla g_j( \nabla g_j)^\intercal )/ g_j^2\) denote the gradient and positive-definite Hessian of \(\log g_j\), respectively. In this way, we form a step direction \(-Q_t^{-1}f_t\) using only information available from the current batch \({\mathcal S_t}\). For modern applications, computer hardware limitations often constrain the batch size \(\left|\mathcal S_t\right|\) to be much less than n.

Given the descent direction \(-Q_t^{-1}f_t\), we perform an Armijo-style [5] backtracking line search (see Algorithm 1 for particulars) using the function \(\tfrac{1}{\left|\mathcal S_t\right|} \sum _{j\in {\mathcal S_t}} \log g_j\) to determine a good step size \(0<\lambda _t<1\) prior to updating

$$\begin{aligned} \theta _t = \theta _{t-1} - \lambda _t Q_t^{-1}f_t. \end{aligned}$$
(3)

Proceeding in this way, each optimization step performs a Newton update on a subsampled surrogate of the true objective.
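To make the update concrete, here is a minimal NumPy sketch of one such step, assuming callables log_g(theta, j), grad_log_g(theta, j), and hess_log_g(theta, j) that evaluate \(\log g_j\) and its derivatives; these names, the initial trial step, and the Armijo constants are illustrative rather than taken from Algorithm 1.

```python
import numpy as np

def newton_step(theta, batch, log_g, grad_log_g, hess_log_g, c=1e-4, tau=0.5):
    """One sub-sampled Newton step: eqs. (2a), (2b), and update (3).

    log_g(theta, j), grad_log_g(theta, j), and hess_log_g(theta, j) return
    log g_j and its gradient/Hessian at theta for sample index j."""
    f = np.mean([grad_log_g(theta, j) for j in batch], axis=0)   # (2a)
    Q = np.mean([hess_log_g(theta, j) for j in batch], axis=0)   # (2b)
    step = np.linalg.solve(Q, f)              # the step direction is -step
    obj = lambda th: np.mean([log_g(th, j) for j in batch])
    lam, base = 1.0, obj(theta)               # initial trial step is an assumed default
    # Armijo condition: sufficient decrease along -step; f @ step > 0 since Q > 0
    while obj(theta - lam * step) > base - c * lam * (f @ step):
        lam *= tau                            # backtrack
    return theta - lam * step                 # update (3)
```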

Given some initialization \(\theta _0\), this method produces a sequence of estimates \(\theta _1, \theta _2, \dotsc\) that under certain conditions tends towards the solution to problem (1). For a precise analysis of this second-order approach to stochastic optimization (in the less restrictive setting that the functions \(g_j\) are convex), see Roosta-Khorasani and Mahoney [65] and Bollapragada, Byrd, and Nocedal [13].

Thesis and outline. Exchanging the full objective function for subsampled versions of it offers computational and practical benefits, but incurs a cost in terms of the reliability of the computed updates. In particular, the sub-sampled estimates (2a) and (2b) may prove quite noisy, hindering progress towards the optimum. This paper adapts a Bayesian filtering strategy in an attempt to mitigate this issue. We begin with a discussion of related work in the next section and then introduce discriminative Bayesian filtering in Sect. 3. We recast the optimization process described in this section as a discriminative filtering problem in Sect. 4, leading to an algorithm that calculates a step direction using the entire history of sub-sampled gradients and Hessians. In Sect. 5, we establish technical conditions under which the proposed algorithm behaves similarly to Polyak’s momentum. In Sect. 6, we compare the standard approach outlined in this section to our proposed, filtered method using an online linear regression problem with synthetic data, before drawing conclusions in Sect. 7.

2 Related work

In this section, we provide a brief overview of related work, separated thematically into paragraphs.

Filtering methods have previously been applied to stochastic optimization problems, with notable success. Houlsby and Blei [33] characterized online stochastic variational inference [31] as a (non-discriminative) filtering problem using the standard Kalman filter where the covariance matrix was restricted to be isotropic and demonstrated promising results training both latent Dirichlet allocation [11, 30, 63] and Bayesian matrix factorization models [29]. For least squares problems, Bertsekas [9] demonstrated how the extended Kalman filter could be applied to form batch-based updates. More recently, Akyıldız [3] and Liu [42] developed filtered versions of the incremental proximal method [10]. In a more general setting, Stinis [70] phrased stochastic optimization as a filtering problem and proposed particle filter-based inference [77, 78].

While momentum and momentum-like approaches have been thoroughly explored for stochastic problems in general [14, 24, 35, 62, 66, 67, 69, 71] and for the stochastic Newton method when restricted to solving linear systems [43], momentum for more general cases of stochastic Newton has received comparatively little attention.

As the parameter space becomes high-dimensional, the computational cost of inverting the Hessian matrix grows cubically. Hessian-free approaches entirely circumvent the construction and subsequent inversion of the Hessian [47, 52, 75] by directly computing matrix-vector products using the conjugate gradient method or the Pearlmutter trick [58]. Berahas, Bollapragada, and Nocedal [7] explore sketching [1, 44, 55, 56, 59] as an alternative to sub-sampling.

Backtracking line search plays an important role in Roosta-Khorasani and Mahoney’s [65] convergence results and inspired the use of line search in this work. In contrast to the more traditional stochastic approximation results that stipulate \(\sum _{t=1}^\infty a_t = \infty\) and \(\sum _{t=1}^\infty a_t^2 < \infty\) where \(a_t>0\) are step sizes [64], many variants of stochastic Newton use either line search or fixed step lengths. In the stochastic setting, line search remains an area of active research [8, 45, 57, 73].

Other recent innovations for the stochastic Newton method include non-uniform [76] and adaptive sampling strategies [12, 26] for the batches \({\mathcal S_t}\), low-rank approximation for the sub-sampled Hessians [25], and alternate formulations for the inverse Hessian [2]. Any of these approaches could be applied to the method we develop in the remainder of this paper.

[Algorithm 1 (pseudocode figure): the sub-sampled Newton method with backtracking line search]

3 Discriminative Bayesian filtering

Consider a state-space model relating a sequence \(Z_{1:t} = Z_1,Z_2, \dotsc ,Z_t\) of latent random variables to a corresponding sequence of observed measurements \(X_{1:t} = X_1,X_2,\dotsc ,X_t\) according to the Bayesian network:

[Figure: Bayesian network for the state-space model, with edges \(Z_{t-1} \rightarrow Z_t\) and \(Z_t \rightarrow X_t\)]

At each successive point t in time, filtering aims to infer the current hidden state \(Z_t\) given all currently available measurements \(X_{1:t}\). We find that such an estimate often provides more accurate and more stable performance than an estimate for \(Z_t\) given only the most recent measurement \(X_t\). This is expected, as we know that conditioning reduces entropy [21, thm 2.6.5] and that the law of total variance implies

$$\begin{aligned} {\mathbb {E}}[{\mathbb {V}}[Z_t |X_{1:t}]] \le {\mathbb {E}}[{\mathbb {V}}[Z_t |X_t]] \end{aligned}$$

where we use \({\mathbb {E}}[\cdot ]\) and \({\mathbb {V}}[\cdot ]\) to denote expectation and (co)variance, respectively. In particular, conditioning also reduces variance on average.
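To spell out the step (a standard argument, included here for completeness), apply the law of total variance conditionally on \(X_t\):

$$\begin{aligned} {\mathbb {V}}[Z_t|X_t] = {\mathbb {E}}[{\mathbb {V}}[Z_t |X_{1:t}] \,|\, X_t] + {\mathbb {V}}[{\mathbb {E}}[Z_t |X_{1:t}] \,|\, X_t] \ge {\mathbb {E}}[{\mathbb {V}}[Z_t |X_{1:t}] \,|\, X_t], \end{aligned}$$

and take expectations of both sides over \(X_t\).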

In Bayesian filtering, inference takes a distributional form. Given a state model \(p(z_t|z_{t-1})\) that describes the evolution of the latent state and a measurement model \(p(x_t|z_t)\) that relates the current observation and current latent state, Bayesian filtering methods iteratively infer or approximate the posterior distribution \(p(z_t | x_{1:t})\) of the current latent state given all available measurements at the current point in time. To this end, the Chapman–Kolmogorov recursion

$$\begin{aligned} p(z_t |x_{1:t}) \propto p(x_t|z_t) \int p(z_t|z_{t-1}) p(z_{t-1}|x_{1:t-1}) \, dz_{t-1} \end{aligned}$$
(4)

relates the current and previous posteriors in terms of the state and measurement models, up to a constant depending on the observations alone. The Kalman filter provides a quintessential example, where both the state and measurement models are chosen to be linear and Gaussian [37]. For nonlinear Gaussian models, the extended Kalman filter performs linearization prior to applying the standard Kalman updates. In general, the integrals required to compute (4) prove intractable. Assumed density filters employ variational methods to fit models to a tractable family of distributions [34, 40], sigma-point filters such as the unscented Kalman filter apply quadrature [36, 51], and particle filters perform Monte Carlo integration [27, 28]. For comprehensive surveys of Bayesian filtering, consult Chen [20] and Särkkä [68].

In some cases, it may be easier to calculate or approximate \(p(z_t|x_t)\) than the typical observation model \(p(x_t|z_t)\). In order to use the conditional distribution of latent states given observations for filtering, we may apply Bayes’ rule to find that \(p(x_t|z_t) \propto p(z_t|x_t)/p(z_t)\), where the constant of proportionality depends on \(x_t\) alone, and re-write (4) as

$$\begin{aligned} p(z_t |x_{1:t}) \propto \frac{p(z_t|x_t)}{p(z_t)} \int p(z_t|z_{t-1}) p(z_{t-1}|x_{1:t-1}) \, dz_{t-1}. \end{aligned}$$
(5)

We characterize discriminative filtering frameworks as those that exchange a generative model (in the sense of Ng and Jordan [54]) for the ability to use \(p(z_t|x_t)\) for inference. Well-known examples include maximum entropy Markov models [49] and conditional random fields [41], with applications including natural language processing (ibid.), gene prediction [22, 74], human motion tracking [38, 72], and neural modeling [6, 16].

In this paper, we focus on the Discriminative Kalman Filter (DKF) [17, 18] that specifies both the state and discriminative observation models as Gaussian:

$$\begin{aligned} p(z_t|z_{t-1})& = \eta _d(z_t; \, Az_{t-1},\Gamma ), \end{aligned}$$
(6)
$$\begin{aligned} p(z_t|x_t)& = \eta _d(z_t; \, f(x_t),Q(x_t)), \end{aligned}$$
(7)

where \(A\in {\mathbb {R}}^{d\!\times \!d}\) and \(\Gamma \in {\mathbb {S}}_d\) parameterize the Kalman state model for the set \({\mathbb {S}}_d\) of valid \(d\!\times \!d\) covariance matrices, \(f:{\mathcal {X}\rightarrow {\mathbb {R}}^d}\) and \(Q:{\mathcal {X}\rightarrow {\mathbb {S}}_d}\) parameterize the discriminative model for an abstract space \({\mathcal {X}}\), and \(\eta _d(\cdot ; \mu , \Sigma )\) denotes the d-dimensional Gaussian density function with mean \(\mu \in {\mathbb {R}}^d\) and covariance \(\Sigma \in {\mathbb {S}}_d\). With initialization \(p(z_0)=\eta _d(z_0;\mathbf {0}, S)\) where \(S\in {\mathbb {S}}_d\) satisfies \(S=ASA^\intercal + \Gamma\), the unconditioned latent process is stationary. The function f here may be non-linear. If the posterior at time \(t-1\),

$$\begin{aligned} p(z_{t-1}|x_{1:t-1}) \approx \eta _d( z_{t-1}; \mu _{t-1}, \Sigma _{t-1}), \end{aligned}$$
(8)

is approximately Gaussian, then it follows from the model (6)–(7) and the recursion (5) that the posterior at time t,

$$\begin{aligned} p(z_{t}|x_{1:t}) \approx \eta _d( z_t; \mu _t, \Sigma _t), \end{aligned}$$
(9)

can also be approximated as Gaussian, where

$$\begin{aligned} R_{t-1}&= A\Sigma _{t-1}A^\intercal +\Gamma , \end{aligned}$$
(10a)
$$\begin{aligned} \Sigma _t&= (Q(x_t)^{-1}+R_{t-1}^{-1}-S^{-1})^{-1} , \end{aligned}$$
(10b)
$$\begin{aligned} \mu _t&= \Sigma _t(Q(x_t)^{-1}f(x_t) + R_{t-1}^{-1}A\mu _{t-1}). \end{aligned}$$
(10c)

In fact, this approximation is exact when the matrix \(Q(x_t)^{-1} - S^{-1}\) is positive definite [18, p. 973]; if this fails to be the case, the DKF specifies \(\Sigma _t = (Q(x_t)^{-1}+R_{t-1}^{-1})^{-1}\) in place of (10b). In this way, closed-form updates for the DKF’s posterior require only the inversion and multiplication of \(d\!\times \!d\) matrices, upon evaluation of the functions f and Q.
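As a concrete reference, the following sketch transcribes the update (10a)–(10c), including the fallback covariance for when \(Q(x_t)^{-1} - S^{-1}\) fails to be positive definite; the function name and interface are ours, not part of any library.

```python
import numpy as np

def dkf_update(mu_prev, Sigma_prev, f_t, Q_t, A, Gamma, S):
    """One Discriminative Kalman Filter step, eqs. (10a)-(10c).

    (mu_prev, Sigma_prev) parameterize the Gaussian posterior at t-1;
    f_t = f(x_t) and Q_t = Q(x_t) come from the discriminative model (7)."""
    R = A @ Sigma_prev @ A.T + Gamma                      # (10a)
    Q_inv = np.linalg.inv(Q_t)
    R_inv = np.linalg.inv(R)
    S_inv = np.linalg.inv(S)
    # Use (10b) only when Q(x_t)^{-1} - S^{-1} is positive definite
    if np.all(np.linalg.eigvalsh(Q_inv - S_inv) > 0):
        Sigma = np.linalg.inv(Q_inv + R_inv - S_inv)      # (10b)
    else:
        Sigma = np.linalg.inv(Q_inv + R_inv)              # DKF fallback
    mu = Sigma @ (Q_inv @ f_t + R_inv @ (A @ mu_prev))    # (10c)
    return mu, Sigma
```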

In Sect. 1, we considered an optimization scheme that iteratively obtains sub-sampled values for the objective function, its gradient, and Hessian in a small neighborhood around the current parameter value. It then estimates an optimal direction of descent given only the current observations and parameter value. In this section, we showed how a discriminative Gaussian approximation (7) can be used with a latent state model (6) to consider the entire history of observations when performing inference. In the next section, we will apply this discriminative filtering process to forming updates for the stochastic Newton method.

4 Stochastic optimization as a filtering problem

When the batch size \(\left|\mathcal {S}_t\right|\) is small, the stochastic estimates obtained for the gradient (2a) and Hessian (2b) of \(\ell\) may prove to be quite noisy. To remedy this, we now outline a filtering method that incorporates multiple batches’ worth of noisy measurement information to inform its estimate for \(Z_t = \nabla \ell (\theta _{t-1})\). At each step, we let \(X_t\) denote the current parameter value along with the function, gradient, and Hessian of \(\tfrac{1}{\left|\mathcal S_t\right|} \sum _{j\in {\mathcal S_t}} \log g_j\) obtained from the uniform random sample \({\mathcal S_t}\) in a neighborhood of \(\theta _{t-1}\). In order to iteratively update our distributional estimate for \(Z_t\) given all available observations using the discriminative Kalman filter (DKF) as described in the previous section, we must first specify a discriminative measurement model and state model of the required form. After formulating these models, we then describe how to use the resulting filtered estimates in our optimization framework.

4.1 Measurement model

Given some observation \(x_t\) of the random variable \(X_t\), which in this case corresponds to local information for the sub-sampled function \(\tfrac{1}{\left|\mathcal S_t\right|} \sum _{j\in {\mathcal S_t}} \log g_j\) in a neighborhood of \(\theta _{t-1}\), we form a Gaussian approximation for the conditional distribution of \(Z_t\) as

$$\begin{aligned} p(z_t|x_t) \approx \eta _d(z_t; \, f_t, Q_t) \end{aligned}$$
(11)

where the mean \(f_t\) and covariance \(Q_t\) refer to (2a) and (2b), respectively. While other authors have justified similar Gaussian approximations using the sub-sampled gradient via the Central Limit Theorem [45, 46], we stress that we expect \(Q_t \approx \nabla ^2 \ell (\theta _{t-1})\) in the large-sample setting: in particular, we do not intend our covariance estimate \(Q_t\) to tend to zero when \(d=1\) (or toward singularity when \(d>1\)).

If the functions \(g_j(\theta )\) in (1) are themselves probability density functions, so that we seek \(\theta\) that minimizes the observed negative log likelihood

$$\begin{aligned} \log g_j(\theta ) = -\log p(\theta ,\psi _j) \end{aligned}$$
(12)

for \(\psi _1,\psi _2,\dotsc , \psi _n \sim ^\text {i.i.d.} p_{\theta _*}\), where \(p_{\theta _*}\) denotes the underlying distribution and \(\theta _*\) the true parameter in a family \(\{p_\theta \}_{\theta \in \Theta }\) of parametrized distributions, then with \(\ell (\theta ,\psi )\) defined analogously to (1) we have

$$\begin{aligned} {\mathbb {E}}[f_t] &= - \tfrac{1}{\left|\mathcal S_t\right|} {\mathbb {E}}\big [\textstyle \sum _{j\in \mathcal S_t} \nabla _\theta \log p(\theta ,\psi _j) \vert _{\theta =\theta _{t-1}}\big ] \\ &= -{\mathbb {E}}_{\Psi \sim p_{\theta _*}}[\nabla _\theta \log p(\theta , \Psi ) \vert _{\theta =\theta _{t-1}}] = {\mathbb {E}}_{\Psi \sim p_{\theta _*}}[\nabla _\theta \ell (\theta , \Psi ) \vert _{\theta =\theta _{t-1}}] \end{aligned}$$
(13)

so that \(f_t\) from (2a) with the functions \(\log g_j(\theta )\) as specified in (12) is an unbiased Monte Carlo estimate for the expected gradient of the objective. Furthermore, the Fisher information equality implies

$$\begin{aligned} {\mathbb {V}}_{\Psi \sim p_{\theta _*}}\left[ \nabla _\theta \log p(\theta ,\Psi ) \vert _{\theta =\theta _*} \right] &= -{\mathbb {E}}_{\Psi \sim p_{\theta _*}}\left[ \nabla ^2_\theta \log p(\theta ,\Psi ) \vert _{\theta =\theta _*} \right] \\ &= {\mathbb {E}}_{\Psi \sim p_{\theta _*}}\left[ \nabla ^2_\theta \ell (\theta ,\Psi ) \vert _{\theta =\theta _*} \right] \end{aligned}$$
(14)

so that for \(\theta _{t-1}\) near the optimum \(\theta _* =\arg \min _\theta \{\ell (\theta )\}\), we have

$$\begin{aligned} {\mathbb {E}}[Q_t] \approx {\mathbb {V}}_{\Psi \sim p_{\theta _*}}[\nabla _\theta \log p(\theta , \Psi ) \vert _{\theta =\theta _*}] = {\mathbb {V}}_{\Psi \sim p_{\theta _*}}[\nabla _\theta \ell (\theta , \Psi ) \vert _{\theta =\theta _*}] \end{aligned}$$
(15)

and the sub-sampled Hessian \(Q_t\) from (2b) under the specification (12) should form a reasonable approximation to the variance of the gradient. In this case, the step direction \(-Q_t^{-1}f_t\) takes the form of the natural gradient [4, 48].

4.2 State model

We want our latent state estimate to evolve continuously, so we specify the state model

$$\begin{aligned} p(z_t | z_{t-1}) \approx \eta _d(z_t; \, \alpha z_{t-1}, \beta I_d) \end{aligned}$$
(16)

for \(0< \alpha < 1\) and \(0<\beta\), and define \(S = \tfrac{\beta }{1-\alpha ^2}I_d\) where \(I_d\) is the d-dimensional identity matrix. This autoregressive model with a single lag allows the previous gradient estimate to influence the current gradient estimate. In particular, we stipulate a correlation of \(\alpha\) between \(z_t(i)\) and \(z_{t-1}(i)\) where z(i) denotes the i-th coordinate of z.
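Indeed, with \(A = \alpha I_d\) and \(\Gamma = \beta I_d\), the stationarity condition \(S = ASA^\intercal + \Gamma\) from Sect. 3 reads

$$\begin{aligned} S = \alpha ^2 S + \beta I_d \quad \Longrightarrow \quad S = \tfrac{\beta }{1-\alpha ^2} I_d, \end{aligned}$$

which is well defined precisely because \(0<\alpha <1\).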

4.3 Resulting estimates and filtered optimization scheme

We now filter the state-space model described above to obtain iterative estimates for the posterior distribution. Starting with the previous approximation for the optimal descent direction given all available observations,

$$\begin{aligned} p(z_{t-1} | x_{1:t-1}) \approx \eta _d(z_{t-1}; \mu _{t-1}, \Sigma _{t-1}), \end{aligned}$$

we may apply the DKF to recursively approximate the next posterior \(p(z_{t} | x_{1:t}) \approx \eta _d(z_t; \mu _{t}, \Sigma _{t})\) as Gaussian under (11) and (16), where

$$\begin{aligned} \Sigma _t&= ( Q_t^{-1} + (\alpha ^2 \Sigma _{t-1} + \beta I_d)^{-1} - S^{-1} )^{-1}, \end{aligned}$$
(17a)
$$\begin{aligned} \mu _t&= \Sigma _t(Q_t^{-1} f_t + (\alpha ^2 \Sigma _{t-1} + \beta I_d)^{-1} \alpha \mu _{t-1}), \end{aligned}$$
(17b)

if \(Q_t^{-1}- S^{-1}\) is positive definite; otherwise

$$\begin{aligned} \Sigma _t = (Q_t^{-1} + (\alpha ^2 \Sigma _{t-1} + \beta I_d)^{-1})^{-1}. \end{aligned}$$

This recursive approximation inspires a novel optimization scheme similar in nature to the standard stochastic Newton method introduced in Sect. 1, where we replace the unfiltered estimates \(f_t\) and \(Q_t\) with our filtered estimates \(\mu _t\) and \(\Sigma _t\), respectively, at each update step. Given the same problem (1) and initialization, at each step \(t\ge 1\), we now take the search direction \(-\Sigma _t^{-1} \mu _t\). We then perform an Armijo-style backtracking line search using \(\tfrac{1}{\left|\mathcal S_t\right|} \sum _{j\in {\mathcal S_t}} \log g_j\). See Algorithm 2 for pseudo-code and complete details.
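Under the same illustrative conventions as the earlier sketches, one filtered step might read as follows; it composes the dkf_update sketch from Sect. 3 (with \(A=\alpha I_d\), \(\Gamma =\beta I_d\)) with a backtracking search, and the bounded loop and Armijo constants are our assumptions rather than details of Algorithm 2.

```python
import numpy as np

def filtered_newton_step(theta, mu_prev, Sigma_prev, batch,
                         log_g, grad_log_g, hess_log_g,
                         alpha, beta, c=1e-4, tau=0.5):
    """One filtered step: DKF update (17a)-(17b), then a backtracked move
    along the filtered direction -Sigma_t^{-1} mu_t.
    Assumes dkf_update from the Sect. 3 sketch is in scope."""
    d = theta.shape[0]
    f = np.mean([grad_log_g(theta, j) for j in batch], axis=0)   # (2a)
    Q = np.mean([hess_log_g(theta, j) for j in batch], axis=0)   # (2b)
    A, Gamma = alpha * np.eye(d), beta * np.eye(d)
    S = beta / (1.0 - alpha**2) * np.eye(d)   # stationary covariance
    mu, Sigma = dkf_update(mu_prev, Sigma_prev, f, Q, A, Gamma, S)
    step = np.linalg.solve(Sigma, mu)         # filtered direction is -step
    obj = lambda th: np.mean([log_g(th, j) for j in batch])
    lam, base = 1.0, obj(theta)
    for _ in range(30):                       # bounded backtracking (an assumption)
        if obj(theta - lam * step) <= base - c * lam * (f @ step):
            break
        lam *= tau
    return theta - lam * step, mu, Sigma
```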

Calculating the posterior \(p(z_t | x_{1:t})\) requires only minimal additional computational and storage costs in comparison to the standard stochastic Newton method. We introduce two hyperparameters, \(\alpha\) and \(\beta\), to control the influence of previous observations. Intuitively, the impact of previous updates should fade over time as our current estimate moves further away from the parameter values associated with the previously-subsampled gradients and Hessians. In the next section, we will make this intuition more precise by outlining conditions on the hyperparameters under which the impact of previous updates decays exponentially.

5 The connection with momentum

We would like to view our updates as analogous to Polyak’s heavy ball momentum [61, 67]. In the context of optimization, momentum allows previous update directions to influence the current update direction, typically in the form of an exponentially-decaying average. This section explores how our filtered approach to optimization results in momentum-like behavior for the step direction.

To this end, we remark that from (17b) we have the recursion

$$\begin{aligned} \Sigma _t^{-1} \mu _t = Q_t^{-1} f_t + M_t \Sigma _{t-1}^{-1} \mu _{t-1}, \end{aligned}$$
(18)

so that the current step direction is the sum of the current Newton update and \(M_t\) times the previous step direction, where we define

$$\begin{aligned} M_t = \alpha (\alpha ^2 \Sigma _{t-1} + \beta I_d)^{-1} \Sigma _{t-1} \end{aligned}$$
(19)

for \(t\ge 2\). In the standard formulation of momentum, a scalar \(0<m<1\) or diagonal matrix commonly takes the place of \(M_t\), so that momentum acts in a coordinate-wise manner. In contrast, our matrix \(M_t\) generally contains off-diagonal elements. To view our updates in the context of momentum, we need to establish matrix-based conditions for \(M_t\) to dampen the impact of previous estimates over time.

For any positive-definite, Hermitian matrix \(M\in {\mathbb {R}}^{d\!\times \!d}\), let \(\lambda _{\min }(M)\) and \(\lambda _{\max }(M)\) denote its smallest and largest eigenvalues, respectively. With this notation, \(\rho (M) = \lambda _{\max }(M)\) corresponds to the spectral norm (as all eigenvalues are positive), and we have

Proposition 1

Suppose there exist \(0< \Lambda _1\le \Lambda _d\) such that \(\Lambda _1\le \lambda _{\min }(\Sigma _t)\) and \(\lambda _{\max }(\Sigma _t)\le \Lambda _d\) for all t. If \(0< \alpha < 1\) and \(0<\beta\) are chosen to satisfy \(\alpha \Lambda _d< \alpha ^2\Lambda _1 + \beta\), then \(\rho (M_t) < 1\) for all t.

Proof

As the spectral norm is sub-multiplicative and \(\lambda _{\max }(M^{-1}) = 1/\lambda _{\min }(M)\) for positive-definite matrices, we have from (19) that

$$\begin{aligned} \rho (M_t) \le \alpha \cdot \rho \big ((\alpha ^2 \Sigma _{t-1} + \beta I_d)^{-1}\big ) \cdot \rho (\Sigma _{t-1}) \le \alpha \Lambda _d / \lambda _{\min }(\alpha ^2 \Sigma _{t-1} + \beta I_d) \end{aligned}$$

where Weyl’s inequality [32, thm. 4.3.1] implies

$$\begin{aligned} \lambda _{\min }(\alpha ^2 \Sigma _{t-1} + \beta I_d) \ge \lambda _{\min }(\alpha ^2 \Sigma _{t-1}) + \lambda _{\min }(\beta I_d) \ge \alpha ^2 \Lambda _1 + \beta . \end{aligned}$$

Combining the above two inequalities allows us to deduce

$$\begin{aligned} \rho (M_t) \le \frac{\alpha \Lambda _d}{\alpha ^2\Lambda _1 + \beta } < 1 \end{aligned}$$
(20)

and conclude. \(\square\)

We may reformulate the recursion (18) with initialization \(\Sigma _1^{-1} \mu _1= Q_1^{-1} f_1\) as

$$\begin{aligned} \Sigma _t^{-1} \mu _t = \textstyle \sum _{i = 1}^t \big ( \prod _{k= i+1}^t M_k \big )Q_i^{-1} f_i. \end{aligned}$$
(21)

Under the conditions of the proposition, for each \(i\ge 1\), we have from (20) that

$$\begin{aligned} \rho ( \textstyle \prod _{k= i+1}^t M_k) \le \textstyle \prod _{k= i+1}^t \rho ( M_k) \le \big ( \tfrac{\alpha \Lambda _d}{\alpha ^2\Lambda _1 + \beta } \big )^{t-i} \rightarrow 0 \end{aligned}$$

as \(t\rightarrow \infty\), where

$$\begin{aligned} \left\Vert \big ( \textstyle \prod _{k= i+1}^t M_k \big )Q_i^{-1} f_i\right\Vert _2 \le \rho \big (\textstyle \prod _{k= i+1}^t M_k \big ) \left\Vert Q_i^{-1} f_i\right\Vert _2 \end{aligned}$$

so that the impact of older updates exponentially decays over time, as one would expect from momentum.

We also note that, if \(0<\Lambda _1 \le \Lambda _d\) exist, then \(0< \alpha < 1\) and \(0<\beta\) may always be chosen to satisfy \(\alpha \Lambda _d< \alpha ^2\Lambda _1 + \beta\). For example, we may let \(\alpha =1/2\) and \(\beta =\Lambda _d\).
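This bound is easy to check numerically; the sketch below draws a random covariance matrix with spectrum inside \([\Lambda _1, \Lambda _d]\) and verifies (20) for the suggested choice \(\alpha = 1/2\), \(\beta = \Lambda _d\) (the dimension, seed, and spectrum endpoints are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
d, Lam1, Lamd = 5, 0.5, 2.0
alpha, beta = 0.5, Lamd                       # satisfies alpha*Lamd < alpha^2*Lam1 + beta

# Random symmetric Sigma with eigenvalues in [Lam1, Lamd]
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = U @ np.diag(rng.uniform(Lam1, Lamd, d)) @ U.T

M = alpha * np.linalg.inv(alpha**2 * Sigma + beta * np.eye(d)) @ Sigma   # (19)
rho = np.max(np.abs(np.linalg.eigvals(M)))    # spectral radius of M_t
bound = alpha * Lamd / (alpha**2 * Lam1 + beta)
print(rho, "<=", bound, "<", 1.0)             # expect both comparisons to hold
assert rho <= bound + 1e-12 and bound < 1.0
```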

[Algorithm 2 (pseudocode figure): the filtered stochastic Newton method]

6 Illustrated example: online linear regression

In this section, we compare the filtered method as described in Algorithm 2 to the standard, unfiltered method as described in Sect. 1 on a simple optimization problem of the form (1) that we now describe.

In maximum likelihood estimation, minimizing the negative log-likelihood for a set of i.i.d. samples from a log-concave distribution produces an average of functions, each convex in the parameter of interest. We consider the problem of estimating a vector of coefficients for a discriminative linear regression model; i.e., given data \(y_j \in {\mathbb {R}}\) and \(x_j \in {\mathbb {R}}^{d}\), \(1\le j \le n\), we suppose

$$\begin{aligned} p_\theta (y_1,\dotsc ,y_n | x_1,\dotsc ,x_n) = \textstyle \prod _{j=1}^n \eta (y_j; \theta ^\intercal x_j, 1), \end{aligned}$$
(22)

where \(\theta \in {\mathbb {R}}^{d}\) denotes a column vector of parameters. We aim to minimize the negative log likelihood, which can be written up to additive and multiplicative constants as

$$\begin{aligned} \ell (\theta ) = \tfrac{1}{n}\textstyle \sum _{j=1}^n \log g_j(\theta ), \text { where } \log g_j(\theta ) = (y_j - \theta ^\intercal x_j )^2/2 \end{aligned}$$
(23)

so that \(\nabla \log g_j(\theta ) = (x_j x_j^\intercal )\theta -x_j y_j\) and \(\nabla ^2 \log g_j(\theta )=x_j x_j^\intercal\).
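For this objective, the batch estimates (2a) and (2b) admit simple vectorized expressions; in the minimal sketch below, the rows of X are the sampled \(x_j^\intercal\) and y stacks the corresponding \(y_j\).

```python
import numpy as np

def batch_grad_hess(theta, X, y):
    """Sub-sampled gradient (2a) and Hessian (2b) for the quadratic
    losses log g_j(theta) = (y_j - theta^T x_j)^2 / 2 of eq. (23)."""
    resid = X @ theta - y                     # theta^T x_j - y_j, one entry per row
    f = X.T @ resid / len(y)                  # mean of (x_j x_j^T) theta - x_j y_j
    Q = X.T @ X / len(y)                      # mean of x_j x_j^T
    return f, Q
```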

6.1 Methodology

We performed a computer simulation with \(n=100\) and \(d=2\). For \(1\le j \le n\), we sampled \(x_j \sim ^\text {i.i.d.} {\mathcal {N}}\big (({\begin{smallmatrix} 0 \\ 0 \end{smallmatrix}}),({\begin{smallmatrix} 1.0 &{} 0.1 \\ 0.1 &{} 1.0 \end{smallmatrix}}) \big )\) and \(\epsilon _j \sim ^\text {i.i.d.} {\mathcal {N}}(1,1)\), where \({\mathcal {N}}(m,V)\) denotes a Gaussian random variable with mean m and covariance V. We set \(y_j= \theta ^\intercal x_j + \epsilon _j\) for each j. In this way, the conditional distribution of the \(y_j\) respects (22), and the minimizer of (23) corresponds to the maximum likelihood estimate (MLE) for the parameter \(\theta\). Due to our choice of small n, the true optimum \(\theta _*= \arg \min _\theta \ell (\theta )\) can be calculated exactly and used to help evaluate performance. In practical applications of stochastic Newton, we would generally expect n to be much larger.
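A minimal sketch of this data-generating process, including the exact optimum of (23) via the normal equations; the true coefficient vector theta_true is our assumption, as the text leaves it unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2
theta_true = np.array([1.0, -1.0])            # assumed; not specified in the text
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.1], [0.1, 1.0]], size=n)
eps = rng.normal(1.0, 1.0, size=n)            # N(1, 1) noise, as in the text
y = X @ theta_true + eps

# Exact minimizer of (23): solve the normal equations (X^T X) theta = X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
```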

For the purposes of comparison, we ran 1000 independent paired trials starting from the same initialization. The trials performed 30 optimization steps for each method. Within each trial, the two methods received the same 5 indices \({\mathcal S_t}\), sampled uniformly at random with replacement from \(\{1,\dotsc , n\}\) at each step. These methods then garnered gradient and Hessian information from the same subsampled function at their respective current parameter values to form each subsequent update. We selected \(\alpha =0.9\) and \(\beta =0.2\) for the filtered algorithm, but note that we generally expect these parameters to be problem-dependent.

We performed our comparisons on a 2020 MacBook Pro (Apple M1 Chip; 16 GB LPDDR4 Memory) using Python (v.3.10.2) and its Numpy package (v.1.22.3). We include code to reproduce the results and figures that follow as part of our supplementary material.

6.2 Results

We plot the evolution of three randomly selected paths for both methods in Fig. 1 and present a graphical summary of the aggregate results of 1000 independent trials in Fig. 2. We note that the filtered method tends to reach a neighborhood of the optimum in around 10 steps, while the unfiltered method commonly takes 30 steps or more (see Fig. 2a).

Fig. 1 We plot three trajectories for the unfiltered and filtered methods (acting on the same samples) starting from the grey dot, and heading towards the global optimum (the red triangle) for the full objective function \(\ell\) in (1)

Prior to reaching a neighborhood of the optimum (where, according to Fig. 2b, the function \(\ell\) seems to flatten out), the filtered estimates appear smoother than the unfiltered ones (see Fig. 1). We make this observation numerically precise by considering the signed angular difference (in radians) between the optimal descent direction and the calculated step direction before and after the filtering process. We record the mean square of this angular error (MSE) in Table 1 for the crucial first few iterations, where both methods are taking their largest steps, and find that filtering helps to reduce MSE appreciably. As the squared bias tends to be small (\(\le 0.010\) for both methods over the first 5 steps), we see a corresponding reduction in the variance of the error as well. As discussed in Sect. 3, this reduction in error was one of the original motivations for applying filtering.

We note that filtering allows paths to accelerate early on in their trajectory (see Fig. 2c) and reach a neighborhood of the optimum well before the unfiltered method (see Fig. 2a). Additionally, we monitored \(\rho (M_t)\), as discussed in the previous section, and found that for \(t>5\), \(\rho (M_t)<0.8\) for all 1000 trajectories. Consequently, we observe the exponential decay of the coefficients for each \(Q_i^{-1}f_i\) as written in (21).

Fig. 2 Upon sampling 1000 trajectories for both the filtered and unfiltered methods starting with the same initialization and receiving the same randomness, we plot the average values ± 2 standard deviations for a the Euclidean distance between the estimate and optimum, b the value of the entire (non-sampled) function \(\ell\) at the current estimate, and c the distance between the current and previous estimate, all versus the step number

Table 1 We report the mean square angular error (over the 1000 trials) for both methods during the first 5 steps. Here, the unfiltered step direction is taken to be \(-Q_t^{-1}f_t\) for \(f_t\) and \(Q_t\) evaluated at the current filtered estimate. As both methods implement line search to select step length, we believe angular error (in radians) may prove more pertinent to successful optimization than other, magnitude-influenced distances. Note that both estimates coincide at step 1

6.3 Exponential families and generalized linear models

We now consider how this section’s linear regression example may be generalized. To this end, we introduce the exponential family [23, 39, 60], consisting of probability distributions of the form

$$\begin{aligned} p(x|\theta ) = h(x) \exp ( \langle \theta , T(x) \rangle - A(\theta )) \end{aligned}$$
(24)

for \(x \in {\mathbb {R}}^\kappa\), natural parameter \(\theta \in {\mathbb {R}}^d\), sufficient statistic \(T: {\mathbb {R}}^\kappa \rightarrow {\mathbb {R}}^d\), log-normalizer \(A: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), and non-negative \(h:{\mathbb {R}}^\kappa \rightarrow {\mathbb {R}}\). (Note that our notation differs from that of most texts: authors typically let \(\eta\) denote the natural parameter, but we use \(\theta\) to maintain the notation from previous sections.) Given i.i.d. samples \(x_1,\dotsc ,x_n\) from such a distribution, the MLE for \(\theta\) can be characterized as a solution to (1) using the negative log likelihood, where

$$\begin{aligned} \log g_j(\theta ) := -\log p(x_j|\theta ) = A(\theta ) -\log h(x_j) - \langle \theta , T(x_j) \rangle \end{aligned}$$

with

$$\begin{aligned} \nabla \log g_j(\theta ) =\nabla _\theta A(\theta ) - T(x_j) ={\mathbb {E}}_{Y\sim p_\theta } [T(Y)] - T(x_j) \end{aligned}$$
(25)

and

$$\begin{aligned} \nabla ^2_\theta \log g_j(\theta ) = \nabla ^2_\theta A(\theta ) ={\mathbb {V}}_{Y\sim p_\theta } [T(Y)]. \end{aligned}$$
(26)

In particular, at each optimization step, the gradient and Hessian of each \(\log g_j\) at \(\theta _{t-1}\) will always be the expectation and variance, respectively, of \(T(Y) - T(x_j)\) where \(Y\sim p_{\theta _{t-1}}\).
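For instance (an illustration of ours, not drawn from the text), the Poisson family with rate \(e^\theta\) has \(h(x)=1/x!\), \(T(x)=x\), and \(A(\theta )=e^\theta\), so (25) and (26) reduce to

$$\begin{aligned} \nabla \log g_j(\theta ) = e^\theta - x_j \quad \text {and} \quad \nabla ^2 \log g_j(\theta ) = e^\theta , \end{aligned}$$

the mean and variance of a Poisson random variable with rate \(e^\theta\).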

Generalized linear models [53] with canonical response functions model conditional distributions using the exponential family. For \(y_j \in {\mathbb {R}}\) and \(x_j \in {\mathbb {R}}^{d}\), \(1\le j \le n\), and \(\theta \in {\mathbb {R}}^{d}\) we write

$$\begin{aligned} p(y_j | x_j, \theta ) =h(y_j) \exp ( \langle \eta _j, T(y_j) \rangle - A(\eta _j)), \qquad \text {where } \eta _j = \theta ^\intercal x_j, \end{aligned}$$
(27)

so that the MLE for \(\theta\) again solves (1) with

$$\begin{aligned} \log g_j(\theta ) := -\log p(y_j | x_j, \theta ) = A(\theta ^\intercal x_j) -\log h(y_j) - \langle \theta ^\intercal x_j, T(y_j) \rangle . \end{aligned}$$
(28)

Applying the chain rule to (25) and (26) then yields \(\nabla _\theta \log g_j(\theta ) =x_j({\mathbb {E}}[T(Y) | x_j, \theta ] - T(y_j))\) and \(\nabla ^2_\theta \log g_j(\theta )=(x_j x_j^\intercal ){\mathbb {V}}_{Y\sim p_\theta } [T(Y)]\). Thus, our algorithm may readily be applied to find the MLE for models of the above form, with a slight perturbation to the Hessian to ensure positive definiteness.
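As a hedged illustration (our example, not the text's), Poisson regression with the canonical log link has \(T(y)=y\) and \(A(\eta )=e^\eta\), so \({\mathbb {E}}[T(Y)|x_j,\theta ]={\mathbb {V}}[T(Y)|x_j,\theta ]=e^{\theta ^\intercal x_j}\) and the batch quantities become:

```python
import numpy as np

def poisson_batch_grad_hess(theta, X, y, ridge=1e-6):
    """Sub-sampled gradient and Hessian for Poisson regression with the
    canonical link: grad log g_j = x_j (exp(theta^T x_j) - y_j) and
    hess log g_j = (x_j x_j^T) exp(theta^T x_j), averaged over the batch."""
    m, d = X.shape
    rate = np.exp(X @ theta)                  # E[T(Y) | x_j, theta], one entry per row
    f = X.T @ (rate - y) / m
    Q = (X * rate[:, None]).T @ X / m         # average of rate_j * x_j x_j^T
    Q += ridge * np.eye(d)                    # the "slight perturbation" for positive
    return f, Q                               # definiteness; its magnitude is assumed
```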

For a more standard presentation using the overdispersed exponential family, see McCullagh and Nelder [50].

7 Conclusions and future directions of research

The stochastic Newton algorithm uses subsampled gradients and Hessians to iteratively approximate an optimal step direction for batch-based optimization. When the batch size is small, the errors of these subsampled estimates may hinder progress towards the minimum. In this work, we applied a Bayesian filtering method with a discriminative observation model to filter the sequences of gradients and Hessians. We established conditions for the resulting optimization algorithm to behave similarly to Polyak’s momentum, allowing the impact of older updates to fade over time. We illustrated how our method improves performance on a simple example and discussed how the algorithm can be applied more generally to inference for the exponential family.

In the future, we would like to consider possible solutions to two main drawbacks of our approach as currently formulated. First, in many practical applications, the high dimensionality of the parameter \(\theta\) causes maintaining and inverting the Hessian matrix to be prohibitively expensive. Hessian-free methods and the large body of research on quasi-Newton methods [15] may offer some help here. Second, from a theoretical perspective, our method would benefit from algorithm termination conditions and associated convergence results. The results of Roosta-Khorasani and Mahoney [65, thm. 4] and Bollapragada, Byrd, and Nocedal [13, thm. 2.2] are most germane to our work, but further modifications would be necessary.

We believe that stochastic optimization provides a natural setting for sequential Bayesian inference and anticipate further advances in this direction.