Discriminative Bayesian filtering lends momentum to the stochastic Newton method for minimizing log-convex functions

To minimize the average of a set of log-convex functions, the stochastic Newton method iteratively updates its estimate using subsampled versions of the full objective's gradient and Hessian. We contextualize this optimization problem as sequential Bayesian inference on a latent state-space model with a discriminatively-specified observation process. Applying Bayesian filtering then yields a novel optimization algorithm that considers the entire history of gradients and Hessians when forming an update. We establish matrix-based conditions under which the effect of older observations diminishes over time, in a manner analogous to Polyak's heavy ball momentum. We illustrate various aspects of our approach with an example and review other relevant innovations for the stochastic Newton method.


Optimization scheme
In machine learning and data science, we often encounter problems of the form: where each log-convex function g j ∈ C 2 (ℝ d ) corresponds to the loss accrued by an observation or sample at parameter value for 1 ≤ j ≤ n . Examples include multiple linear regression and maximum likelihood estimation for the exponential family (see (1) Sect. 6 for details). In this paper, we examine an online learning regime where data arrives asynchronously in a stream or n ≫ 1 is sufficiently large that samples must be processed in batches. We consider a sub-sampled Newton method that leverages stochastic estimates for both the gradient and Hessian [19]. At each step t ≥ 1 , we begin with the previous parameter estimate t−1 , obtain a uniform random sample S t ⊂ {1, … , n} , and calculate where ∇ log g j = ∇g j ∕g j and ∇ 2 log g j = (g j ∇ 2 g j − ∇g j (∇g j ) ⊺ )∕g 2 j denote the gradient and positive-definite Hessian of log g j , respectively. In this way, we form a step direction −Q −1 t f t using only information available from the current batch S t . For modern applications, computer hardware limitations often constrain the batch size | | S t | | to be much less than n. Given the descent direction −Q −1 t f t , we perform an Armijo-style [5] backtracking line search (see Algorithm 1 for particulars) using the function 1 �St� ∑ j∈S t log g j to determine a good step size 0 < t < 1 prior to updating Proceeding in this way, each optimization step performs a Newton update on a subsampled surrogate of the true objective.
Given some initialization 0 , this method produces a sequence of estimates 1 , 2 , … that under certain conditions tends towards the solution to problem (1). For a precise analysis of this second-order approach to stochastic optimization (in the less restrictive setting that the functions g j are convex), see Roosta-Khorasani and Mahoney [65] and Bollapragada, Byrd, and Nocedal [13]. Thesis and outline. Exchanging the full objective function for subsampled versions of it offers computational and practical benefits, but incurs a cost in terms of the reliability of the computed updates. In particular, the sub-sampled estimates (2a) and (2b) may prove quite noisy, hindering progress towards the optimum. This paper adapts a Bayesian filtering strategy in an attempt to mitigate this issue. We begin with a discussion of related work in the next section and then introduce discriminative Bayesian filtering in Sect. 3. We recast the optimization process described in this section as a discriminative filtering problem in Sect. 4, leading to an algorithm that calculates a step direction using the entire history of sub-sampled gradients and Hessians. In Sect. 5, we establish technical conditions under which the proposed algorithm behaves similarly to Polyak's momentum. In Sect. 6, we compare the standard approach outlined in this section to our proposed, filtered method using an online linear regression problem with synthetic data, before drawing conclusions in Sect. 7. (2a)

3
Discriminative Bayesian filtering lends momentum to the…

Related work
In this section, we provide a brief overview of related work, separated thematically into paragraphs. Filtering methods have previously been applied to stochastic optimization problems, with notable success. Houlsby and Blei [33] characterized online stochastic variational inference [31] as a (non-discriminative) filtering problem using the standard Kalman filter where the covariance matrix was restricted to be isotropic and demonstrated promising results training both latent Dirichlet allocation [11,30,63] and Bayesian matrix factorization models [29]. For least squares problems, Bertsekas [9] demonstrated how the extended Kalman filter could be applied to form batch-based updates. More recently, Akyıldız [3] and Liu [42] developed filtered versions of the incremental proximal method [10]. In a more general setting, Stinis [70] phrased stochastic optimization as a filtering problem and proposed particle filter-based inference [77,78].
While momentum and momentum-like approaches have been thoroughly explored for stochastic problems in general [14,24,35,62,66,67,69,71] and for the stochastic Newton method when restricted to solving linear systems [43], momentum for more general cases of stochastic Newton has received comparatively little attention.
As the parameter space becomes high-dimensional, the computational costs for inverting the Hessian matrix grow cubically. Hessian Free approaches entirely circumvent the construction and subsequent inversion of the Hessian [47,52,75] by directly computing matrix-vector products using the conjugate gradient method or the Pearlmutter trick [58]. Berahas, Bollapragada, and Nocedal [7] explore sketching [1,44,55,56,59] as an alternative to sub-sampling.
Backtracking line search plays an important role in Roosta-Khorasani and Mahoney's [65] convergence results and inspired the use of line search in this work. In contrast to the more traditional stochastic approximation results that stipulate ∑ ∞ t=1 a t = ∞ and ∑ ∞ t=1 a 2 t < ∞ where a t > 0 are step sizes [64], many variants of stochastic Newton use either line search or fixed step lengths. In the stochastic setting, line search remains an area of active research [8,45,57,73].
Other recent innovations for the stochastic Newton method include non-uniform [76] and adaptive sampling strategies [12,26] for the batches S t , low-rank approximation for the sub-sampled Hessians [25], and alternate formulations for the inverse Hessian [2]. Any of these approaches could be applied to the method we develop in the remainder of this paper.

3 3 Discriminative Bayesian filtering
Consider a state-space model relating a sequence Z 1∶t = Z 1 , Z 2 , … , Z t of latent random variables to a corresponding sequence of observed measurements X 1∶t = X 1 , X 2 , … , X t according to the Bayesian network: At each successive point t in time, filtering aims to infer the current hidden state Z t given all currently available measurements X 1∶t . We find that such an estimate often provides more accurate and more stable performance than an estimate for Z t given only the most recent measurement X t . This is expected, as we know that conditioning reduces entropy [21, thm 2.6.5] and that the law of total variance 1 implies where we use [⋅] and [⋅] to denote expectation and (co)variance, respectively. In particular, conditioning also reduces variance on average.
In Bayesian filtering, inference takes a distributional form. Given a state model p(z t |z t−1 ) that describes the evolution of the latent state and a measurement model p(x t |z t ) that relates the current observation and current latent state, Bayesian filtering methods iteratively infer or approximate the posterior distribution p(z t |x 1∶t ) of the current latent state given all available measurements at the current point in time. To this end, the Chapman-Kolmogorov recursion relates the current and previous posteriors in terms of the state and measurement models, up to a constant depending on the observations alone. The Kalman filter provides a quintessential example, where both the state and measurement models are chosen to be linear and Gaussian [37]. For nonlinear Gaussian models, the extended Kalman filter performs linearization prior to applying the standard Kalman updates. In general, the integrals required to compute (4) prove intractable. Assumed density filters employ variational methods to fit models to a tractable family of distributions [34,40], sigma-point filters such as the unscented Kalman filter apply Discriminative Bayesian filtering lends momentum to the… quadrature [36,51], and particle filters perform Monte Carlo integration [27,28]. For comprehensive surveys of Bayesian filtering, consult Chen [20] and Särkkä [68].
In some cases, it may be easier to calculate or approximate p(z t |x t ) than the typical observation model p(x t |z t ) . In order to use the conditional distribution of latent states given observations for filtering, we may apply Bayes' rule to find that p(x t |z t ) ∝ p(z t |x t )∕p(z t ) up to a constant in x t and re-write (4) as We characterize discriminative filtering frameworks as those that exchange a generative model (in the sense of Ng and Jordan [54]) for the ability to use p(z t |x t ) for inference. Well-known examples include maximum entropy Markov models [49] and conditional random fields [41], with applications including natural language processing (ibid.), gene prediction [22,74], human motion tracking [38,72], and neural modeling [6,16].
In this paper, we focus on the Discriminative Kalman Filter (DKF) [17,18] that specifies both the state and discriminative observation models as Gaussian: where A ∈ ℝ d×d and Γ ∈ d parameterize the Kalman state model for the set d of valid d×d covariance matrices, f ∶ X → ℝ d and Q ∶ X → d parameterize the discriminative model for an abstract space X , and d (⋅; , Σ) denotes the d-dimensional Gaussian density function with mean ∈ ℝ d and covariance Σ ∈ d . With initialization p(z 0 ) = d (z 0 ; , S) where S ∈ d satisfies S = ASA ⊺ + Γ , the unconditioned latent process is stationary. The function f here may be non-linear. If the posterior at time t − 1, is approximately Gaussian then it follows from the model (6, 7) and the recursion (5) that the posterior at time t, can also be approximated as Gaussian, where In fact, this approximation is exact when the matrix Q(x t ) −1 − S −1 is positive definite [18, p. 973]; if this fails to be the case, the DKF specifies in place of (10b). In this way, closed-form updates for the DKF's posterior require only the inversion and multiplication of d×d matrices, upon evaluation of the functions f and Q.
In Sect. 1, we considered an optimization scheme that iteratively obtains subsampled values for the objective function, its gradient, and Hessian in a small neighborhood around the current parameter value. It then estimates an optimal direction of descent given only the current observations and parameter value. In this section, we showed how a discriminative Gaussian approximation (7) can be used with a latent state model (6) to consider the entire history of observations when performing inference. In the next section, we will apply this discriminative filtering process to forming updates for the stochastic Newton method.

Stochastic optimization as a filtering problem
When the batch size | | S t | | is small, the stochastic estimates obtained for the gradient (2a) and Hessian (2b) of may prove to be quite noisy. To remedy this, we now outline a filtering method that incorporates multiple batches' worth of noisy measurement information to inform its estimate for Z t = ∇ ( t−1 ) . At each step, we let X t denote the current parameter value along with the function, gradient, and Hessian of 1 �St� ∑ j∈S t log g j obtained from the uniform random sample S t in a neighborhood of t−1 . In order to iteratively update our distributional estimate for Z t given all available observations using the discriminative Kalman filter (DKF) as described in the previous section, we must first specify a discriminative measurement model and state model of the required form. After formulating these models, we then describe how to use the resulting filtered estimates in our optimization framework.

Measurement model
Given some observation x t of the random variable X t , which in this case corresponds to local information for the sub-sampled function 1 �St� ∑ j∈S t log g j in a neighborhood of t−1 , we form a Gaussian approximation for the conditional distribution of Z t as where the mean f t and covariance Q t refer to (2a) and (2b), respectively. While other authors have justified similar Gaussian approximations using the sub-sampled gradient via the Central Limit Theorem [45,46], we stress that we expect Q t ≈ ∇ 2 ( t ) in the large-sample setting: in particular, we do not intend our covariance estimate Q t to tend to zero when d = 1 (or toward singularity when d > 1). 2 If the functions g j ( ) in (1) are themselves probability density functions, so that we seek that minimizes the observed negative log likelihood for 1 , 2 , … , n ∼ i.i.d. p * and some underlying distribution p * where * denotes the true parameter in a family {p } ∈Θ of parametrized distributions, then with ( , ) defined analogously to (1) we have so that f t from (2a) with the functions log g j ( ) as specified in (12) is an unbiased Monte Carlo estimate for the expected gradient of the objective. Furthermore, the Fisher information equality implies so that for t−1 near the the optimum * = arg min { ( )} , we have and the sub-sampled Hessian Q t from (2b) under the specification (12) should form a reasonable approximation to the variance of the gradient. In this case, the step direction −Q −1 t f t takes the form of the natural gradient [4,48].

State model
We want our latent state estimate to evolve continuously, so we specify the state model for 0 < < 1 and 0 < , and define S = 1− 2 I d where I d is the d-dimensional identity matrix. This autoregressive model with a single lag allows the previous gradient estimate to influence the current gradient estimate. In particular, we stipulate a correlation of between z t (i) and z t−1 (i) where z(i) denotes the i-th coordinate of z.

Resulting estimates and filtered optimization scheme
We now filter the state-space model described above to obtain iterative estimates for the posterior distribution. Starting with the previous approximation for the optimal descent direction given all available observations, we may apply the DKF to recursively approximate the next posterior p(z t |x 1∶t ) ≈ d (z t ; t , Σ t ) as Gaussian under (11) and (16), where This recursive approximation inspires a novel optimization scheme similar in nature to the standard stochastic Newton method introduced in Sect. 1, where we replace the unfiltered estimates f t and Q t with our filtered estimates t and Σ t , respectively, at each update step. Given the same problem (1) and initialization, at each step t ≥ 1 , we now take the search direction −Σ −1 t t . We then perform an Armijo-style backtracking line search using 1 �St� ∑ j∈S t log g j . See Algorithm 2 for pseudo-code and complete details.
Calculating the posterior p(z t |x 1∶t ) requires only minimal additional computational and storage costs in comparison to the standard stochastic Newton method. We introduce two hyperparameters, and , to control the influence of previous observations. Intuitively, the impact of previous updates should fade over time as our current estimate moves further away from the parameter values associated with the previouslysubsampled gradients and Hessians. In the next section, we will make this intuition more precise by outlining conditions on the hyperparameters under which the impact of previous updates decays exponentially.

The connection with momentum
We would like to view our updates as analogous to Polyak's heavy ball momentum [61,67]. In the context of optimization, momentum allows previous update directions to influence the current update direction, typically in the form of an exponentially-decaying average. This section explores how our filtered approach to optimization results in momentum-like behavior for the step direction.
To this end, we remark that from (17b) we have the recursion so that the current step direction is the sum of the current Newton update and M t times the previous step direction, where we define for t ≥ 2 . In the standard formulation of momentum, a scalar 0 < m < 1 or diagonal matrix commonly takes the place of M t , so that momentum acts in a coordinatewise manner. In contrast, our matrix M t generally contains off-diagonal elements.
To view our updates in the context of momentum, we need to establish matrix-based conditions for M t to dampen the impact of previous estimates over time.
For any positive-definite, Hermitian matrix M ∈ ℝ d×d , let min (M) and max (M) denote its smallest and largest eigenvalues, respectively. With this notation, (M) = max (M) corresponds to the spectral norm (as all eigenvalues are positive), and we have Proof As the spectral norm is sub-multiplicative and max (M −1 ) = 1∕ min (M) for positive-definite matrices, we have from (19) so that the impact of older updates exponentially decays over time, as one would expect from momentum. We also note that, if 0 < Λ 1 ≤ Λ d exist, then 0 < < 1 and 0 < may always be chosen to satisfy Λ d < 2 Λ 1 + . For example, we may let = 1∕2 and = Λ d .

Illustrated example: online linear regression
In this section, we compare the filtered method as described in Algorithm 2 to the standard, unfiltered method as described in Sect. 1 on a simple optimization problem of the form (1) that we now describe. In maximum likelihood estimation, minimizing the negative log-likelihood for a set of i.i.d. samples from a log-concave distribution produces an average of functions, each convex in the parameter of interest. We consider the problem of estimating a vector of coefficients for a discriminative linear regression model; i.e. given data y j ∈ ℝ and x j ∈ ℝ d , 1 ≤ j ≤ n , we suppose where ∈ ℝ d denotes a column vector of parameters. We aim to minimize the negative log likelihood, which can be written up to a multiplicative constant as

3
Discriminative Bayesian filtering lends momentum to the… so that ∇ log g j ( ) = (x j x ⊺ j ) − x j y j and ∇ 2 log g j ( ) = x j x ⊺ j .

Methodology
We performed a computer simulation with n = 100 and d = 2 . For 1 ≤ j ≤ n , we sampled N(1, 1) , where N(m, V) denotes a Gaussian random variable with mean m and covariance V. We set y j = ⊺ x j + j for each j. In this way, the conditional distribution of the y i respects (22), and the minimizer of (23) corresponds to the maximum likelihood estimate (MLE) for the parameter . Due to our choice of small n, the true optimum * = arg min ( ) can be calculated exactly and used to help evaluate performance. In practical applications of stochastic Newton, we would generally expect n to be much larger.
For the purposes of comparison, we ran 1000 independent paired trials starting from the same initialization. The trials performed 30 optimization steps for each method. Within each trial, the two methods received the same 5 indices S t , sampled uniformly at random with replacement from {1, … , n} at each step. These methods then garnered gradient and Hessian information from the same subsampled function at their respective current parameter values to form each subsequent update. We selected = 0.9 and = 0.2 for the filtered algorithm, but note that we generally expect these parameters to be problem-dependent.
We performed our comparisons on a 2020 MacBook Pro (Apple M1 Chip; 16 GB LPDDR4 Memory) using Python (v.3.10.2) and its Numpy package (v. 1.22.3). We include code to reproduce the results and figures that follow as part of our supplementary material.

Results
We plot the evolution of three randomly selected paths for both methods in Fig. 1 and present a graphical summary of the aggregate results of 1000 independent trials in Fig. 2. We note that the filtered method tends to reach a neighborhood of the optimum in around 10 steps, while the unfiltered method commonly takes 30 steps or more (see Fig. 2a).
Prior to reaching a neighborhood of the optimum (where, according to Fig. 2b the function seems to flatten out), the filtered estimates appear to be smoother than those of the unfiltered estimates (see Fig. 1). We make this observation more numerically precise by considering the signed angular difference (in radians) between the optimal descent direction and the calculated step direction before and after the filtering process. We record the mean square of this angular error (MSE) in Table 1 for the crucial first few iterations, where both methods are taking their largest steps, and find that filtering helps to reduce MSE appreciably. As the squared bias tends to be small ( ≤ 0.010 for both methods over the first 5 steps), we see a corresponding reduction in the variance of the error as well. 3 As discussed in Sect. 3, this reduction in error was one of the original motivations for applying filtering. We note that filtering allows paths to accelerate early on in their trajectory (see Fig. 2c) and reach a neighborhood of the optimum well before the unfiltered method (see Fig. 2a). Additionally, we monitored (M t ) , as discussed in the previous section, and found that for t > 5 , (M t ) < 0.8 for all 1000 trajectories. Consequently, we observe the exponential decay of the coefficients for each Q −1 i f i as written in (21).

Exponential families and generalized linear models
We now consider how this section's linear regression example may be generalized.
To this end, we introduce the exponential family [23,39,60], consisting of probability distributions of the form Fig. 1 We plot three trajectories for the unfiltered and filtered methods (acting on the same samples) starting from the grey dot, and heading towards the global optimum (the red triangle) for the full objective function in (1) Fig. 2 Upon sampling 1000 trajectories for both the filtered and unfiltered method starting with the same initialization and receiving the same randomness, we plot the average values ± 2 standard deviations for a the Euclidean distance between the estimate and optimum, b the value of the entire (non-sampled) function at the current estimate, and c the distance between the current and previous estimate, all versus the step number 1 3

(a) (b) (c)
Discriminative Bayesian filtering lends momentum to the… for x ∈ ℝ , natural parameter ∈ ℝ d , sufficient statistic T ∶ ℝ → ℝ d , log-normalizer A ∶ ℝ d → ℝ , and non-negative h ∶ ℝ → ℝ . (Note that our notation differs from that of most texts: authors typically let denote the natural parameter, but we use to maintain the notation from previous sections.) Given i.i.d. samples x 1 , … , x n from such a distribution, the MLE for can be characterized as a solution to (1) using the negative log likelihood, where with and In particular, at each optimization step, the gradient and Hessian of each g j ( t−1 ) will always be the expectation and variance, respectively, of . Generalized linear models [53] with canonical response functions model conditional distributions using the exponential family. For y j ∈ ℝ and x j ∈ ℝ d , 1 ≤ i ≤ n , and ∈ ℝ d we write so that the MLE for again solves (1) with Applying the chain rule to (25) and (26) . Thus, our algorithm may readily be applied to find the MLE for models of the above form, with a slight perturbation to the Hessian to ensure positive definiteness.
For a more standard presentation using the overdispersed exponential family, see McCullagh and Nelder [50].
(27) p(y j �x j , ) = h(y j ) exp(⟨ j , T(y j )⟩ − A( j )), where j = ⊺ x j , (28) log g j ( ) ∶= − log p(y j �x j , ) = A( ⊺ x j ) − log h(y j ) − ⟨ ⊺ x j , T(y j )⟩. Table 1 We report the mean square angular error (over the 1000 trials) for both methods during the first 5 steps. Here, the unfiltered step direction is taken to be −Q −1 t f t for f t and Q t evaluated at the current filtered estimate. As both methods implement line search to select step length, we believe angular error (in radians) may prove more pertinent to successful optimization than other, magnitude-influenced distances. Note that both estimates coincide at step 1

Conclusions and future directions of research
The stochastic Newton algorithm uses subsampled gradients and Hessians to iteratively approximate an optimal step direction for batch-based optimization. When the batch size is small, the errors of these subsampled estimates may hinder progress towards the minimum. In this work, we applied a Bayesian filtering method with a discriminative observation model to filter the sequences of gradients and Hessians. We established conditions for the resulting optimization algorithm to behave similarly to Polyak's momentum, allowing the impact of older updates to fade over time.
We illustrated how our method improves performance on a simple example and discussed how the algorithm can be applied more generally to inference for the exponential family.
In the future, we would like to consider possible solutions to two main drawbacks of our approach as currently formulated. In many practical applications, the high dimensionality of the parameter causes maintaining and inverting the Hessian matrix to be prohibitively expensive. Hessian free methods and the large body of research on quasi-Newton methods [15] may offer some help here. Secondly, from a theoretical perspective, our method would benefit from algorithm termination conditions and associated convergence results. The results of Roosta-Khorasani and Mahoney [65, thm. 4] and Bollapragada, Byrd, and Nocedal [13, thm. 2.2] are most germane to our work, but further modifications would be necessary.
We believe that stochastic optimization provides a natural setting for sequential Bayesian inference and anticipate further advances in this direction.