Statistics and Computing, Volume 24, Issue 3, pp 339–349

Split Hamiltonian Monte Carlo

Authors

  • Babak Shahbaba
    • Department of Statistics and Department of Computer Science, University of California
  • Shiwei Lan
    • Department of Statistics, University of California
  • Wesley O. Johnson
    • Department of Statistics, University of California
  • Radford M. Neal
    • Department of Statistics and Department of Computer Science, University of Toronto

DOI: 10.1007/s11222-012-9373-1

Cite this article as:
Shahbaba, B., Lan, S., Johnson, W.O. et al. Stat Comput (2014) 24: 339. doi:10.1007/s11222-012-9373-1

Abstract

We show how the Hamiltonian Monte Carlo algorithm can sometimes be speeded up by “splitting” the Hamiltonian in a way that allows much of the movement around the state space to be done at low computational cost. One context where this is possible is when the log density of the distribution of interest (the potential energy function) can be written as the log of a Gaussian density, which is a quadratic function, plus a slowly-varying function. Hamiltonian dynamics for quadratic energy functions can be analytically solved. With the splitting technique, only the slowly-varying part of the energy needs to be handled numerically, and this can be done with a larger stepsize (and hence fewer steps) than would be necessary with a direct simulation of the dynamics. Another context where splitting helps is when the most important terms of the potential energy function and its gradient can be evaluated quickly, with only a slowly-varying part requiring costly computations. With splitting, the quick portion can be handled with a small stepsize, while the costly portion uses a larger stepsize. We show that both of these splitting approaches can reduce the computational cost of sampling from the posterior distribution for a logistic regression model, using either a Gaussian approximation centered on the posterior mode, or a Hamiltonian split into a term that depends on only a small number of critical cases, and another term that involves the larger number of cases whose influence on the posterior distribution is small.

Keywords

Markov chain Monte Carlo · Hamiltonian dynamics · Bayesian analysis

1 Introduction

The simple Metropolis algorithm (Metropolis et al. 1953) is often effective at exploring low-dimensional distributions, but it can be very inefficient for complex, high-dimensional distributions—successive states may exhibit high autocorrelation, due to the random walk nature of the movement. Faster exploration can be obtained using Hamiltonian Monte Carlo, which was first introduced by Duane et al. (1987), who called it “hybrid Monte Carlo”, and which has been recently reviewed by Neal (2010). Hamiltonian Monte Carlo (HMC) reduces the random walk behavior of Metropolis by proposing states that are distant from the current state, but nevertheless have a high probability of acceptance. These distant proposals are found by numerically simulating Hamiltonian dynamics for some specified amount of fictitious time.

For this simulation to be reasonably accurate (as required for a high acceptance probability), the stepsize used must be suitably small. This stepsize determines the number of steps needed to produce the proposed new state. Since each step of this simulation requires a costly evaluation of the gradient of the log density, the stepsize is the main determinant of computational cost.

In this paper, we show how the technique of “splitting” the Hamiltonian (Leimkuhler and Reich 2004; Neal 2010) can be used to reduce the computational cost of producing proposals for Hamiltonian Monte Carlo. In our approach, splitting separates the Hamiltonian, and consequently the simulation of the dynamics, into two parts. We discuss two contexts in which one of these parts can capture most of the rapid variation in the energy function, but is computationally cheap. Simulating the other, slowly-varying, part requires costly steps, but can use a large stepsize. The result is that fewer costly gradient evaluations are needed to produce a distant proposal. We illustrate these splitting methods using logistic regression models. Computer programs for our methods are publicly available from http://www.ics.uci.edu/~babaks/Site/Codes.html.

Before discussing the splitting technique, we provide a brief overview of HMC. See Neal (2010) for an extended review of HMC. To begin, we briefly discuss a physical interpretation of Hamiltonian dynamics. Consider a frictionless puck that slides on a surface of varying height. The state space of this dynamical system consists of its position, denoted by the vector q, and its momentum (mass, m, times velocity, v), denoted by a vector p. Based on q and p, we define the potential energy, U(q), and the kinetic energy, K(p), of the puck. U(q) is proportional to the height of the surface at position q. The kinetic energy is \(m|v|^{2}/2\), so \(K(p) = |p|^{2}/(2m)\). As the puck moves on an upward slope, its potential energy increases while its kinetic energy decreases, until the kinetic energy becomes zero. At that point, the puck slides back down, with its potential energy decreasing and its kinetic energy increasing.

The above dynamic system can be represented by a function of q and p known as the Hamiltonian, which for HMC is usually defined as the sum of a potential energy, U, depending only on the position and a kinetic energy, K, depending only on the momentum:
$$ H(q, p) = U(q) + K(p) $$
(1)
The partial derivatives of H(q,p) determine how q and p change over time, according to Hamilton’s equations:
$$ \begin{array}{rcl} \displaystyle\frac{dq_{j}}{dt} & = & \displaystyle\frac{\partial H}{\partial p_{j}} = \displaystyle\frac{\partial K}{\partial p_{j}} \\[16pt] \displaystyle\frac{dp_{j}}{dt} & = & \displaystyle- \frac{\partial H}{\partial q_{j}} = \displaystyle- \frac{\partial U}{\partial q_{j}} \end{array} $$
(2)
These equations define a mapping \(T_{s}\) from the state at some time t to the state at time t+s.
We can use Hamiltonian dynamics to sample from some distribution of interest by defining the potential energy function to be minus the log of the density function of this distribution (plus any constant). The position variables, q, then correspond to the variables of interest. We also introduce fictitious momentum variables, p, of the same dimension as q, which will have a distribution defined by the kinetic energy function. The joint density of q and p is defined by the Hamiltonian function as
$$P(q, p) = \frac{1}{Z} \exp\bigl[-H(q, p) \bigr] $$
When H(q,p)=U(q)+K(p), as we assume in this paper, we have
$$P(q, p) = \frac{1}{Z}\exp\bigl[ -U(q) \bigr] \exp\bigl[ -K(p) \bigr] $$
so q and p are independent. Typically, \(K(p) = p^{T} M^{-1} p/2\), with M usually being a diagonal matrix with elements \(m_{1}, \ldots, m_{d}\), so that \(K(p) = \sum_{i} p^{2}_{i}/2m_{i}\). The \(p_{j}\) are then independent and Gaussian with mean zero, with \(p_{j}\) having variance \(m_{j}\).
In applications to Bayesian statistics, q consists of the model parameters (and perhaps latent variables), and our objective is to sample from the posterior distribution for q given the observed data D. To this end, we set
$$U(q) = -\log\bigl[P(q)L(q|D)\bigr] $$
where P(q) is our prior and L(q|D) is the likelihood function given data D.

Having defined a Hamiltonian function corresponding to the distribution of interest (e.g., a posterior distribution of model parameters), we could in theory use Hamilton’s equations, applied for some specified time period, to propose a new state in the Metropolis algorithm. Since Hamiltonian dynamics leaves invariant the value of H (and hence the probability density), and preserves volume, this proposal would always be accepted. (For a more detailed explanation, see Neal 2010.)

In practice, however, solving Hamilton’s equations exactly is too hard, so we need to approximate these equations by discretizing time, using some small stepsize ε. For this purpose, the leapfrog method is commonly used. It consists of iterating the following steps:
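$$ \begin{array}{rcl} p_{j}(t + \varepsilon/2) & = & p_{j}(t) - (\varepsilon/2) \displaystyle\frac{\partial U}{\partial q_{j}}\bigl(q(t)\bigr) \\[12pt] q_{j}(t + \varepsilon) & = & q_{j}(t) + \varepsilon \displaystyle\frac{\partial K}{\partial p_{j}}\bigl(p(t + \varepsilon/2)\bigr) \\[12pt] p_{j}(t + \varepsilon) & = & p_{j}(t + \varepsilon/2) - (\varepsilon/2) \displaystyle\frac{\partial U}{\partial q_{j}}\bigl(q(t + \varepsilon)\bigr) \end{array} $$
(3)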
In a typical case when \(K(p) = \sum_{i} p^{2}_{i}/2m_{i}\), the time derivative of qj is ∂K/∂pj=pj/mj. The computational cost of a leapfrog step will then usually be dominated by evaluation of ∂U/∂qj.
We can use some number, L, of these leapfrog steps, with some stepsize, ε, to propose a new state in the Metropolis algorithm. We apply these steps starting at the current state (q,p), with fictitious time set to t=0. The final state, at time t=Lε, is taken as the proposal, \((q^{*}, p^{*})\). To make the proposal symmetric, we would need to negate the momentum at the end of the trajectory (supposing that K(p)=K(−p)), but here p will be replaced anyway (see below) before the next update. This proposal is then either accepted or rejected (with the state remaining unchanged), with the acceptance probability being
$$\min\bigl[1, \exp\bigl(-H\bigl(q^{*}, p^{*}\bigr)+H(q, p) \bigr)\bigr] $$

These Metropolis updates will leave H approximately constant, and therefore do not explore the whole joint distribution of q and p. The HMC method therefore alternates these Metropolis updates with updates in which the momentum is sampled from its distribution (which is independent of q when H has the form in Eq. (1)). When \(K(p) = \sum_{i} p^{2}_{i}/2m_{i}\), each pj is sampled independently from the Gaussian distribution with mean zero and variance mj.

As an illustration, consider sampling from the following bivariate normal distribution
$$q \sim N(\mu, \varSigma), \quad\textrm{with } \mu= \binom{3}{3}\ \textrm{and}\ \varSigma= \binom{1 \quad0.95}{0.95\quad1} $$
For HMC, we set L=20 and ε=0.15. The left plot in Figure 1 shows the first 30 states from an HMC run started with q=(0,0). The density contours of the bivariate normal distribution are shown as gray ellipses. The right plot shows every 20th state from the first 600 iterations of a run of a simple random walk Metropolis (RWM) algorithm. (This takes time comparable to that for the HMC run.) The proposal distribution for RWM is a bivariate normal with the current state as the mean, and \(0.15^{2} I_{2}\) as the covariance matrix. (The standard deviation of this proposal is the same as the stepsize of HMC.) Figure 1 shows that HMC explores the distribution more efficiently, with successive samples being further from each other, and autocorrelations being smaller. For an extended review of HMC, its properties, and its advantages over the simple random walk Metropolis algorithm, see Neal (2010).
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig1_HTML.gif
Fig. 1

Comparison of Hamiltonian Monte Carlo (HMC) and Random Walk Metropolis (RWM) when applied to a bivariate normal distribution. Left plot: The first 30 iterations of HMC with 20 leapfrog steps. Right plot: The first 30 iterations of RWM with 20 updates per iteration
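The HMC procedure described above (momentum resampling, L leapfrog steps, Metropolis acceptance) can be sketched for this bivariate normal example as follows. This is a minimal illustration rather than the authors' implementation: it assumes unit masses, so K(p)=|p|²/2, and the function names, the seed, and the default of 2000 iterations are choices made here for illustration only.

```python
import math
import random

# Target from the text: N(mu, Sigma) with mu = (3, 3), unit variances, correlation 0.95.
MU = (3.0, 3.0)
RHO = 0.95


def potential(q):
    """U(q): minus the log density of the bivariate normal, up to a constant."""
    x, y = q[0] - MU[0], q[1] - MU[1]
    return (x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO ** 2))


def grad_potential(q):
    """Gradient of U, i.e. Sigma^{-1} (q - mu)."""
    x, y = q[0] - MU[0], q[1] - MU[1]
    c = 1.0 / (1.0 - RHO ** 2)
    return [c * (x - RHO * y), c * (y - RHO * x)]


def hmc_update(q, eps, n_steps, rng):
    """One HMC iteration with K(p) = |p|^2 / 2; returns (state, accepted?)."""
    p = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # fresh momentum
    h_old = potential(q) + 0.5 * (p[0] ** 2 + p[1] ** 2)
    q_new, p_new = list(q), list(p)
    for _ in range(n_steps):
        g = grad_potential(q_new)
        p_new = [p_new[j] - 0.5 * eps * g[j] for j in range(2)]  # half step for p
        q_new = [q_new[j] + eps * p_new[j] for j in range(2)]    # full step for q
        g = grad_potential(q_new)
        p_new = [p_new[j] - 0.5 * eps * g[j] for j in range(2)]  # half step for p
    h_new = potential(q_new) + 0.5 * (p_new[0] ** 2 + p_new[1] ** 2)
    # Metropolis test: accept with probability min(1, exp(H_old - H_new)).
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return list(q), False


def run_hmc(n_iter=2000, eps=0.15, n_steps=20, seed=1):
    """Run HMC from q = (0, 0); returns the sampled states and the acceptance rate."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    samples, n_accept = [], 0
    for _ in range(n_iter):
        q, accepted = hmc_update(q, eps, n_steps, rng)
        n_accept += accepted
        samples.append(q)
    return samples, n_accept / n_iter
```

Calling `run_hmc()` uses the settings quoted in the text (L=20, ε=0.15, start at (0,0)). For clarity the sketch recomputes the gradient at the start of every step; an implementation concerned with cost would reuse the gradient from the end of the previous step.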

In this example, we have assumed that one leapfrog step for HMC (which requires evaluating the gradient of the log density) takes approximately the same computation time as one Metropolis update (which requires evaluating the log density), and that both move approximately the same distance. The benefit of HMC comes from this movement being systematic, rather than in a random walk.1 We now propose a new approach called Split Hamiltonian Monte Carlo (Split HMC), which further improves the performance of HMC by modifying how steps are done, with the effect of reducing the time for one step or increasing the distance that one step moves.

2 Splitting the Hamiltonian

As discussed by Neal (2010), variations on HMC can be obtained by using discretizations of Hamiltonian dynamics derived by “splitting” the Hamiltonian, H, into several terms:
$$H(q, p) = H_{1}(q, p) + H_{2}(q, p) + \cdots + H_{k}(q, p) $$
We use \(T_{i,t}\), for i=1,…,k, to denote the mapping defined by \(H_{i}\) for time t. Assuming that we can implement Hamiltonian dynamics for each \(H_{i}\) exactly, the composition \(T_{1,\varepsilon} \circ T_{2,\varepsilon} \circ \cdots \circ T_{k,\varepsilon}\) is a valid discretization of Hamiltonian dynamics based on H if the \(H_{i}\) are twice differentiable (Leimkuhler and Reich 2004). This discretization is symplectic and hence preserves volume. It will also be reversible if the sequence of \(H_{i}\) is symmetric: \(H_{i}(q,p) = H_{k-i+1}(q,p)\).
Indeed, the leapfrog method (3) can be regarded as a symmetric splitting of the Hamiltonian H(q,p)=U(q)+K(p) as
$$ H(q, p) = U(q)/2 + K(p) + U(q)/2 $$
(4)
In this case, H1(q,p)=H3(q,p)=U(q)/2 and H2(q,p)=K(p). Hamiltonian dynamics for H1 is
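$$ \begin{array}{rcl} \displaystyle\frac{dq_{j}}{dt} & = & \displaystyle\frac{\partial H_{1}}{\partial p_{j}} = 0 \\[16pt] \displaystyle\frac{dp_{j}}{dt} & = & \displaystyle- \frac{\partial H_{1}}{\partial q_{j}} = \displaystyle- \frac{1}{2} \frac{\partial U}{\partial q_{j}} \end{array} $$
which for a duration of ε gives the first part of a leapfrog step. For H2, the dynamics is
$$ \begin{array}{rcl} \displaystyle\frac{dq_{j}}{dt} & = & \displaystyle\frac{\partial H_{2}}{\partial p_{j}} = \displaystyle\frac{\partial K}{\partial p_{j}} \\[16pt] \displaystyle\frac{dp_{j}}{dt} & = & \displaystyle- \frac{\partial H_{2}}{\partial q_{j}} = 0 \end{array} $$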
For time ε, this gives the second part of the leapfrog step. Hamiltonian dynamics for H3 is the same as that for H1 since H1=H3, giving the third part of the leapfrog step.
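The view of a leapfrog step as a composition of three exact flows can be sketched in code for a one-dimensional state. This is an illustrative sketch only, assuming K(p)=p²/2; the function names are hypothetical.

```python
def flow_half_potential(q, p, t, grad_u):
    """Exact dynamics of H(q, p) = U(q)/2 for time t: q stays fixed, p moves linearly."""
    return q, p - 0.5 * t * grad_u(q)


def flow_kinetic(q, p, t):
    """Exact dynamics of H(q, p) = K(p) = p^2/2 for time t: p stays fixed, q moves linearly."""
    return q + t * p, p


def leapfrog_step(q, p, eps, grad_u):
    """Symmetric composition of the flows of U/2, K and U/2, each applied for time eps."""
    q, p = flow_half_potential(q, p, eps, grad_u)
    q, p = flow_kinetic(q, p, eps)
    q, p = flow_half_potential(q, p, eps, grad_u)
    return q, p
```

Because the three terms are arranged symmetrically, taking a step, negating the momentum, and taking another step returns to the starting position, which is the reversibility property noted above.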

2.1 Splitting the Hamiltonian when a partial analytic solution is available

Suppose the potential energy U(q) can be written as U0(q)+U1(q). Then, we can split H as
$$ H(q, p) = U_{1}(q)/2 + \bigl[U_{0}(q) + K(p)\bigr] + U_{1}(q)/2 $$
(5)
Here, H1(q,p)=H3(q,p)=U1(q)/2 and H2(q,p)=U0(q)+K(p). The first and the last terms in this splitting are similar to Eq. (4), except that U1(q) replaces U(q), so the first and the last part of a leapfrog step remain as before, except that we use U1(q) rather than U(q) to update p. Now suppose that the middle part of the leapfrog, which is based on the Hamiltonian U0(q)+K(p), can be handled analytically; that is, we can compute the exact dynamics for any duration of time. We hope that since this part of the simulation introduces no error, we will be able to use a larger stepsize, and hence take fewer steps, reducing the computation time for the dynamical simulations.

We are mainly interested in situations where U0(q) provides a reasonable approximation to U(q), and in particular in Bayesian applications, where we approximate U by focusing on the posterior mode, \(\hat{q}\), and the second derivatives of U at that point. We can obtain \(\hat{q}\) using fast methods such as Newton-Raphson iteration when analytical solutions are not available. We then approximate U(q) with U0(q), the energy function for \(N(\hat{q}, \mathcal{J}^{-1}(\hat{q}))\), where \(\mathcal{J}(\hat{q})\) is the Hessian matrix of U at \(\hat{q}\). Finally, we set U1(q)=U(q)−U0(q), the error in this approximation.

Beskos et al. (2011) have recently proposed a similar splitting strategy for HMC, in which a Gaussian component is handled analytically, in the context of high-dimensional approximations to a distribution on an infinite-dimensional Hilbert space. In such applications, the Gaussian distribution will typically be derived from the problem specification, rather than being found as a numerical approximation, as we do here.

Using a normal approximation in which \(U_{0}(q) = \frac{1}{2} (q - \hat{q})^{T} \mathcal{J} (\hat{q}) (q - \hat{q})\), and letting \(K(p) = \frac{1}{2} p^{T}p\) (the energy for the standard normal distribution), H2(q,p)=U0(q)+K(p) in Eq. (5) will be quadratic, and Hamilton’s equations will be a system of first-order linear differential equations that can be handled analytically (Polyanin et al. 2002). Specifically, setting \(q^{*} = q - \hat{q}\), the dynamical equations can be written as follows:
$$ \frac{d}{dt} \left[ \begin{array}{c} q^{*}(t) \\ p(t) \end{array} \right] = \left[ \begin{array}{cc} 0 & I \\ -\mathcal{J}(\hat{q}) & 0 \end{array} \right] \left[ \begin{array}{c} q^{*}(t) \\ p(t) \end{array} \right] $$
where I is the identity matrix. Defining \(X = (q^{*}, p)\), this can be written as \(\frac{d}{d t} X(t) = A X(t)\), where
$$ A = \left[ \begin{array}{cc} 0 & I \\ -\mathcal{J}(\hat{q}) & 0 \end{array} \right] $$
The solution of this system is \(X(t) = e^{At} X_{0}\), where \(X_{0}\) is the initial value at time t=0, and \(e^{At} = I + (At) + (At)^{2}/2! + \cdots\) is a matrix exponential. This can be simplified by diagonalizing the coefficient matrix A as
$$ A = \varGamma D \varGamma^{-1} $$
where Γ is invertible and D is a diagonal matrix. The system of equations can then be written as
$$ \frac{d}{dt} X(t) = \varGamma D \varGamma^{-1} X(t) $$
Now, let \(Y(t) = \varGamma^{-1} X(t)\). Then,
$$ \frac{d}{dt} Y(t) = D Y(t) $$
The solution for the above equation is \(Y(t) = e^{Dt} Y_{0}\), where \(Y_{0} = \varGamma^{-1} X_{0}\). Therefore,
$$ X(t) = \varGamma e^{Dt} \varGamma^{-1} X_{0} $$
and \(e^{Dt}\) can be easily computed by simply exponentiating the diagonal elements of D times t.
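To make the diagonalization concrete, the sketch below works through the one-dimensional case, where the Hessian is a positive scalar j and A is a 2×2 matrix with eigenvalues ±i√j. It is an illustration under that assumption, not the general d-dimensional routine; the function name is hypothetical, and a multivariate version would obtain Γ and D from a numerical eigendecomposition instead.

```python
import cmath
import math


def exact_quadratic_flow(q_star, p, t, j):
    """Solve d/dt (q*, p) = A (q*, p) with A = [[0, 1], [-j, 0]] via A = Gamma D Gamma^{-1}.

    Assumes a scalar state and j > 0. Returns (q*(t), p(t)).
    """
    omega = math.sqrt(j)
    lam1, lam2 = 1j * omega, -1j * omega  # diagonal of D (eigenvalues of A)
    # Columns of Gamma are the eigenvectors (1, lam1) and (1, lam2).
    # Y0 = Gamma^{-1} X0, written out for this 2x2 case:
    y1 = 0.5 * (q_star + p / lam1)
    y2 = 0.5 * (q_star + p / lam2)
    # Y(t) = exp(D t) Y0: exponentiate the diagonal entries of D times t.
    y1 *= cmath.exp(lam1 * t)
    y2 *= cmath.exp(lam2 * t)
    # X(t) = Gamma Y(t); the imaginary parts cancel up to rounding error.
    q_t = (y1 + y2).real
    p_t = (lam1 * y1 + lam2 * y2).real
    return q_t, p_t
```

In this scalar case the result agrees with the familiar rotation q*(t) = q* cos(ωt) + (p/ω) sin(ωt), p(t) = −ω q* sin(ωt) + p cos(ωt), with ω = √j, and it leaves j q*²/2 + p²/2 unchanged.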
The above analytical solution is of course for the middle part (denoted as H2) of Eq. (5) only. We still need to approximate the overall Hamiltonian dynamics based on H, using the leapfrog method. Algorithm 1 shows the corresponding leapfrog steps—after an initial step of size ε/2 based on U1(q), we obtain the exact solution for a time step of ε based on H2(q,p)=U0(q)+K(p), and finish by taking another step of size ε/2 based on U1(q).
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig2_HTML.gif
Algorithm 1

Leapfrog for split Hamiltonian Monte Carlo with a partial analytic solution
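The split leapfrog with a partial analytic solution can be sketched for a one-parameter problem as follows. This is a hedged illustration, not a transcription of Algorithm 1: the toy data set, the single-coefficient logistic model without intercept, the prior standard deviation, and all function names are assumptions made here. With a scalar parameter, the exact flow for U0(q)+K(p) reduces to a rotation, so no eigendecomposition is needed.

```python
import math
import random

# Hypothetical toy data: one covariate, no intercept, prior theta ~ N(0, SIGMA^2).
X = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 0.3, -0.3]
Y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
SIGMA = 5.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def U(theta):
    """Full potential energy: minus log prior minus log likelihood, up to a constant."""
    loglik = sum(y * theta * x - math.log1p(math.exp(theta * x)) for x, y in zip(X, Y))
    return theta ** 2 / (2.0 * SIGMA ** 2) - loglik


def grad_U(theta):
    return theta / SIGMA ** 2 - sum(x * (y - sigmoid(theta * x)) for x, y in zip(X, Y))


def hess_U(theta):
    return 1.0 / SIGMA ** 2 + sum(
        x * x * sigmoid(theta * x) * (1.0 - sigmoid(theta * x)) for x in X
    )


def find_map(theta=0.0, n_iter=50):
    """Newton-Raphson iteration for the posterior mode."""
    for _ in range(n_iter):
        theta -= grad_U(theta) / hess_U(theta)
    return theta


def gaussian_flow(q, p, t, q_hat, j):
    """Exact dynamics of U0(q) + p^2/2, with U0(q) = j (q - q_hat)^2 / 2, for time t."""
    omega = math.sqrt(j)
    d, c, s = q - q_hat, math.cos(omega * t), math.sin(omega * t)
    return q_hat + d * c + (p / omega) * s, -omega * d * s + p * c


def split_hmc_update(q, eps, n_steps, q_hat, j, rng):
    """One Split HMC iteration; only U1 = U - U0 is handled numerically."""
    def grad_U1(theta):
        return grad_U(theta) - j * (theta - q_hat)

    p = rng.gauss(0.0, 1.0)
    h_old = U(q) + 0.5 * p * p
    q_new, p_new = q, p
    for _ in range(n_steps):
        p_new -= 0.5 * eps * grad_U1(q_new)                        # half step based on U1
        q_new, p_new = gaussian_flow(q_new, p_new, eps, q_hat, j)  # exact step for U0 + K
        p_new -= 0.5 * eps * grad_U1(q_new)                        # half step based on U1
    h_new = U(q_new) + 0.5 * p_new * p_new  # the acceptance test uses the exact U
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False


def run_split_hmc(n_iter=2000, eps=0.5, n_steps=3, seed=1):
    """Returns the MAP estimate, the sampled values, and the acceptance rate."""
    rng = random.Random(seed)
    q_hat = find_map()
    j = hess_U(q_hat)
    q, draws, n_accept = q_hat, [], 0
    for _ in range(n_iter):
        q, accepted = split_hmc_update(q, eps, n_steps, q_hat, j, rng)
        n_accept += accepted
        draws.append(q)
    return q_hat, draws, n_accept / n_iter
```

The stepsize and number of steps in `run_split_hmc` are arbitrary illustrative values, not tuned settings from the experiments.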

2.2 Splitting the Hamiltonian by splitting the data

The method discussed in the previous section requires that we be able to handle the Hamiltonian H2(q,p)=U0(q)+K(p) analytically. If this is not so, splitting the Hamiltonian in this way may still be beneficial if the computational cost for U0(q) is substantially lower than for U(q). In these situations, we can use the following split:
$$ H(q, p) = U_{1}(q)/2 + \sum_{m=1}^{M} \bigl[ U_{0}(q)/(2M) + K(p)/M + U_{0}(q)/(2M) \bigr] + U_{1}(q)/2 $$
(6)
for some M>1. The above discretization can be considered as a nested leapfrog, where the outer part takes half steps to update p based on U1 alone, and the inner part involves M leapfrog steps of size ε/M based on U0. Algorithm 2 implements this nested leapfrog method.
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig3_HTML.gif
Algorithm 2

Nested leapfrog for split Hamiltonian Monte Carlo with splitting of data
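A nested leapfrog of this kind can be sketched as below, assuming K(p)=|p|²/2 and gradients supplied as plain Python functions returning lists. The function and argument names are hypothetical, and the sketch favors readability over cost: adjacent half steps for p are not merged, so it evaluates each gradient more often than an optimized version would.

```python
def nested_leapfrog(q, p, eps, n_outer, n_inner, grad_u0, grad_u1):
    """Simulate n_outer steps of size eps.

    Each outer step takes half steps for p based on U1 alone, wrapped around
    n_inner ordinary leapfrog steps of size eps / n_inner based on U0.
    """
    q, p = list(q), list(p)
    dim = len(q)
    h = eps / n_inner
    for _outer in range(n_outer):
        g1 = grad_u1(q)
        p = [p[j] - 0.5 * eps * g1[j] for j in range(dim)]   # outer half step (U1)
        for _inner in range(n_inner):
            g0 = grad_u0(q)
            p = [p[j] - 0.5 * h * g0[j] for j in range(dim)]  # inner half step (U0)
            q = [q[j] + h * p[j] for j in range(dim)]         # inner full step for q
            g0 = grad_u0(q)
            p = [p[j] - 0.5 * h * g0[j] for j in range(dim)]  # inner half step (U0)
        g1 = grad_u1(q)
        p = [p[j] - 0.5 * eps * g1[j] for j in range(dim)]   # outer half step (U1)
    return q, p
```

With `n_inner=1` this reduces to the standard leapfrog method applied to U0+U1. The benefit appears when U0 varies rapidly but is cheap to evaluate: the inner steps keep the simulation stable at an outer stepsize for which the standard method would diverge.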

For example, suppose our statistical analysis involves a large data set with many observations, but we believe that a small subset of data is sufficient to build a model that performs reasonably well (compared to the model that uses all the observations). In this case, we can construct U0(q) based on a small part of the observed data, and use the remaining observations to construct U1(q). If this strategy is successful, we will be able to use a large stepsize for steps based on U1, reducing the cost of a trajectory computation.

In detail, we divide the observed data, y, into two subsets: R0, which is used to construct U0(q), and R1, which is used to construct U1:
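$$ \begin{array}{rcl} U(\theta) & = & U_{0}(\theta) + U_{1}(\theta) \\[6pt] U_{0}(\theta) & = & -\log\bigl[P(\theta)\bigr] - \displaystyle\sum_{i \in R_{0}} \log\bigl[P(y_{i} | x_{i}, \theta)\bigr] \\[6pt] U_{1}(\theta) & = & - \displaystyle\sum_{i \in R_{1}} \log\bigl[P(y_{i} | x_{i}, \theta)\bigr] \end{array} $$
(7)
Here θ denotes the model parameters (the position variables q), and each sum runs over the cases assigned to the corresponding subset. Note that the prior appears in U0(θ) only.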

Neal (2010) discusses a related strategy for splitting the Hamiltonian by splitting the observed data into multiple subsets. However, instead of randomly splitting data, as proposed there, here we split data by building an initial model based on the maximum a posteriori (MAP) estimate, \(\hat{q}\), and use this model to identify a small subset of data that captures most of the information in the full data set.

3 Application of Split HMC to logistic regression models

We now look at how Split HMC can be applied to Bayesian logistic regression models for binary classification problems. We will illustrate this method using the simulated data set with n=100 data points and p=2 covariates that is shown in Fig. 2.
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig4_HTML.gif
Fig. 2

An illustrative binary classification problem with n=100 data points and two covariates, x1 and x2, with the two classes represented by white circles and black squares

The logistic regression model assigns probabilities to the two possible classes (denoted by 0 and 1) in case i (for i=1,…,n) as follows:
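$$ \begin{array}{rcl} P(y_{i} = 1 | x_{i}, \alpha, \beta) & = & \displaystyle\frac{\exp(\alpha + x_{i}^{T}\beta)}{1 + \exp(\alpha + x_{i}^{T}\beta)} \\[16pt] P(y_{i} = 0 | x_{i}, \alpha, \beta) & = & 1 - P(y_{i} = 1 | x_{i}, \alpha, \beta) \end{array} $$
(8)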
Here, xi is the vector of length p with the observed values of the covariates in case i, α is the intercept, and β is the vector of p regression coefficients. We use θ to denote the vector of all p+1 unknown parameters, (α,β).
Let P(θ) be the prior distribution for θ. The posterior distribution of θ given x and y is proportional to \(P(\theta)\prod_{i=1}^{n}P(y_{i}|x_{i}, \theta)\). The corresponding potential energy function is
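$$ U(\theta) = -\log\bigl[P(\theta)\bigr] - \sum_{i=1}^{n} \log\bigl[P(y_{i} | x_{i}, \theta)\bigr] $$
We assume the following (independent) priors for the model parameters:
$$ \alpha \sim N\bigl(0, \sigma_{\alpha}^{2}\bigr), \qquad \beta_{j} \sim N\bigl(0, \sigma_{\beta}^{2}\bigr), \quad j = 1, \ldots, p $$
where σα and σβ are known constants.
The potential energy function for the above logistic regression model is therefore as follows:
$$ U(\theta) = \frac{\alpha^{2}}{2\sigma_{\alpha}^{2}} + \sum_{j=1}^{p} \frac{\beta_{j}^{2}}{2\sigma_{\beta}^{2}} - \sum_{i=1}^{n} \bigl[ y_{i}\bigl(\alpha + x_{i}^{T}\beta\bigr) - \log\bigl(1 + \exp\bigl(\alpha + x_{i}^{T}\beta\bigr)\bigr) \bigr] $$
The partial derivatives of the energy function with respect to α and the βj are
$$ \begin{array}{rcl} \displaystyle\frac{\partial U}{\partial \alpha} & = & \displaystyle\frac{\alpha}{\sigma_{\alpha}^{2}} - \sum_{i=1}^{n} \biggl[ y_{i} - \frac{\exp(\alpha + x_{i}^{T}\beta)}{1 + \exp(\alpha + x_{i}^{T}\beta)} \biggr] \\[16pt] \displaystyle\frac{\partial U}{\partial \beta_{j}} & = & \displaystyle\frac{\beta_{j}}{\sigma_{\beta}^{2}} - \sum_{i=1}^{n} x_{ij} \biggl[ y_{i} - \frac{\exp(\alpha + x_{i}^{T}\beta)}{1 + \exp(\alpha + x_{i}^{T}\beta)} \biggr] \end{array} $$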
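The potential energy of this logistic regression model and its partial derivatives can be sketched in code as follows, assuming the independent zero-mean normal priors with standard deviations σα and σβ described above and dropping additive constants. The function names and the list-based data layout are illustrative choices, not the authors' R implementation.

```python
import math


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def potential_and_gradient(alpha, beta, xs, ys, sigma_alpha, sigma_beta):
    """Return (U, dU/dalpha, [dU/dbeta_j]) for the logistic regression posterior.

    xs is a list of covariate vectors, ys a list of 0/1 class labels.
    """
    u = alpha ** 2 / (2.0 * sigma_alpha ** 2)
    u += sum(b * b for b in beta) / (2.0 * sigma_beta ** 2)
    d_alpha = alpha / sigma_alpha ** 2
    d_beta = [b / sigma_beta ** 2 for b in beta]
    for x, y in zip(xs, ys):
        z = alpha + sum(b * xj for b, xj in zip(beta, x))  # linear predictor
        u -= y * z - log1pexp(z)                           # minus log likelihood term
        resid = y - sigmoid(z)                             # y_i - P(y_i = 1 | x_i, theta)
        d_alpha -= resid
        for j, xj in enumerate(x):
            d_beta[j] -= xj * resid
    return u, d_alpha, d_beta
```

The loop over cases is the costly part for large n, which is why the number of gradient evaluations is the natural measure of cost for a trajectory.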

3.1 Split HMC with a partial analytical solution for a logistic model

To apply Algorithm 1 for Split HMC to this problem, we approximate the potential energy function U(θ) for the logistic regression model with the potential energy function U0(θ) of the normal distribution \(N(\hat{\theta}, \mathcal{J}^{-1}(\hat{\theta}))\), where \(\hat{\theta}\) is the MAP estimate of model parameters. U0(θ) usually provides a reasonable approximation to U(θ), as illustrated in Fig. 3. In the plot on the left, the solid curve shows the value of the potential energy, U, as β1 varies, with β2 and α fixed to their MAP values, while the dashed curve shows U0 for the approximating normal distribution. The right plot of Fig. 3 compares the partial derivatives of U and U0 with respect to β1, showing that ∂U0/∂βj provides a reasonable linear approximation to ∂U/∂βj.
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig5_HTML.gif
Fig. 3

Left plot: The potential energy, U, for the logistic regression model (solid curve) and its normal approximation, U0 (dashed curve), as β1 varies, with other parameters at their MAP values. Right plot: The partial derivatives of U and U0 with respect to β1

Since there is no error when solving Hamiltonian dynamics based on U0(θ), we would expect that the total discretization error of the steps taken by Algorithm 1 will be less than for the standard leapfrog method, for a given stepsize, and that we will therefore be able to use a larger stepsize (and hence need fewer steps for a given trajectory length) while still maintaining a good acceptance rate. The stepsize will still be limited to the region of stability imposed by the discretization error from U1=U−U0, but this limit will tend to be larger than for the standard leapfrog method.

3.2 Split HMC with splitting of data for a logistic model

To apply Algorithm 2 to this logistic regression model, we split the Hamiltonian by splitting the data into two subsets. Consider the illustrative example discussed above. In the left plot of Fig. 4, the thick line represents the classification boundary using the MAP estimate, \(\hat{\theta}\). For the points that fall on this boundary line, the estimated probabilities for the two groups are equal, both being 1/2. The probabilities of the two classes become less equal as the distance of the covariates from this line increases. We will define U0 using the points within the region, R0, within some distance of this line, and define U1 using the points in the region, R1, at a greater distance from this line. Equivalently, R0 contains those points for which the probability that y=1 (based on the MAP estimates) is closest to 1/2.
https://static-content.springer.com/image/art%3A10.1007%2Fs11222-012-9373-1/MediaObjects/11222_2012_9373_Fig6_HTML.gif
Fig. 4

Left plot: A split of the data into two parts based on the MAP model, represented by the solid line; the energy function U is then divided into U0, based on the data points in R0, and U1, based on the data points in R1. Right plot: The partial derivatives of U and U0 with respect to β1, with other parameters at their MAP values

The shaded area in Fig. 4 shows the region, R0, containing the 30 % of the observations closest to the MAP line, or equivalently the 30 % of observations for which the probability of class 1 is closest (in either direction) to 1/2. The unshaded region containing the remaining data points is denoted as R1. Using these two subsets, we can split the energy function U(θ) into two terms: U0(θ) based on the data points that fall within R0, and U1 based on the data points that fall within R1 (see Eq. (7)). Then, we use Eq. (6) to split the Hamiltonian dynamics.
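The selection of R0 and R1 can be sketched as below, assuming the MAP estimates of the intercept and coefficients are already available. The function name and return format are hypothetical. Since the distance of the class-1 probability from 1/2 increases with the absolute value of the linear predictor, the sketch ranks cases by that absolute value, which gives the same ordering while avoiding overflow in the exponential.

```python
def split_data_by_map(xs, alpha_hat, beta_hat, fraction=0.3):
    """Split case indices into (r0, r1).

    r0 holds the given fraction of cases whose MAP probability of class 1 is
    closest to 1/2; r1 holds the remaining cases.
    """
    def abs_linear_predictor(i):
        return abs(alpha_hat + sum(b * xj for b, xj in zip(beta_hat, xs[i])))

    order = sorted(range(len(xs)), key=abs_linear_predictor)
    n0 = int(round(fraction * len(xs)))
    return sorted(order[:n0]), sorted(order[n0:])
```

The index sets returned here would then determine which cases enter the likelihood sum in U0 and which enter U1.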

Note that U0 is not used to approximate the potential energy function, U, for the acceptance test at the end of the trajectory—the exact value of U is used for this test, ensuring that the equilibrium distribution is exactly correct. Rather, ∂U0/∂βj is used to approximate ∂U/∂βj, which is the costly computation when we simulate Hamiltonian dynamics.

To see that it is appropriate to split the data according to how close the probability of class 1 is to 1/2, note first that the leapfrog step of Eq. (3) will have no error when the derivatives ∂U/∂qj do not depend on q—that is, when the second derivatives of U are zero. Recall that for the logistic model,
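$$ \frac{\partial U}{\partial \beta_{j}} = \frac{\beta_{j}}{\sigma_{\beta}^{2}} - \sum_{i=1}^{n} x_{ij} \biggl[ y_{i} - \frac{\exp(\alpha + x_{i}^{T}\beta)}{1 + \exp(\alpha + x_{i}^{T}\beta)} \biggr] $$
from which we get
$$ \frac{\partial^{2} U}{\partial \beta_{j} \partial \beta_{j'}} = \frac{\delta_{jj'}}{\sigma_{\beta}^{2}} + \sum_{i=1}^{n} x_{ij} x_{ij'} P(y_{i} = 1 | x_{i}, \alpha, \beta) \bigl[ 1 - P(y_{i} = 1 | x_{i}, \alpha, \beta) \bigr] $$
where \(\delta_{jj'}\) is 1 if j=j′ and 0 otherwise.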
The product P(yi=1|xi,α,β)[1−P(yi=1|xi,α,β)] is symmetrical around its maximum where P(yi=1|xi,α,β) is 1/2, justifying our criterion for selecting points in R0. The right plot of Fig. 4 shows the approximation of ∂U/∂β1 by ∂U0/∂β1 with β2 and α fixed to their MAP values.

4 Experiments

In this section, we use simulated and real data to compare our proposed methods to standard HMC. For each problem, we set the number of leapfrog steps to L=20 for standard HMC, and find ε such that the acceptance probability (AP) is close to 0.65 (Neal 2010). We set L and ε for the Split HMC methods such that the trajectory length, εL, remains the same, but with a larger stepsize and hence a smaller number of steps. Note that this trajectory length is not necessarily optimal for these problems, but this should not affect our comparisons, in which the length is kept fixed.

We try to choose ε for the Split HMC methods such that the acceptance probability is equal to that of standard HMC. However, increasing the stepsize beyond a certain point leads to instability of trajectories, in which the error of the Hamiltonian grows rapidly with L (Neal 2010), so that proposals are rejected with very high probability. This sometimes limits the stepsize of Split HMC to values at which the acceptance probability is greater than the 0.65 aimed at for standard HMC. Additionally, to avoid near periodic Hamiltonian dynamics (Neal 2010), we randomly vary the stepsize over a small range. Specifically, at each iteration of MCMC, we sample the stepsize from the Uniform (0.8ε,ε) distribution, where ε is the reported stepsize for each experiment.

To measure the efficiency of each sampling method, we use the autocorrelation time (ACT), estimated by dividing the N posterior samples into batches of size B, and estimating ACT as follows (Neal 1993; Geyer 1992):
$$ \tau = B \frac{S^{2}_{b}}{S^{2}} $$
Here, \(S^{2}\) is the sample variance and \(S^{2}_{b}\) is the sample variance of batch means. Following Thompson (2010), we divide the posterior samples into \(N^{1/3}\) batches of size \(B = N^{2/3}\). Throughout this section, we set the number of Markov chain Monte Carlo (MCMC) iterations for simulating posterior samples to N=50000.
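The batch-means estimate of the autocorrelation time can be sketched as follows, assuming the estimator is the batch size times the ratio of the variance of batch means to the overall sample variance, with batches of size about N^(2/3). The function names are illustrative, and `ar1_chain` is a hypothetical helper included only to generate a correlated test sequence.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def autocorrelation_time(samples):
    """Batch-means estimate of the autocorrelation time of a scalar sequence."""
    n = len(samples)
    batch_size = int(round(n ** (2.0 / 3.0)))  # B = N^(2/3)
    n_batches = n // batch_size                # about N^(1/3) batches
    used = samples[: n_batches * batch_size]   # drop any incomplete final batch
    batch_means = [
        sum(used[b * batch_size:(b + 1) * batch_size]) / batch_size
        for b in range(n_batches)
    ]
    return batch_size * sample_variance(batch_means) / sample_variance(used)


def ar1_chain(n, phi, seed=0):
    """Hypothetical test sequence: x_t = phi * x_{t-1} + standard normal noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out
```

For independent draws the estimate should be near 1, while for a first-order autoregressive sequence with coefficient φ the theoretical value is (1+φ)/(1−φ). With only about N^(1/3) batches, the estimate itself is fairly noisy.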

The autocorrelation time can be roughly interpreted as the number of MCMC transitions required to produce samples that can be considered as independent. For the logistic regression problems discussed in this section, we could find the autocorrelation time separately for each parameter and summarize the autocorrelation times using their maximum value (i.e., for the slowest moving parameter) to compare different methods. However, since one common goal is to use logistic regression models for prediction, we look at the autocorrelation time, τ, for the log likelihood, \(\sum_{i=1}^{n} \log[P(y_{i} | x_{i}, \theta)]\) using the posterior samples of θ. We also look at the autocorrelation time for \(\sum_{j} (\beta_{j})^{2}\) (denoting it τβ), since this may be more relevant when the goal is interpretation of parameter estimates.

We adjust τ (and similarly τβ) to account for the varying computation time needed by the different methods in two ways. One is to compare different methods using τ×s, where s is the CPU time per iteration, using an implementation written in R. This measures the CPU time required to produce samples that can be regarded as independent samples. We also compare in terms of τ×g, where g is the number of gradient computations required for each trajectory simulated by HMC, counted in units of one gradient evaluation over all cases in the full data set. This will be equal to the number of leapfrog steps, L, for standard HMC or Split HMC using a normal approximation. When using data splitting with a fraction f of data in R0 and M inner leapfrog steps, g will be (fM+(1−f))×L. In general, we expect that computation time will be dominated by the gradient computations counted by g, so that τ×g will provide a measure of performance independent of any particular implementation. In our experiments, s was close to being proportional to g, except for slightly larger than expected times for Split HMC with data splitting.

Note that compared to standard HMC, our two methods involve some computational overhead for finding the MAP estimate. However, the additional overhead associated with finding the MAP estimate remains negligible (less than a second for most examples discussed here) compared to the sampling time.

4.1 Simulated data

We first tested the methods on a simulated data set with 100 covariates and 10000 observations. The covariates were sampled as \(x_{ij} \sim N(0, \sigma^{2}_{j})\), for i=1,…,10000 and j=1,…,100, with σj set to 5 for the first five variables, to 1 for the next five variables, and to 0.2 for the remaining 90 variables. We sampled true parameter values, α and βj, independently from N(0,1) distributions. Finally, we sampled the class labels according to the model, as \(y_{i} \sim\operatorname{Bernoulli}(p_{i})\) with \(\operatorname{logit}(p_{i}) = \alpha+ x_{i}^{T}\beta\).

For the Bayesian logistic regression model, we assumed normal priors with mean zero and standard deviation 5 for α and βj, where j=1,…,100. We ran standard HMC, Split HMC with normal approximation, and Split HMC with data splitting for N=50000 iterations. For the standard HMC, we set L=20 and ε=0.015, so the trajectory length was 20×0.015=0.3. For Split HMC with normal approximation and Split HMC with data splitting, we reduce the number of leapfrog steps to 10 and 3 respectively, while increasing the stepsizes so that the trajectory length remained 0.3. For the data splitting method, we use 40 % of the data points for U0 and set M=9, which makes g equal 4.2L. Since we set L=3, we have g=12.6, which is smaller than g=L=20 used for the standard HMC algorithm.

Table 1 shows the results for the three methods. The CPU time (in seconds) per iteration, s, and τ×s are substantially lower for the Split HMC methods than for standard HMC. The comparison is similar in terms of τ×g. Based on τβ×s and τβ×g, however, the improvement in efficiency is more substantial for the data splitting method than for the normal approximation method, mainly because of the difference in their corresponding values of τβ.
Table 1

Split HMC (with normal approximation and data splitting) compared to standard HMC using simulated data, on a data set with n=10000 observations and p=100 covariates. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on \(\sum_{j} \beta_{j}^{2}\)

|      | HMC   | Split HMC (normal appr.) | Split HMC (data splitting) |
|------|-------|--------------------------|----------------------------|
| L    | 20    | 10                       | 3                          |
| g    | 20    | 10                       | 12.6                       |
| s    | 0.187 | 0.087                    | 0.096                      |
| AP   | 0.69  | 0.74                     | 0.74                       |
| τ    | 4.6   | 3.2                      | 3.0                        |
| τ×g  | 92    | 32                       | 38                         |
| τ×s  | 0.864 | 0.284                    | 0.287                      |
| τβ   | 11.7  | 13.5                     | 7.3                        |
| τβ×g | 234   | 135                      | 92                         |
| τβ×s | 2.189 | 1.180                    | 0.703                      |

4.2 Results on real data sets

In this section, we evaluate our proposed methods using three real binary classification problems. The data for these three problems are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html). For all data sets, we standardized the numerical variables to have mean zero and standard deviation 1. Further, we assumed normal priors with mean zero and standard deviation 5 for the regression parameters. We used the setup described at the beginning of Sect. 4, running each Markov chain for N=50000 iterations. Table 2 summarizes the results using the three sampling methods.
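Table 2

HMC and Split HMC (normal approximation and data splitting) on three real data sets. Here, L is the number of leapfrog steps, g is the number of gradient computations, s is the CPU time (in seconds) per iteration, AP is the acceptance probability, τ is the autocorrelation time based on the log likelihood, and τβ is the autocorrelation time based on \(\sum_{j} \beta_{j}^{2}\)

| Data set               |      | HMC   | Split HMC (normal appr.) | Split HMC (data splitting) |
|------------------------|------|-------|--------------------------|----------------------------|
| StatLog (n=4435, p=37) | L    | 20    | 14                       | 3                          |
|                        | g    | 20    | 14                       | 13.8                       |
|                        | s    | 0.033 | 0.026                    | 0.023                      |
|                        | AP   | 0.69  | 0.74                     | 0.85                       |
|                        | τ    | 5.6   | 6.0                      | 4.0                        |
|                        | τ×g  | 112   | 84                       | 55                         |
|                        | τ×s  | 0.190 | 0.144                    | 0.095                      |
|                        | τβ   | 5.6   | 4.7                      | 3.8                        |
|                        | τβ×g | 112   | 66                       | 52                         |
|                        | τβ×s | 0.191 | 0.122                    | 0.090                      |
| CTG (n=2126, p=21)     | L    | 20    | 13                       | 2                          |
|                        | g    | 20    | 13                       | 9.8                        |
|                        | s    | 0.011 | 0.008                    | 0.005                      |
|                        | AP   | 0.69  | 0.77                     | 0.81                       |
|                        | τ    | 6.2   | 7.0                      | 5.0                        |
|                        | τ×g  | 124   | 91                       | 47                         |
|                        | τ×s  | 0.069 | 0.055                    | 0.028                      |
|                        | τβ   | 24.4  | 19.6                     | 11.5                       |
|                        | τβ×g | 488   | 255                      | 113                        |
|                        | τβ×s | 0.271 | 0.154                    | 0.064                      |
| Chess (n=3196, p=36)   | L    | 20    | 9                        | 2                          |
|                        | g    | 20    | 13                       | 11.8                       |
|                        | s    | 0.022 | 0.011                    | 0.013                      |
|                        | AP   | 0.62  | 0.73                     | 0.62                       |
|                        | τ    | 10.7  | 12.8                     | 12.1                       |
|                        | τ×g  | 214   | 115                      | 143                        |
|                        | τ×s  | 0.234 | 0.144                    | 0.161                      |
|                        | τβ   | 23.4  | 18.9                     | 19.0                       |
|                        | τβ×g | 468   | 246                      | 224                        |
|                        | τβ×s | 0.511 | 0.212                    | 0.252                      |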

The first problem, StatLog, involves using multi-spectral values of pixels in a satellite image to classify the associated area as soil or cotton crop. (In the original data, different types of soil are identified.) The sample size for this data set is n=4435, and the number of features is p=37. For standard HMC, we set L=20 and ε=0.08. For the two Split HMC methods, with normal approximation and data splitting, we reduced L to 14 and 3 respectively, while increasing ε so that ε×L remained the same as for standard HMC. For the data splitting method, we used 40 % of the data points for U0 and set M=10. As seen in the table, the Split HMC methods improve efficiency, with the data splitting method performing better than the normal approximation method.

The second problem, CTG, involves analyzing 2126 fetal cardiotocograms along with their respective diagnostic features (de Campos et al. 2000). The objective is to determine whether the fetal state class is “pathologic” or not. The data include 2126 observations and 21 features. For the standard HMC, we set L=20 and ε=0.08. We reduced the number of leapfrog steps to 13 and 2 for Split HMC with normal approximation and data splitting respectively. For the latter, we use 30 % of data points for U0 and set M=14. Both splitting methods improved performance significantly.

The objective of the last problem, Chess, is to predict chess endgame outcomes—either “white can win” or “white cannot win”. This data set includes n=3196 instances, where each instance is a board-description for the chess endgame. There are p=36 attributes describing the board. For standard HMC, we set L=20 and ε=0.09. For the two Split HMC methods with normal approximation and data splitting, we reduced L to 9 and 2 respectively. For the data splitting method, we use 35 % of the data points for U0 and set M=15. Using the Split HMC methods, the computational efficiency is improved substantially compared to standard HMC. This time however, the normal approximation approach performs better than the data splitting method in terms of τ×g, τ×s, and τβ×s, while the latter performs better in terms of τβ×g.

5 Discussion

We have proposed two new methods for improving the efficiency of HMC, both based on splitting the Hamiltonian in a way that allows much of the movement around the state space to be performed at low computational cost.

While we demonstrated our methods on binary logistic regression models, they can be extended to multinomial logistic (MNL) models for multiple classes. For MNL models, the regression parameters for p covariates and J classes form a matrix of (p+1) rows and J columns, which we can regard as a vector of (p+1)×J elements. For Split HMC with normal approximation, we can define U0(θ) using an approximate multivariate normal \(N(\hat{\theta}, \mathcal{J}^{-1}(\hat{\theta}))\) as before. For Split HMC with data splitting, we can still construct U0(θ) using a small subset of data, based on the class probabilities for each data item found using the MAP estimates for the parameters (the best way of doing this is a subject for future research). The data splitting method could be further extended to any model for which it is feasible to find a MAP estimate, and then divide the data into two parts based on “residuals” of some form.
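
As a rough illustration of dividing the data into two parts based on MAP quantities, the sketch below ranks cases by the absolute value of the linear predictor at the MAP estimate and places the fraction f with the smallest values (fitted probabilities nearest 0.5) into R0. This particular criterion, the function name `split_by_map_margin`, and the toy inputs are assumptions made for the example; other residual-based rules could be substituted.

```python
def split_by_map_margin(X, alpha_hat, beta_hat, f):
    """Return (R0, R1) as sorted index lists.

    R0 holds the fraction f of cases whose MAP linear predictor
    alpha_hat + x' beta_hat is closest to zero; R1 holds the rest.
    """
    margins = [
        abs(alpha_hat + sum(xj * bj for xj, bj in zip(x, beta_hat)))
        for x in X
    ]
    # Sort case indices from nearest to farthest from the decision boundary.
    order = sorted(range(len(X)), key=lambda i: margins[i])
    n0 = int(round(f * len(X)))
    return sorted(order[:n0]), sorted(order[n0:])


# Hypothetical toy data: five cases, one covariate, MAP estimate (0, [1]).
X_toy = [[-3.0], [0.1], [2.0], [-0.2], [1.0]]
R0, R1 = split_by_map_margin(X_toy, alpha_hat=0.0, beta_hat=[1.0], f=0.4)
print(R0, R1)  # [1, 3] [0, 2, 4]
```

U0 would then be built from the cases indexed by R0 and U1 from those indexed by R1, with the prior assigned to one of the two parts.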

Although in theory our methods can be used for many statistical models, their usefulness is of course limited by how well the posterior distribution can be approximated by a Gaussian distribution in Algorithm 1, and how well the gradient of the energy function can be approximated using a small but influential subset of data in Algorithm 2. For example, Algorithm 1 might not perform well for neural network models, for which the posterior distribution is usually multimodal. When using neural network classification models, one could however use Algorithm 2, selecting a small subset of data using a simple logistic regression model. This could be successful when a linear model performs reasonably well, even if the optimal decision boundary is nonlinear.

The scope of Algorithm 1 proposed in this paper might be broadened by finding better methods to approximate the posterior distribution, such as variational Bayes methods. Future research could involve finding tractable approximations to the posterior distribution other than normal distributions. Also, one could investigate other methods for splitting the Hamiltonian dynamics by splitting the data—for example, fitting a support vector machine (SVM) to binary classification data, and using the support vectors for constructing U0.

While the results on simulated data and real problems presented in this paper have demonstrated the advantages of splitting the Hamiltonian dynamics in terms of improving the sampling efficiency, our proposed methods do require preliminary analysis of the data, mainly finding the MAP estimate. As mentioned above, the performance of our approach depends on how well the normal distribution based on the MAP estimate approximates the posterior distribution, or how well a small subset of data found using this MAP estimate captures the overall patterns in the whole data set. Moreover, this preliminary analysis involves some computational overhead. For many problems, however, the computational cost associated with finding the MAP estimate is negligible compared to the potential improvement in sampling efficiency for the full Bayesian model. For most of the examples discussed here, the additional computational cost is less than a second. Of course, there are situations for which finding the MAP estimate could be an issue; this is especially true for high dimensional problems. For such cases, it might be more practical to use Algorithm 2 after selecting a small but influential subset of data based on probabilities found using a simpler model. For the neural network example discussed above, we can use a simple logistic regression model with maximum likelihood estimates to select the data points for U0.

Although normal approximations have been used for Bayesian inference in the past (see, for example, Tierney and Kadane 1986), we use them for exploring the parameter space more efficiently while sampling from the exact distribution. One could of course use the approximate normal (Laplace) distribution as a proposal distribution in a Metropolis-Hastings algorithm. Using this approach, however, the acceptance rates drop substantially (below 10 %) for our examples.
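
For comparison, the alternative mentioned here, a Metropolis-Hastings sampler that proposes independently from a fixed normal approximation, can be sketched as follows. The one-dimensional setting, the function name `independence_mh`, and the toy skewed target are assumptions for illustration only; the sketch does not attempt to reproduce the acceptance rates reported for the logistic regression examples.

```python
import math
import random


def independence_mh(log_target, mu, sd, n_iter, seed=0):
    """Independence Metropolis-Hastings with a fixed N(mu, sd^2) proposal.

    log_target may be unnormalized. Returns the chain and the acceptance rate.
    """
    rng = random.Random(seed)

    def log_q(x):
        # Log proposal density up to an additive constant.
        return -0.5 * ((x - mu) / sd) ** 2

    x = mu
    accepted = 0
    samples = []
    for _ in range(n_iter):
        prop = rng.gauss(mu, sd)
        # Ratio of importance weights target/proposal at the two states.
        log_ratio = (log_target(prop) - log_q(prop)) - (log_target(x) - log_q(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = prop
            accepted += 1
        samples.append(x)
    return samples, accepted / n_iter


# Toy skewed target (assumed): a standard normal kernel times a logistic factor.
def log_target(x):
    return -0.5 * x * x - math.log1p(math.exp(-3.0 * x))


chain, rate = independence_mh(log_target, mu=0.0, sd=1.0, n_iter=5000, seed=1)
print(rate)
```

The acceptance rate of such a sampler depends on how closely the proposal matches the target in every direction, which is why a mismatch that is tolerable in one dimension can become severe as the dimension grows.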

Another approach to improving HMC has recently been proposed by Girolami and Calderhead (2011). Their method, Riemannian Manifold HMC (RMHMC), can also substantially improve performance. RMHMC utilizes the geometric properties of the parameter space to explore the best direction, typically at higher computational cost, to produce distant proposals with high probability of acceptance. In contrast, our method attempts to find a simple approximation to the Hamiltonian to reduce the computational time required for reaching distant states. It is possible that these approaches could be combined, to produce a method that performs better than either method alone. The recent proposals of Hoffman and Gelman (2011) for automatic tuning of HMC could also be combined with our Split HMC methods.

Footnotes
1

Indeed, in this two-dimensional example, it is better to use Metropolis with a large proposal standard deviation, even though this leads to a low acceptance probability, because this also avoids a random walk. However, in higher-dimensional problems with more than one highly-confining direction, a large proposal standard deviation leads to such a low acceptance probability that this strategy is not viable.

 

Acknowledgements

B. Shahbaba is supported by the National Science Foundation, Grant No. IIS-1216045. R.M. Neal’s work is supported by the Natural Sciences and Engineering Research Council of Canada. He holds a Canada Research Chair in Statistics and Machine Learning.

Copyright information

© Springer Science+Business Media New York 2013