Introduction

In this paper we study Hamiltonian Monte Carlo (HMC) algorithms (Neal 2011) that are not based on the standard kinetic/potential splitting of the Hamiltonian.

The computational cost of HMC samplers mostly originates from the numerical integrations that have to be performed to get the proposals. If the target distribution has density proportional to \( \exp (-U(\theta ))\), \(\theta \in {\mathbb {R}}^d\), the differential system to be integrated is given by Hamilton’s equations corresponding to the Hamiltonian function \(H(\theta ,p) = (1/2)p^TM^{-1}p+U(\theta )\), where \(p\sim {\mathcal {N}}(0,M)\) is the auxiliary momentum variable and M is the symmetric, positive definite mass matrix chosen by the user. In a mechanical analogy, H is the (total) energy, while \({\mathcal {T}}(p)=(1/2)p^TM^{-1}p\) and \(U(\theta )\) are respectively the kinetic and potential energies. The Störmer/leapfrog/Verlet integrator is the method of choice to carry out those integrations and is based on the idea of splitting (Blanes and Casas 2017), i.e. the evolution of \((\theta ,p)\) under H is simulated by the separate evolutions under \({\mathcal {T}}(p)\) and \(U(\theta )\) (kinetic/potential splitting). However, \(H(\theta ,p) = {\mathcal {T}}(p)+U(\theta )\) is not the only splitting that has been considered in the literature. In some applications one may write \(H(\theta ,p) = H_0(\theta ,p)+U_1(\theta )\), with \(H_0(\theta ,p) = {\mathcal {T}}(p)+U_0(\theta )\), \(U(\theta ) = U_0(\theta )+U_1(\theta )\), and replace the evolution under H by the evolutions under \(H_0\) and \(U_1\) (Neal 2011). The paper (Shahbaba et al. 2014) investigated this possibility; two algorithms were formulated, referred to there as “Leapfrog with a partial analytic solution” and “Nested leapfrog”. Both suggested algorithms were shown to outperform, in four logistic regression problems, HMC based on the standard leapfrog integrator.

In this article we reexamine \(H(\theta ,p) = H_0(\theta ,p)+U_1(\theta )\) splittings, in particular in the case where the equations for \(H_0\) can be integrated analytically (partial analytic solution) because \(U_0(\theta )\) is a quadratic function (so that \(\propto \exp (-U_0(\theta ))\) is a Gaussian distribution). When \(U_1\) is slowly varying, the splitting \(H=H_0+U_1\) is appealing because, to quote (Shahbaba et al. 2014), “only the slowly-varying part of the energy needs to be handled numerically and this can be done with a larger stepsize (and hence fewer steps) than would be necessary for a direct simulation of the dynamics”.

Our contributions are as follows:

  1.

    In Section 3 we show, by means of a counterexample, that it is not necessarily true that, when \(H_0\) is handled analytically and \(U_1\) is small, the integration may be carried out with stepsizes substantially larger than those required by standard leapfrog. For integrators based on the \(H_0+U_1\) splitting, the stepsize may suffer from important stability restrictions, regardless of the size of \(U_1\).

  2.

    In Section 4 we show that, by combining the \(H_0+U_1\) splitting with the idea of preconditioning the dynamics, which goes back at least to (Bennett 1975), it is possible to bypass the stepsize limitations mentioned in the preceding item.

  3.

    We present an integrator (that we call RKR) for the \(H_0+U_1\) splitting that provides an alternative to the integrator tested in (Shahbaba et al. 2014) (that we call KRK).

  4.

    Numerical experiments in the final Section 5, using the test problems in (Shahbaba et al. 2014), show that the advantages of moving from standard leapfrog HMC to the \(H_0+U_1\) splitting (without preconditioning) are much smaller than the advantages of using preconditioning while keeping the standard kinetic/potential splitting. The best performance is obtained when the \(H_0+U_1\) splitting is combined with the preconditioning of the dynamics. In particular the RKR integration technique with preconditioning decreases the computational cost by more than an order of magnitude in all test problems and all observables considered.

There are two appendices. In the first, we illustrate the use of the Bernstein-von Mises theorem (see e.g. section 10.2 in (van der Vaart 1998)) to justify the soundness of the \(H_0+U_1\) splitting. The second is devoted to presenting a methodology to discriminate between different integrators of the preconditioned dynamics for the \(H_0+U_1\) splitting; in particular we provide analyses that support the advantages of the RKR technique over its KRK counterpart observed in the experiments.

Preliminaries

Hamiltonian Monte Carlo

HMC is based on the observation that (Neal 2011; Sanz-Serna 2014), for each fixed \(T>0\), the exact solution map (flow) \((\theta (T),p(T))=\varphi _T(\theta (0),p(0))\) of the Hamiltonian system of differential equations in \({\mathbb {R}}^{2d}\)

$$\begin{aligned} \frac{d{\theta }}{d{t}}=\frac{\partial H}{\partial p}=M^{-1}p,\quad \quad \frac{d{p}}{d{t}}=-\frac{\partial H}{\partial \theta }=-\nabla U(\theta ), \end{aligned}$$
(1)

exactly preserves the density \( \propto \exp (-H(\theta ,p)) =\exp (-{\mathcal {T}}(p)-U(\theta )) \) whose \(\theta \)-marginal is the target \(\propto \exp (-U(\theta ))\), \(\theta \in {\mathbb {R}}^d\). In HMC, (1) is integrated numerically over an interval \(0\le t\le T\) taking as initial condition the current state \((\theta ,p)\) of the Markov chain; the numerical solution at \(t=T\) provides the proposal \((\theta ^\prime ,p^\prime )\) that is accepted with probability

$$\begin{aligned} a=\min \left\{ 1,e^{-\big (H(\theta ^\prime ,p^\prime )-H(\theta ,p)\big )}\right\} . \end{aligned}$$
(2)

This formula for the acceptance probability assumes that the numerical integration has been carried out with an integrator that is both symplectic (or at least volume preserving) and reversible. The difference \(H(\theta ^\prime ,p^\prime )-H(\theta ,p)\) in (2) is the energy error in the integration; it would vanish, leading to \(a=1\), if the integration were exact.

Splitting

Splitting is the most common approach to derive symplectic integrators for Hamiltonian systems (Blanes and Casas 2017; Sanz-Serna and Calvo 1994). The Hamiltonian H of the problem is decomposed into partial Hamiltonians as \(H=H_1+H_2\) in such a way that the Hamiltonian systems with Hamiltonian functions \(H_1\) and \(H_2\) may both be integrated in closed form. When Strang splitting is used, if \(\varphi ^{[H_1]}_t,\varphi ^{[H_2]}_t\) denote the maps (flows) in \({\mathbb {R}}^{2d}\) that advance the exact solution of the partial Hamiltonians over a time-interval of length t, the recipe

$$\begin{aligned} \psi _{\epsilon }=\varphi ^{[H_1]}_{\epsilon /2}\circ \varphi ^{[H_2]}_{\epsilon }\circ \varphi ^{[H_1]}_{\epsilon /2}, \end{aligned}$$
(3)

defines the map that advances the numerical solution a timestep of length \(\epsilon >0\). The numerical integration to get a proposal may then be carried out up to time \(T=\epsilon L\) with the L-fold composition \(\Psi _T=\left( \psi _{\epsilon }\right) ^L\). Regardless of the choice of \(H_1\) and \(H_2\), (3) is a symplectic, time reversible integrator of second order of accuracy (Bou-Rabee and Sanz-Serna 2018).

Kinetic/potential splitting

The splitting \(H = H_1+H_2\), \(H_1= {\mathcal {T}}\), \(H_2=U\) gives rise, via (3), to the commonest integrator in HMC: the Störmer/leapfrog/velocity Verlet algorithm. The differential equations for the partial Hamiltonians \({\mathcal {T}}\), U and the corresponding solution flows are

$$\begin{aligned} \frac{d{}}{d{t}}\begin{pmatrix}\theta \\ p\end{pmatrix}&\!=\!\begin{pmatrix}0\\ -\nabla U(\theta )\end{pmatrix}\!\implies \!\varphi _{\epsilon }^{[U]}(\theta ,p)=(\theta ,p-\epsilon \nabla U(\theta )),\\ \frac{d{}}{d{t}}\begin{pmatrix}\theta \\ p\end{pmatrix}&\!=\!\begin{pmatrix}M^{-1}p\\ 0\end{pmatrix}\!\implies \! \varphi _{\epsilon }^{[{\mathcal {T}}]}(\theta ,p)=(\theta +\epsilon M^{-1}p,p). \end{aligned}$$

As a mnemonic, we shall use the word kick to refer to the map \(\varphi _{\epsilon }^{[U]}(\theta ,p)\) (the system is kicked so that the momentum p varies without changing \(\theta \)). The word drift will refer to the map \(\varphi _{\epsilon }^{[{\mathcal {T}}]}(\theta ,p)\) (\(\theta \) drifts with constant velocity). Thus one timestep of the velocity Verlet algorithm reads (kick-drift-kick).

$$\begin{aligned} \psi _{\epsilon }^{[KDK]}=\varphi _{\epsilon /2}^{[U]}\circ \varphi ^{[{\mathcal {T}}]}_{\epsilon }\circ \varphi _{\epsilon /2}^{[U]}. \end{aligned}$$

There is of course a position Verlet algorithm obtained by interchanging the roles of \({\mathcal {T}}\) and U. One timestep is given by a sequence drift-kick-drift (DKD). Generally the velocity Verlet (KDK) version is preferred (see (Bou-Rabee and Sanz-Serna 2018) for a discussion) and we shall not be concerned hereafter with the position variant.

With any integrator of the Hamiltonian equations, the length \(\epsilon L=T\) of the time interval for the integration to get a proposal has to be determined to ensure that the proposal is sufficiently far from the current state of the Markov chain, so that the correlation between successive samples is not too high and the phase space is well explored (Hoffman and Gelman 2014; Bou-Rabee and Sanz-Serna 2017). For fixed T, smaller stepsizes \(\epsilon \) lead to fewer rejections but also to larger computational cost per integration, and it is known that HMC is most efficient when the empirical acceptance rate is approximately \(65\%\) (Beskos et al. 2013).

Algorithm 1 describes the computation to advance a single step of the Markov chain with HMC based on the velocity Verlet (KDK) integrator. In the absence of additional information, it is standard practice to choose \(M = I\), the identity matrix. For later reference, we draw attention to the randomization of the timestep \(\epsilon \). As is well known, without such a randomization, HMC may not be ergodic (Neal 2011); this will happen for instance when the equations of motion (1) have periodic solutions and \(\epsilon L\) coincides with the period of the solution.
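The transition just described can be sketched as follows. This is a minimal illustration assuming NumPy, not a transcription of Algorithm 1: the function names `kdk_step` and `hmc_transition` are hypothetical, the number of steps L is passed in by the caller, and the randomisation \(\epsilon \sim {\bar{\epsilon }}\times {\mathcal {U}}_{[0.8,1]}\) is the one quoted for the experiments in Section 5.

```python
import numpy as np

def kdk_step(theta, p, eps, grad_U, M_inv):
    """One velocity Verlet (kick-drift-kick) step for H = p^T M^{-1} p / 2 + U."""
    p = p - 0.5 * eps * grad_U(theta)      # half kick with U
    theta = theta + eps * (M_inv @ p)      # full drift with T
    p = p - 0.5 * eps * grad_U(theta)      # half kick with U
    return theta, p

def hmc_transition(theta, U, grad_U, M, eps_bar, L, rng):
    """One Markov transition: refresh p, integrate L steps, accept/reject as in (2)."""
    M_inv = np.linalg.inv(M)
    eps = eps_bar * rng.uniform(0.8, 1.0)  # randomised stepsize (assumed range)
    p = rng.multivariate_normal(np.zeros(len(theta)), M)
    H_old = 0.5 * p @ M_inv @ p + U(theta)
    theta_new, p_new = theta, p
    for _ in range(L):
        theta_new, p_new = kdk_step(theta_new, p_new, eps, grad_U, M_inv)
    H_new = 0.5 * p_new @ M_inv @ p_new + U(theta_new)
    # Accept with probability min(1, exp(-(H_new - H_old))).
    accept = np.log(rng.uniform()) < -(H_new - H_old)
    return (theta_new, True) if accept else (theta, False)
```

In a production code the two half kicks of consecutive steps would normally be merged, so that each step costs a single gradient evaluation.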

Algorithm 1: one HMC transition based on the velocity Verlet (KDK) integrator

Alternative splittings of the Hamiltonian

Splitting \(H(\theta ,p)\) in its kinetic and potential parts as in Verlet is not the only meaningful possibility. In many applications, \(U(\theta )\) may be written as \(U_0(\theta )+U_1(\theta )\) in such a way that the equations of motion for the Hamiltonian function \(H_0(\theta ,p)=(1/2)p^TM^{-1}p+U_0(\theta )\) may be integrated in closed form and then one may split H as

$$\begin{aligned} H = H_0+U_1,\qquad \end{aligned}$$
(4)

as discussed in e.g. (Neal 2011; Shahbaba et al. 2014).

In this paper we focus on the important particular case where (see Section 5 and Appendix A)

$$\begin{aligned} U_0(\theta ) = \frac{1}{2} (\theta -\theta ^*)^T {\mathcal {J}} (\theta -\theta ^*), \end{aligned}$$
(5)

for some fixed \(\theta ^*\in {\mathbb {R}}^d\) and a constant symmetric, positive definite matrix \({\mathcal {J}}\). Restricting for the time being attention to the case where the mass matrix M is the identity (the only situation considered in (Shahbaba et al. 2014)), the equations of motion and solution flow for the Hamiltonian

$$\begin{aligned} H_0(\theta ,p)=\frac{1}{2}p^Tp+U_0(\theta ) \end{aligned}$$
(6)

are

$$\begin{aligned}&\frac{d{}}{d{t}}\begin{pmatrix}\theta \\ p\end{pmatrix}= \begin{pmatrix}0 &{} I\\ -{\mathcal {J}} &{} 0\end{pmatrix}\begin{pmatrix}\theta -\theta ^*\\ p\end{pmatrix},\nonumber \\&\quad \varphi _t^{[H_0]}(\theta ,p)=\exp \left( t\begin{pmatrix}0 &{} I\\ -{\mathcal {J}} &{} 0\end{pmatrix}\right) \begin{pmatrix}\theta -\theta ^*\\ p\end{pmatrix}+\begin{pmatrix}\theta ^*\\ 0\end{pmatrix}. \end{aligned}$$
(7)

If we write \({\mathcal {J}} = Z^TDZ\), with Z orthogonal and D diagonal with positive diagonal elements, then the exponential map in Eq. (7) is

$$\begin{aligned}&\exp \left( t\begin{pmatrix}0 &{} I\\ -{\mathcal {J}} &{} 0\end{pmatrix}\right) =\begin{pmatrix}Z^T &{} 0\\ 0 &{} Z^T\end{pmatrix}e^{t\Lambda }\begin{pmatrix}Z &{} 0\\ 0 &{} Z\end{pmatrix},\nonumber \\&\quad e^{t\Lambda }=\begin{pmatrix}\cos (t\sqrt{D}) &{} D^{-1/2}\sin (t\sqrt{D}) \\ -D^{1/2}\sin (t\sqrt{D}) &{} \cos (t\sqrt{D}) \end{pmatrix}. \end{aligned}$$
(8)

In view of the expression for \(\exp (t\Lambda )\), we will refer to the flow of \(H_0\) as a rotation.

Choosing in (3) \(U_1\) and \(H_0\) for the roles of \(H_1\) and \(H_2\) (or vice versa) gives rise to the integrators

$$\begin{aligned}&\psi _{\epsilon }^{[KRK]}=\varphi _{\epsilon /2}^{[U_1]}\circ \varphi _{\epsilon }^{[H_0]}\circ \varphi _{\epsilon /2}^{[U_1]},\nonumber \\&\quad \psi _{\epsilon }^{[RKR]}= \varphi _{\epsilon /2}^{[H_0]}\circ \varphi _{\epsilon }^{[U_1]}\circ \varphi _{\epsilon /2}^{[H_0]}, \end{aligned}$$
(9)

where one advances the solution over a single timestep by using a kick-rotate-kick (KRK) or rotate-kick-rotate (RKR) pattern (of course the kicks are based on the potential function \(U_1\)). The HMC algorithm with the KRK map in (9) is shown in Algorithm 2, where the prefix Uncond, to be discussed later, indicates that the mass matrix being used is \(M=I\). The algorithm for the RKR sequence in (9) is obtained by a slight reordering of a few lines of code and is not shown. Algorithm 2 (but not its RKR counterpart) was tested in (Shahbaba et al. 2014) (see Footnote 1).
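As an illustration of (7), (8) and (9), the following minimal sketch (assuming NumPy and \(M=I\)) builds the rotation from an eigendecomposition of \({\mathcal {J}}\) and composes it with kicks in both orders. The names `make_rotation`, `krk_step` and `rkr_step` are hypothetical; note that `numpy.linalg.eigh` returns \({\mathcal {J}} = V\,\mathrm {diag}(d)\,V^T\), so that \(Z=V^T\) in the notation of (8).

```python
import numpy as np

def make_rotation(J, theta_star):
    """Return the exact flow of H_0 = p^T p / 2 + U_0, following (7)-(8)."""
    d_vals, V = np.linalg.eigh(J)          # J = V diag(d_vals) V^T, i.e. Z = V^T
    w = np.sqrt(d_vals)                    # diagonal of sqrt(D)

    def rotate(theta, p, t):
        x, y = V.T @ (theta - theta_star), V.T @ p   # change to eigen-coordinates
        c, s = np.cos(t * w), np.sin(t * w)
        x_new = c * x + (s / w) * y
        y_new = -(w * s) * x + c * y
        return V @ x_new + theta_star, V @ y_new     # back to original coordinates
    return rotate

def krk_step(theta, p, eps, grad_U1, rotate):
    """Kick-rotate-kick map in (9)."""
    p = p - 0.5 * eps * grad_U1(theta)
    theta, p = rotate(theta, p, eps)
    p = p - 0.5 * eps * grad_U1(theta)
    return theta, p

def rkr_step(theta, p, eps, grad_U1, rotate):
    """Rotate-kick-rotate map in (9)."""
    theta, p = rotate(theta, p, 0.5 * eps)
    p = p - eps * grad_U1(theta)
    theta, p = rotate(theta, p, 0.5 * eps)
    return theta, p
```

The eigendecomposition is computed once, before sampling begins, and reused in every step.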

Algorithm 2: HMC with the KRK integrator (prefix Uncond, \(M=I\))

Since the numerical integration in Algorithm 2 would be exact if \(U_1\) vanished (leading to acceptance of all proposals), the algorithm is appealing in cases where \(U_1\) is “small” with respect to \(H_0\). In some applications, a decomposition \(U=U_0+U_1\) with small \(U_1\) may suggest itself. For a “general” U one may always define \(U_0\) by choosing \(\theta ^*\) to be one of the modes of the target \(\propto \exp (-U(\theta ))\) and \({\mathcal {J}}\) the Hessian of U evaluated at \(\theta ^\star \); in this case the success of the splitting hinges on how well U may be approximated by its second-order Taylor expansion \(U_0\) around \(\theta ^\star \). In that setting, \(\theta ^*\) would typically have to be found numerically by minimizing U. Also Z and D would typically be derived by numerical approximation, thus leading to computational overheads for Algorithm 2 not present in Algorithm 1. However, as pointed out in (Shahbaba et al. 2014), the cost of computing \(\theta ^\star \), Z and D before the sampling begins is, for the test problems to be considered in this paper, negligible when compared with the cost of obtaining the samples.

Nesting

When a decomposition \(U=U_0+U_1\), with \(U_1\) small, is available but the Hamiltonian system with Hamiltonian \(H_0 = {\mathcal {T}}+U_0\) cannot be integrated in closed form, one may still construct schemes based on the recipe (3). One step of the integrator is defined as

$$\begin{aligned} \varphi _{\epsilon /2}^{[U_1]}\circ \left( \varphi _{\epsilon /2k}^{[U_0]}\circ \varphi ^{[{\mathcal {T}}]}_{\epsilon /k}\circ \varphi _{\epsilon /2k}^{[U_0]}\right) ^{k}\circ \varphi _{\epsilon /2}^{[U_1]}, \end{aligned}$$
(10)

where k is a suitably large integer. Here the (intractable) exact flow of \(H_0\) is numerically approximated by KDK Verlet using k substeps of length \(\epsilon /k\). In this way, kicks with the small \(U_1\) are performed with a stepsize \(\epsilon /2\) and kicks with the large \(U_0\) benefit from the smaller stepsize \(\epsilon /(2k)\). This idea has been successfully used in Bayesian applications in (Shahbaba et al. 2014), where it is called “nested Verlet”. The small \(U_1\) is obtained summing over data points that contribute little to the loglikelihood and the contributions from the most significant data are included in \(U_0\).
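One step of (10) may be sketched as below. This is a minimal illustration assuming NumPy; the function name `nested_step` and its argument names are hypothetical, and the gradients of \(U_0\) and \(U_1\) are supplied by the caller.

```python
import numpy as np

def nested_step(theta, p, eps, k, grad_U0, grad_U1, M_inv):
    """One step of (10): outer kicks with U_1, k inner KDK substeps for T + U_0."""
    p = p - 0.5 * eps * grad_U1(theta)        # outer half kick, stepsize eps/2
    h = eps / k                               # inner substep length eps/k
    for _ in range(k):
        p = p - 0.5 * h * grad_U0(theta)      # inner half kick, stepsize eps/(2k)
        theta = theta + h * (M_inv @ p)       # inner drift, stepsize eps/k
        p = p - 0.5 * h * grad_U0(theta)      # inner half kick, stepsize eps/(2k)
    p = p - 0.5 * eps * grad_U1(theta)        # outer half kick, stepsize eps/2
    return theta, p
```

Each step uses two evaluations of \(\nabla U_1\) (one if consecutive outer kicks are merged) and of the order of k evaluations of \(\nabla U_0\), so the scheme pays off when \(\nabla U_1\) is the expensive part.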

Integrators similar to (10) have a long history in molecular dynamics, where they are known as multiple timestep algorithms (Tuckerman et al. 1992; Leimkuhler and Matthews 2015; Grubmüller et al. 1991).

Shortcomings of the unconditioned KRK and RKR samplers

As we observed above, Algorithm 2 is appealing when \(U_1\) is a small perturbation of the quadratic Hamiltonian \(H_0\). In particular, one would expect that since the numerical integration in Algorithm 2 is exact when \(U_1\) vanishes, then this algorithm may be operated with stepsizes \(\epsilon \) chosen solely in terms of the size of \(U_1\), independently of \(H_0\). If that were the case one would expect that Algorithm 2 may work well with large \(\epsilon \) in situations where Algorithm 1 requires \(\epsilon \) small and therefore much computational effort. Unfortunately those expectations are not well founded, as we shall show next by means of an example.

We study the model Hamiltonian with \(\theta ,p\in {\mathbb {R}}^2\) given by

$$\begin{aligned}&H(\theta ,p)= H_0(\theta ,p)+U_1(\theta ),\nonumber \\&\quad H_0 = \frac{1}{2}p^Tp+\frac{1}{2}\theta ^T\begin{pmatrix}\sigma _1^{-2}&{} 0\\ 0 &{} \sigma _2^{-2}\end{pmatrix}\theta ,\qquad U_1= \frac{\kappa }{2} \theta ^T\theta .\nonumber \\ \end{aligned}$$
(11)

The model is restricted to \({\mathbb {R}}^2\) just for notational convenience; the extension to \({\mathbb {R}}^d\) is straightforward. The quadratic Hamiltonian \(H_0\) is rather general: any Hamiltonian system with quadratic Hamiltonian \((1/2)p^TM^{-1}p+(1/2)\theta ^TW\theta \) may be brought with a change of variables to a system with Hamiltonian of the form \((1/2)p^Tp+(1/2)\theta ^TD\theta \), with M, W symmetric, positive definite matrices and D diagonal and positive definite (Blanes et al. 2014; Bou-Rabee and Sanz-Serna 2018). In (11), \(\sigma _1\) and \(\sigma _2\) are the standard deviations of the bivariate Gaussian distribution with density \(\propto \exp (-U_0(\theta ))\) (i.e. of the target in the unperturbed situation \(U_1=0\)). We choose the labels of the scalar components \(\theta _1\) and \(\theta _2\) of \(\theta \) to ensure \(\sigma _1\le \sigma _2\) so that, for the probability density \(\propto \exp (-U_0(\theta ))\), \(\theta _1\) is more constrained than \(\theta _2\). In addition, we assume that \(\kappa \) is small with respect to \(\sigma ^{-2}_1\) and \(\sigma ^{-2}_2\), so that in (11) \(U_1\) is a small perturbation of \(H_0\). The Hamiltonian equations of motion for \(\theta _i\), given by \(\frac{d{}}{d{t}} \theta _i = p_i\), \(\frac{d{}}{d{t}} p_i = -\omega _i^2 \theta _i\), with \(\omega _i=(\sigma _i^{-2}+\kappa )^{1/2}\approx \sigma _i^{-1}\), yield \(\frac{{d}^{2}}{{dt}^{2}} \theta _i+\omega _i^2 \theta _i =0\). Thus the dynamics of \(\theta _1\) and \(\theta _2\) correspond to two uncoupled harmonic oscillators; the component \(\theta _i\), \(i=1,2\), oscillates with an angular frequency \(\omega _i\) (or with a period \(2\pi /\omega _i\)).

We note that, regardless of the integrator being used, the correlation between the proposal and the current state of the Markov chain will be large if the integration is carried out over a time interval \(T = \epsilon L\) much smaller than the periods \(2\pi /\omega _i\) of the harmonic oscillators (Neal 2011; Bou-Rabee and Sanz-Serna 2017). Since \(2\pi /\omega _2\) is the longer of the two periods, L then has to be chosen so that

$$\begin{aligned} L \ge \frac{C}{\epsilon \omega _2} \approx \frac{C\sigma _2}{\epsilon }, \end{aligned}$$
(12)

where C denotes a constant of moderate size. For instance, for the choice \(C = \pi /2\), the proposal for \(\theta _2\) is uncorrelated at stationarity with the current state of the Markov chain as discussed in e.g. (Bou-Rabee and Sanz-Serna 2017).

For the KDK Verlet integrator, it is well known that, for stability reasons (Neal 2011; Bou-Rabee and Sanz-Serna 2018), the integration has to be operated with a stepsize \(\epsilon < 2/ \mathrm{max}(\omega _1,\omega _2)\), leading to a stability limit

$$\begin{aligned} \epsilon \approx 2\sigma _1; \end{aligned}$$
(13)

integrations with larger \(\epsilon \) will lead to extremely inaccurate numerical solutions. This stability restriction originates from \(\theta _1\), the component with greater precision in the Gaussian distribution \(\propto \exp (-U_0)\). Combining (13) with (12) we conclude that, for Verlet, the number of timesteps L has to be chosen larger than a moderate multiple of \(\sigma _2/\sigma _1\). Therefore when \(\sigma _1\ll \sigma _2\) the computational cost of the Verlet integrator will necessarily be very large. Note that the inefficiency arises when the sizes of \(\sigma _1\) and \(\sigma _2\) are widely different; the first sets an upper bound for the stepsize and the second a lower bound on the length \(\epsilon L\) of the integration interval. Small or large values of \(\sigma _1\) and \(\sigma _2\) are not dangerous per se if \(\sigma _2/\sigma _1\) is moderate.
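The bound \(\epsilon <2/\omega _i\) can be verified directly on the model problem. Composing a half kick, a drift and a half kick for the i-th oscillator, one timestep of KDK Verlet reads

$$\begin{aligned} \begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix} \leftarrow \begin{pmatrix} 1-\epsilon ^2\omega _i^2/2 &{} \epsilon \\ -\epsilon \omega _i^2(1-\epsilon ^2\omega _i^2/4) &{} 1-\epsilon ^2\omega _i^2/2 \end{pmatrix} \begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix}. \end{aligned}$$

The matrix has unit determinant, so its powers remain bounded when the half-trace satisfies \(|1-\epsilon ^2\omega _i^2/2|<1\), that is, when \(0<\epsilon \omega _i<2\). Imposing this for both components gives \(\epsilon < 2/\max (\omega _1,\omega _2)\approx 2\sigma _1\).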

We now turn to the KRK integrator in (9). For the i-th scalar component of \((\theta ,p)\), a timestep of the KRK integrator reads

$$\begin{aligned}&\begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix} \leftarrow \begin{pmatrix} 1 &{} 0\\ -{\epsilon \kappa }/{2} &{} 1 \end{pmatrix} \begin{pmatrix} \cos (\epsilon /\sigma _i) &{} \sigma _i\sin (\epsilon /\sigma _i)\\ -\sigma _i^{-1}\sin (\epsilon /\sigma _i) &{} \cos (\epsilon /\sigma _i) \end{pmatrix}\\&\quad \begin{pmatrix} 1 &{} 0\\ -{\epsilon \kappa }/{2} &{} 1 \end{pmatrix} \begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix} \end{aligned}$$

or

$$\begin{aligned}&\begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix} \leftarrow \begin{pmatrix} \cos (\epsilon /\sigma _i) -(\epsilon \sigma _i\kappa /2)\sin (\epsilon /\sigma _i)&{} \sigma _i\sin (\epsilon /\sigma _i)\\ (\epsilon ^2\sigma _i\kappa ^2/4)\sin (\epsilon /\sigma _i)-\sigma _i^{-1}\sin (\epsilon /\sigma _i)-\epsilon \kappa \cos (\epsilon /\sigma _i) &{} \cos (\epsilon /\sigma _i) -(\epsilon \sigma _i\kappa /2)\sin (\epsilon /\sigma _i) \end{pmatrix} \begin{pmatrix}\theta _i\\ p_{i}\end{pmatrix}. \end{aligned}$$

Stability is equivalent to \(|\cos (\epsilon /\sigma _i) -(\epsilon \sigma _i\kappa /2)\sin (\epsilon /\sigma _i)|<1\), which, for \(\kappa > 0\), gives \( 2\cot (\epsilon /(2\sigma _i))>\epsilon \kappa \sigma _i \). From here it is easily seen that stability in the i-th component is lost for \(\epsilon / \sigma _i\approx \pi \) for arbitrarily small \(\kappa > 0\). Thus the KRK stability limit is

$$\begin{aligned} \epsilon \approx \pi \sigma _1. \end{aligned}$$
(14)

While this is less restrictive than (13), we see that stability imposes an upper bound for \(\epsilon \) in terms of \(\sigma _1\), just as for Verlet. From (12), the KRK integrator, just like Verlet, will have a large computational cost when \(\sigma _1\ll \sigma _2\). This is in spite of the fact that the integrator would be exact for \(\kappa =0\), regardless of the values of \(\sigma _1\), \(\sigma _2\).

For the RKR integrator a similar analysis shows that the stability limit is also given by (14); therefore that integrator suffers from the same shortcomings as KRK.
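The stability criterion for both integrators can be checked numerically with a few lines of plain Python. The sketch below builds the \(2\times 2\) one-step matrices for a single component of the model problem (11) and tests whether the half-trace is smaller than one in absolute value; all function names are hypothetical and the parameter values in the usage note are chosen only for illustration.

```python
import math

def matmul2(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def kick(t, kappa):
    """Flow of U_1 = kappa * theta^2 / 2 over time t."""
    return [[1.0, 0.0], [-t * kappa, 1.0]]

def rotation(t, sigma):
    """Flow of H_0 = p^2 / 2 + theta^2 / (2 sigma^2) over time t."""
    c, s = math.cos(t / sigma), math.sin(t / sigma)
    return [[c, sigma * s], [-s / sigma, c]]

def krk(eps, sigma, kappa):
    return matmul2(kick(eps / 2, kappa),
                   matmul2(rotation(eps, sigma), kick(eps / 2, kappa)))

def rkr(eps, sigma, kappa):
    return matmul2(rotation(eps / 2, sigma),
                   matmul2(kick(eps, kappa), rotation(eps / 2, sigma)))

def half_trace(step):
    return 0.5 * (step[0][0] + step[1][1])

def is_stable(step):
    """A unit-determinant 2x2 map has bounded powers if |trace / 2| < 1."""
    return abs(half_trace(step)) < 1.0
```

For instance, with \(\sigma _i=1\) and \(\kappa =0.1\), both `is_stable(krk(0.5 * math.pi, 1.0, 0.1))` and its RKR counterpart hold, while both fail at `eps = 0.95 * math.pi`. The two maps always share the same half-trace because the trace is invariant under cyclic permutations of the factors.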

We also note that, since as k increases the nested integrator (10) approximates the KRK integrator, the counterexample above may be used to show that the nested integrator has to be operated with a stepsize \(\epsilon \) that is limited by the smallest standard deviations present in \(U_0\), as is the case for Verlet, KRK and RKR. For the stability of (10) and related multiple timestep techniques, the reader is referred to (García-Archilla et al. 1998) and its references. The nested integrator will not be considered further in this paper.

Preconditioning

As pointed out above, without additional information on the target, it is standard to set \(M = I\). When \(U =U_0+U_1\), with \(U_0\) as in (5), it is useful to consider a preconditioned Hamiltonian with \(M = {\mathcal {J}}\):

$$\begin{aligned}&H^{[precond]}(\theta ,p)=\frac{1}{2}p^T{\mathcal {J}}^{-1}p+U(\theta )\nonumber \\&=\frac{1}{2}p^T{\mathcal {J}}^{-1}p+\frac{1}{2}(\theta -\theta ^*)^T{\mathcal {J}}(\theta -\theta ^*)+U_1(\theta ) . \end{aligned}$$
(15)

Preconditioning is motivated by the observation that the equations of motion for the Hamiltonian

$$\begin{aligned} H_0^{[precond]}(\theta ,p)=\frac{1}{2}p^T{\mathcal {J}}^{-1}p +\frac{1}{2}(\theta -\theta ^*)^T{\mathcal {J}}(\theta -\theta ^*), \end{aligned}$$

given by \(\frac{d{}}{d{t}} \theta = {\mathcal {J}}^{-1} p\), \(\frac{d{}}{d{t}} p = -{\mathcal {J}}(\theta -\theta ^\star )\), yield \(\frac{{d}^{2}}{{dt}^{2}} (\theta -\theta ^\star ) + (\theta -\theta ^\star )=0\). Thus we now have d uncoupled scalar harmonic oscillators (one for each scalar component \(\theta _i-\theta _i^\star \)) sharing a common oscillation frequency \(\omega =1\) (see Footnote 2). This is to be compared with the situation for (6), where, as we have seen in the model (11), the frequencies are the reciprocals \(1/\sigma _i\) of the standard deviations of the distribution \(\propto \exp (-U_0(\theta ))\). Since, as we saw in Section 3, it is the differences in size of the frequencies of the harmonic oscillators that cause the inefficiency of the integrators, choosing the mass matrix to ensure that all oscillators have the same frequency is of clear interest. We call unconditioned those Hamiltonians/integrators where the mass matrix is chosen as the identity agnostically without specializing it to the problem.

For reasons explained in (Beskos et al. 2011) it is better, when \({\mathcal {J}}\) has widely different eigenvalues, to numerically integrate the preconditioned equations of motion after rewriting them with the variable \(v =M^{-1}p = {\mathcal {J}}^{-1}p\) replacing p. The differential equations and solution flows of the subproblems are then given by

$$\begin{aligned}&\frac{d{}}{d{t}}\begin{pmatrix}\theta \\ v\end{pmatrix}=\begin{pmatrix}0\\ -{\mathcal {J}}^{-1}\nabla _\theta U_1(\theta )\end{pmatrix}\\&\quad \implies \varphi _t^{[U_1]}(\theta ,v)=\begin{pmatrix}\theta \\ v-t{\mathcal {J}}^{-1}\nabla _\theta U_1(\theta )\end{pmatrix}, \end{aligned}$$

and

$$\begin{aligned}&\frac{d{}}{d{t}}\begin{pmatrix}\theta \\ v\end{pmatrix}=\begin{pmatrix}0 &{} I\\ -I &{} 0\end{pmatrix}\begin{pmatrix}(\theta -\theta ^*)\\ v\end{pmatrix}\\&\quad \implies \varphi _t^{[H^{[precond]}_0]}(\theta ,v)\\&\qquad =\begin{pmatrix}\cos (t) &{} \sin (t)\\ -\sin (t) &{} \cos (t)\end{pmatrix}\begin{pmatrix}(\theta -\theta ^*)\\ v\end{pmatrix}+\begin{pmatrix}\theta ^*\\ 0\end{pmatrix}. \end{aligned}$$

Since \({\mathcal {J}}\) is a symmetric, positive definite matrix, it admits a Cholesky factorisation \({\mathcal {J}}=BB^T\). The inversion of \({\mathcal {J}}\) in the kick may thus be performed efficiently using Cholesky-based solvers from standard linear algebra libraries. The factorisation also makes it easy to draw \(v\sim B^{-T}{\mathcal {N}}(0,I)\).

Composing the exact maps \(\varphi _{\epsilon }^{[.]}\) using Strang’s recipe (3) then gives a numerical one-step map \(\psi _\epsilon ^{[.]}\) in either an RKR or KRK form. The preconditioned KRK (PrecondKRK) algorithm is shown in Algorithm 3; the RKR version is similar and will not be given.
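A minimal sketch of the preconditioned building blocks in the variables \((\theta ,v)\), composed in the RKR order, is given below. It assumes NumPy; the name `make_precond_tools` and the returned closures are hypothetical. For brevity the triangular systems are solved with the general-purpose `numpy.linalg.solve`; a dedicated triangular or Cholesky solver would exploit the structure of B.

```python
import numpy as np

def make_precond_tools(J, theta_star, grad_U1):
    B = np.linalg.cholesky(J)                  # J = B B^T, B lower triangular

    def solve_J(g):
        """J^{-1} g via the two triangular systems B y = g, B^T x = y."""
        return np.linalg.solve(B.T, np.linalg.solve(B, g))

    def draw_v(rng):
        """v = B^{-T} z with z ~ N(0, I), so that v ~ N(0, J^{-1})."""
        return np.linalg.solve(B.T, rng.standard_normal(J.shape[0]))

    def kick(theta, v, t):
        return theta, v - t * solve_J(grad_U1(theta))

    def rotate(theta, v, t):
        c, s = np.cos(t), np.sin(t)
        x = theta - theta_star
        return theta_star + c * x + s * v, -s * x + c * v

    def rkr_step(theta, v, eps):
        theta, v = rotate(theta, v, 0.5 * eps)
        theta, v = kick(theta, v, eps)
        return rotate(theta, v, 0.5 * eps)

    return draw_v, rkr_step
```

In these variables the energy used in the accept/reject step is \((1/2)v^T{\mathcal {J}}v+(1/2)(\theta -\theta ^*)^T{\mathcal {J}}(\theta -\theta ^*)+U_1(\theta )\), and the rotation no longer requires an eigendecomposition of \({\mathcal {J}}\).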

Algorithm 3: PrecondKRK

Of course it is also possible to use the KDK Verlet Algorithm 1 with preconditioning (\(M= {\mathcal {J}}\)) (and v replacing p). The resulting algorithm may be seen in Algorithm 4.

Algorithm 4: KDK Verlet with preconditioning (\(M={\mathcal {J}}\))
Fig. 1 Autocorrelation function plots for the slowest moving component associated to the IAC \(\tau _{\max }\) for each dataset. For the unconditioned methods, we show the principled choice B (solid line) and the choice A from (Shahbaba et al. 2014) (dotted). The values of \(\epsilon \) and T are as given in the tables

Applying these algorithms to the model problem (11), an analysis parallel to that carried out in Section 3 shows that the decorrelation condition (12) becomes, independently of \(\sigma _1\) and \(\sigma _2\)

$$\begin{aligned} L \gtrapprox C/ \epsilon \end{aligned}$$

and the stability limits in (13) and (14) are now replaced, also independently of the values of \(\sigma _1\) and \(\sigma _2\), by

$$\begin{aligned} \epsilon \approx 2, \qquad \epsilon \approx \pi , \end{aligned}$$

for Algorithm 4 and Algorithm 3 respectively. The stability limit for the PrecondRKR algorithm coincides with that of the PrecondKRK method. (See also Appendix B.)
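The parallel analysis can be sketched as follows. For the model problem (11) one has \(\theta ^*=0\) and \({\mathcal {J}}=\mathrm {diag}(\sigma _1^{-2},\sigma _2^{-2})\), so that in the variables \((\theta ,v)\) the preconditioned equations of motion for the i-th component are

$$\begin{aligned} \frac{d{}}{d{t}}\theta _i = v_i,\qquad \frac{d{}}{d{t}} v_i = -(1+\kappa \sigma _i^2)\,\theta _i, \end{aligned}$$

with frequencies \(\omega _i=(1+\kappa \sigma _i^2)^{1/2}\approx 1\), because \(\kappa \) is small with respect to \(\sigma _i^{-2}\). The decorrelation condition \(L\ge C/(\epsilon \omega _i)\) and the Verlet restriction \(\epsilon <2/\omega _i\) therefore no longer involve \(\sigma _1\) or \(\sigma _2\). For PrecondKRK the kick acts as \(v_i\leftarrow v_i-(\epsilon /2)\kappa \sigma _i^2\theta _i\) and the rotation has angle \(\epsilon \), so that the stability condition of Section 3 becomes

$$\begin{aligned} 2\cot (\epsilon /2)>\epsilon \kappa \sigma _i^2, \end{aligned}$$

which, for small \(\kappa \sigma _i^2\), fails only as \(\epsilon \) approaches \(\pi \).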

The idea of preconditioning is extremely old; to the best of our knowledge it goes back to (Bennett 1975). The algorithm in (Girolami and Calderhead 2011) may be regarded as a \(\theta \)-dependent preconditioning. For preconditioning in infinite dimensional problems see (Beskos et al. 2011).

Numerical results

In this section we test the following algorithms:

  • Unconditioned Verlet: Algorithm 1 with \(M=I\).

  • Unconditioned KRK: Algorithm 2.

  • Preconditioned Verlet: Algorithm 4.

  • Preconditioned KRK: Algorithm 3.

  • Preconditioned RKR: similar to Algorithm 3 using a rotate-kick-rotate pattern instead of kick-rotate-kick.

The first two algorithms were compared in (Shahbaba et al. 2014) and in fact we shall use the exact same logistic regression test problems used in that reference. If x are the prediction variables and \(y\in \{0,1\}\), the likelihood for the test problems is (\(\widetilde{{x}}=\left[ 1,{x}^T\right] ^T,{\theta }=\left[ \alpha ,{\beta }^T\right] ^T\))

$$\begin{aligned} {\mathcal {L}}(\theta ;x,y)\!=\!\prod _{i=1}^n\left( 1+\exp (-{\theta }^T \widetilde{{x}}_i)\right) ^{-y_i}\left( 1+\exp ({\theta }^T\widetilde{{x}}_i)\right) ^{y_i-1}.\nonumber \\ \end{aligned}$$
(16)

For the preconditioned integrators, we set \(U_0\) as in (5) with \(\theta ^*\) given by the maximum a posteriori (MAP) estimation and \({\mathcal {J}}\) the Hessian at \(\theta ^*\).
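The quantities entering the splitting can be sketched as follows, assuming NumPy. The potential is taken to be the negative log-likelihood (16) plus the negative log-density of an isotropic Gaussian prior; the prior variance argument `prior_var` is an assumption of this sketch (the value 25 is the one quoted below for the simulated data), and the names `make_potential` and `split_potential` are hypothetical. The MAP estimate \(\theta ^*\) is assumed to have been computed beforehand by any numerical optimiser.

```python
import numpy as np

def make_potential(X_tilde, y, prior_var):
    """U = -log likelihood (16) + theta^T theta / (2 prior_var), with derivatives."""
    def sigmoid(z):
        return 0.5 * (1.0 + np.tanh(0.5 * z))      # overflow-free logistic function

    def U(theta):
        z = X_tilde @ theta
        return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * theta @ theta / prior_var

    def grad_U(theta):
        s = sigmoid(X_tilde @ theta)
        return X_tilde.T @ (s - y) + theta / prior_var

    def hess_U(theta):
        s = sigmoid(X_tilde @ theta)
        return (X_tilde.T * (s * (1.0 - s))) @ X_tilde + np.eye(len(theta)) / prior_var

    return U, grad_U, hess_U

def split_potential(U, grad_U, theta_star, J):
    """U_1 = U - U_0 and its gradient, with U_0 as in (5)."""
    U1 = lambda th: U(th) - 0.5 * (th - theta_star) @ J @ (th - theta_star)
    grad_U1 = lambda th: grad_U(th) - J @ (th - theta_star)
    return U1, grad_U1
```

With \({\mathcal {J}}\) set to `hess_U(theta_star)`, the function \(U_1\) collects the terms of third and higher order in the Taylor expansion of U around \(\theta ^*\).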

Table 1 SimData: For methods labelled A, parameters are from (Shahbaba et al. 2014): \(T=0.3\), \({\bar{\epsilon }}_{Verlet}=0.015\), \({\bar{\epsilon }}_{UKRK}=0.03\). For the unconditioned methods labelled B, \(T=\pi /(2\omega _{\min })=0.6\), and \({\bar{\epsilon }}_{Verlet}=0.015\), \({\bar{\epsilon }}_{UKRK}=0.03\). For the preconditioned methods, \(T=\pi /2\), and \({\bar{\epsilon }}_{Verlet}=T/3\approx 0.52\); the other preconditioned methods operate with \({\bar{\epsilon }}_{Precon}=T\approx 1.57\)

For the two unconditioned integrators, we use the values of L and \(\epsilon \) chosen in (Shahbaba et al. 2014) (this choice is labelled as A in the tables). Since in many cases the autocorrelation for the unconditioned methods is extremely large with those parameter values (see Fig. 1), we also present results for these methods with a principled choice of T and \(\epsilon \) (labelled as B in the tables). We take \(T=\epsilon L={\pi }/({2\omega _{\min }})\), where \(\omega _{\min }\) is the minimum eigenvalue of \(\sqrt{D}\) given in Eq. (8). In the case where the perturbation \(U_1\) is absent, this choice of T would decorrelate the least constrained component of \(\theta \). We then set \(\epsilon \) as large as possible while ensuring an acceptance rate above 65% (Beskos et al. 2013). The stepsizes in the choice B are slightly smaller than the values used in (Shahbaba et al. 2014), and the durations T are, for every dataset, larger. We are thus able to attain greater decorrelation, although at greater cost. For the preconditioned methods, we set \(T=\pi /2\), since this gives samples with 0 correlation in the case \(U_1=0\), and then set the timestep \(\epsilon \) as large as possible whilst ensuring the acceptance rate is above 65%.

In every experiment we start the chain from the (numerically calculated) MAP estimate \(\theta ^\star \) of \(\theta \) and acquire \(N_s=5\times 10^4\) samples. The autocorrelation times reported are calculated using the emcee function integrated_time with the default value \(c=5\) (Foreman-Mackey et al. 2013). We also estimated autocorrelation times using alternative methods (Geyer 1992; Neal 1993; Sokal 1997; Thompson 2010); the results obtained do not differ significantly from those reported in the tables.

Finally, note that the values of \({{\bar{\epsilon }}}\) quoted in the tables are the maximum timesteps that the algorithms operate with, since the randomisation follows \(\epsilon \sim {\bar{\epsilon }}\times {\mathcal {U}}_{[0.8,1]}\). All code is available from the GitHub repository https://github.com/lshaw8317/SplitHMCRevisited.
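The choice B of the integration time and the stepsize randomisation can be sketched as below, assuming NumPy. The function names are hypothetical, and the sketch does not reproduce the tuning of \({\bar{\epsilon }}\) against the acceptance rate.

```python
import numpy as np

def integration_time(J):
    """T = pi / (2 * omega_min), with omega_min the smallest eigenvalue of sqrt(D)."""
    omega_min = np.sqrt(np.linalg.eigvalsh(J).min())
    return np.pi / (2.0 * omega_min)

def randomised_stepsize(eps_bar, rng):
    """Draw eps = eps_bar * u with u uniform on [0.8, 1]."""
    return eps_bar * rng.uniform(0.8, 1.0)
```

For an unperturbed Gaussian target the slowest component evolves as \(\cos (\omega _{\min }t)\) times its initial value plus a term proportional to the independent momentum, so at \(t=T\) the cosine vanishes and the proposal for that component is uncorrelated with the current state.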

Table 2 StatLog: For runs labelled A, parameters are from (Shahbaba et al. 2014): \(T=1.6\), \({\bar{\epsilon }}_{Verlet}=0.08\), \({\bar{\epsilon }}_{UKRK}=0.114\). For the unconditioned runs labelled B, \(T=\pi /(2\omega _{\min })=3.26\), and \({\bar{\epsilon }}_{Verlet}=0.08\), \({\bar{\epsilon }}_{UKRK}=0.114\). For the preconditioned methods, \(T=\pi /2\), and \({\bar{\epsilon }}_{Verlet}=T/3\); the other preconditioned methods operate with \({\bar{\epsilon }}_{Precon}=T/2\).
Table 3 CTG: For runs labelled A, parameters are from (Shahbaba et al. 2014): \(T=1.6\), \({\bar{\epsilon }}_{Verlet}=0.08\), \({\bar{\epsilon }}_{UKRK}=0.123\). For the unconditioned runs labelled B, \(T=\pi /(2\omega _{\min })=7.85\), and \({\bar{\epsilon }}_{Verlet}=0.08\), \({\bar{\epsilon }}_{UKRK}=0.118\). For the preconditioned methods, \(T=\pi /2\), and \({\bar{\epsilon }}=T/2\).
Table 4 Chess: For runs labelled A, parameters are from (Shahbaba et al. 2014): \(T=1.8\), \({\bar{\epsilon }}_{Verlet}=0.09\), \({\bar{\epsilon }}_{UKRK}=0.2\). For the unconditioned runs labelled B, \(T=\pi /(2\omega _{\min })=5.71\), and \({\bar{\epsilon }}_{Verlet}=0.087\), \({\bar{\epsilon }}_{UKRK}=0.142\). For the preconditioned methods, \(T=\pi /2\), and \({\bar{\epsilon }}=T/2\).

Simulated data

We generate simulated data following the procedure and parameter values described in (Shahbaba et al. 2014). The first step is to generate \({x}\sim {\mathcal {N}}(0,{\sigma }^2)\) with \({\sigma }^2=\mathrm {diag}\left\{ \sigma _j^2: j=1,\ldots ,d-1\right\} \), where

$$\begin{aligned} \sigma ^2_j={\left\{ \begin{array}{ll} 25 &{} j\le 5\\ 1 &{} 5<j\le 10\\ 0.04 &{} j>10 \end{array}\right. }. \end{aligned}$$

Then, we generate the true parameters \(\hat{{\theta }}=[\alpha ,{\beta }^T]^T\) with \(\alpha \sim {\mathcal {N}}(0,\gamma ^2)\) and the vector \({\beta }\in {\mathbb {R}}^{d-1}\) with independent components following \(\beta _j\sim {\mathcal {N}}(0,\gamma ^2),j=1,\ldots ,d-1\), with \(\gamma ^2=1\). Given a sample \({x}_i\), we form the augmented vector \(\widetilde{{x}}_i=[1,{x}_i^T]^T\) and generate \(y_i\) as a Bernoulli random variable \(y_i\sim {\mathcal {B}}((1+\exp (-\hat{{\theta }}^T\widetilde{{x}}_i))^{-1})\). Concretely, we generate a simulated data set \(\{{x}_i,y_i\}_{i=1}^n\) with \(n=10^4\) samples, where \({x}_i\in {\mathbb {R}}^{d-1}\) and \(d-1=100\). The sampled parameters \({\theta }\in {\mathbb {R}}^d\) are assumed to have a prior \({\mathcal {N}}(0,\Sigma )\) with \(\Sigma =\mathrm {diag}\left\{ 25: j=1,\ldots ,d\right\} \).
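The generation procedure just described can be sketched in plain Python. This is an illustration under stated assumptions rather than the code used for the experiments: the function names, the seed handling and the list-based data layout are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (1 + exp(-z))^{-1}."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_logistic_data(n=10_000, d_minus_1=100, gamma2=1.0, seed=0):
    """Return (X, y, theta_hat) for the simulated logistic regression problem."""
    rng = random.Random(seed)
    # Covariate standard deviations: variances 25, 1 and 0.04 for
    # j <= 5, 5 < j <= 10 and j > 10 respectively.
    sigma = [5.0 if j <= 5 else 1.0 if j <= 10 else 0.2
             for j in range(1, d_minus_1 + 1)]
    gamma = math.sqrt(gamma2)
    # True parameters theta_hat = [alpha, beta_1, ..., beta_{d-1}].
    theta_hat = [rng.gauss(0.0, gamma) for _ in range(d_minus_1 + 1)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, s) for s in sigma]
        # theta_hat^T [1, x^T]^T: intercept plus linear predictor.
        logit = theta_hat[0] + sum(b * xj for b, xj in zip(theta_hat[1:], x))
        X.append(x)
        y.append(1 if rng.random() < sigmoid(logit) else 0)
    return X, y, theta_hat
```

Calling `simulate_logistic_data()` with the default arguments corresponds to \(n=10^4\), \(d-1=100\) and \(\gamma ^2=1\); the prior \({\mathcal {N}}(0,\Sigma )\) enters only at the sampling stage and is therefore not part of this sketch.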

Results are given in Table 1. The second column gives the number L of timesteps per proposal and the third the computational time s (in milliseconds) required to generate a single sample. The next columns give, for three observables, the products \(\tau \times s\), with \(\tau \) the integrated autocorrelation (IAC) time. These products measure the computational time to generate one independent sample. The notation \(\tau _\ell \) refers to the observable \(f(\theta )=\log ({\mathcal {L}}(\theta ;x,y))\) where \({\mathcal {L}}\) is the likelihood in (16), and \(\tau _{\theta ^2}\) refers to \(f(\theta )=\theta ^T\theta \). The degree of correlation measured by \(\tau _{\ell }\) is important in optimising the cost-accuracy ratio of predictions of y, while \(\tau _{\theta ^2}\) is relevant to estimating parameters of the distribution of \(\theta \) (Andrieu et al. 2003; Gelman et al. 2015). Following (Shahbaba et al. 2014), we also examine the maximum IAC over all the Cartesian components of \({\theta }\), since we set the time T in order to decorrelate the slowest-moving/least constrained component. Finally the last column provides the observed rate of acceptance.

Comparing the values of \(\tau \times s\) in the first four rows of the table shows the advantage, emphasized in (Shahbaba et al. 2014), of the \(H_0+U_1\) splitting (4) over the kinetic/potential splitting: Unconditioned KRK operates with smaller values of L than Unconditioned Verlet and attains smaller values of \(\tau \times s\). However, when comparing the results for Unconditioned Verlet A or B with those for Preconditioned Verlet, it is apparent that the advantage of using the Hessian \({\mathcal {J}}\) to split \(U=U_0+U_1\) with \(M=I\) is much smaller than the advantage of using \({\mathcal {J}}\) to precondition the integration while keeping the kinetic/potential splitting.

The best performance is observed for the Preconditioned KRK and RKR algorithms that avail themselves of the Hessian both to precondition and to use rotation instead of drift. Preconditioned RKR is clearly better than its KRK counterpart (see Appendix B). For this problem, as shown in Appendix A, \(U_1\) is in fact small and therefore the restrictions of the stepsize for the KRK integration are due to the stability reasons outlined in Section 3. In fact, for the unconditioned algorithms, the stepsize \({{\bar{\epsilon }}}_{KRK}=0.03\) is not substantially larger than \({{\bar{\epsilon }}}_{Verlet}=0.015\), in agreement with the analysis presented in that section.

The need to use large values of L in the unconditioned integration stems, as discussed above, from the large differences between the frequencies of the harmonic oscillators. In this problem the minimum and maximum frequencies are \(\omega _{\min }=2.6\) and \(\omega _{\max }=105.0\).

Real data

The three real datasets considered in (Shahbaba et al. 2014), StatLog, CTG and Chess, are also examined; see Tables 2–4. For the StatLog and CTG datasets with the unconditioned Hamiltonian, KRK does not really provide an improvement on Verlet. In all three datasets, the preconditioned integrators clearly outperform their unconditioned counterparts. Of the three preconditioned algorithms, Verlet is the worst and RKR the best.

StatLog

Here, \(n=4435\), \(d-1=36\). The frequencies are \(\omega _{\min }=0.5,\omega _{\max }=22.8\).

CTG

Here, \(n=2126\), \(d-1=21\). The frequencies are \(\omega _{\min }=0.2,\omega _{\max }=23.9\).

Chess

Here, \(n=3196\), \(d-1=36\). The frequencies are \(\omega _{\min }=0.3,\omega _{\max }=22.3\).