# Ensemble preconditioning for Markov chain Monte Carlo simulation


## Abstract

We describe parallel Markov chain Monte Carlo methods that propagate a collective ensemble of paths, with local covariance information calculated from neighbouring replicas. The use of collective dynamics eliminates multiplicative noise and stabilizes the dynamics, thus providing a practical approach to difficult anisotropic sampling problems in high dimensions. Numerical experiments with model problems demonstrate that dramatic potential speedups, compared to various alternative schemes, are attainable.

## Keywords

Stochastic sampling, Markov chain Monte Carlo, MCMC, Computational statistics, Machine learning, BFGS, Langevin methods, Brownian dynamics

## 1 Introduction

A popular family of methods for Bayesian parameterization in data analytics is based on Markov chain Monte Carlo (MCMC), including Hamiltonian (or hybrid) Monte Carlo (HMC) (Duane et al. 1987; Neal 2011; Monnahan et al. 2016) and the Metropolis adjusted Langevin algorithm (MALA) (Rossky et al. 1978; Bou-Rabee and Vanden-Eijnden 2010; Roberts and Tweedie 1996). These methods involve proposals that are based on approximating a continuous-time (stochastic) dynamics that exactly preserves the target (posterior) density \(\pi \), followed by an accept/reject step to correct for approximation errors.

Efficient parameterization of the stochastic differential equations used in these procedures has the potential to greatly accelerate their convergence, particularly when the target density is poorly scaled, i.e. when the Hessian matrix of the logarithm of the density has a large condition number (an example is given in “Appendix 1”). In precise analogy with well-established strategies in optimization (see e.g. Sun and Yuan 2006), the solution to conditioning problems in the sampling context is to find a well-chosen change of variables (preconditioning) for the system, such that the natural scales of the transformed system are roughly commensurate.

In this article, we discuss an approach to dynamic preconditioning based on simultaneously evolving an ensemble of parallel MCMC simulations, each of which is referred to as a “walker” or “particle”. As we will show, the walkers provide information that can greatly improve the efficiency of MCMC methods. There is a long history of using multiple parallel simulations to improve MCMC calculations (see e.g. (Gilks et al. 1994; ter Braak 2006; Goodman and Weare 2010; Andrés Christen and Fox 2010; Jasra et al. 2007; Cappé et al. 2004; Iba 2001; Hairer and Weare 2014; Hammersley and Morton 1954; Liu 2002; Rosenbluth and Rosenbluth 1955)). Many of these methods rely on occasional duplication or removal of walkers and reweighting of samples to speed sampling of densities with multiple modes or to compute tail averages. The schemes proposed in this article are more similar to methods introduced in (Gilks et al. 1994; ter Braak 2006; Andrés Christen and Fox 2010; Goodman and Weare 2010) that address conditioning issues using walker proposal moves informed by the positions of other walkers in the ensemble. These methods are not designed to directly address multimodality and do not involve any reweighting of samples. Our approach differs in that proposal moves are derived from time discretization of an SDE whose solutions exactly preserve \(\pi \) (or more precisely the joint density of an ensemble of independent random variables drawn from \(\pi \)). This results in ensemble MCMC schemes that converge rapidly on poorly conditioned distributions even in relatively high-dimensional sample spaces and when the details of the conditioning problems depend on position in sample space.

*J*(*x*) and *S*(*x*) are skew-symmetric and symmetric positive semi-definite \(D\times D\) matrices, respectively, with \(\eta (t)\) representing a vector of independent Gaussian white noise components. In our sampling schemes, each walker generates a discrete-time approximation of (1) with its own particular choice of *J*, which corresponds to a notion of the localized and regularized sample covariance matrix across the ensemble of walkers and incorporates information about the target density \(\pi \) into the evolution of each walker.
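For orientation, the general family characterized by Ma et al. (2015) can be sketched in the notation of this section as follows. This is a reconstruction under the assumption that (1) takes that standard form with target \(\pi \); the sign convention for *J* and the row-wise definition of the divergence are assumptions rather than a transcription.

```latex
\dot{x} = \bigl(J(x) + S(x)\bigr)\,\nabla \log \pi(x)
        + \mathrm{div}\bigl(J(x) + S(x)\bigr)
        + \sqrt{2 S(x)}\,\eta(t),
\qquad
\bigl(\mathrm{div}\, M(x)\bigr)_i = \sum_{j=1}^{D} \partial_{x_j} M_{ij}(x).
```

With \(J=0\) and \(S=I_D\), this form reduces to overdamped Langevin (Brownian) dynamics, \(\dot{x} = \nabla \log \pi (x) + \sqrt{2}\,\eta (t)\).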

Many existing sampling methods can be characterized as time discretizations of (1) (Ma et al. 2015). The matrix *S* is sometimes referred to as a mass matrix (though we reserve that term for a different matrix) and is often chosen to be diagonal. More general modifications of *S* (with \(J=0\)) to improve convergence have been considered in the Monte Carlo literature, dating at least to Bennett (1975). This idea has been the focus of renewed attention in statistics, and several recent approaches concerning this or related ideas have been proposed (Martin et al. 2012; Girolami and Calderhead 2011a). Though modification of *S* appears to be much more common in practice, several authors have considered the effect that the choice of *J* and *S* has on the ergodic properties of the solution to (1) from a more theoretical perspective (see e.g. Rey-Bellet and Spiliopoulos 2015; Duncan et al. 2016; Hwang et al. 2005, 1993). In this paper, we are concerned with motivating and presenting a particular choice of *S* and *J* based on the ensemble framework mentioned above and yielding practical and efficient sampling schemes. We demonstrate that the choice of *J* and *S* has important ramifications for the stability of the discretization scheme as well as for the overall sampling efficiency. This interplay will be explored in future work.

## 2 Preconditioning strategies for sampling

*N*. In many cases, we can expect the error in an MCMC scheme to satisfy a central limit theorem: \( \sqrt{N}\left( \overline{f}_N - \mathrm {E}[f] \right) \xrightarrow {dist} N(0, \tau \sigma ^2), \) where \(\sigma ^2\) is the variance of *f* under \(\pi \) (and is independent of the particular MCMC scheme), and \(\tau \) is the integrated autocorrelation time (IAT), which is often used to quantify the efficiency of an MCMC approach (see “Appendix 1”).

When \(\pi \) is Gaussian with covariance \({\varSigma }\), one can easily show that the cost to achieve a fixed accuracy depends on the condition number \(\kappa = \lambda _\mathrm{max}/\lambda _\mathrm{min}\) where \(\lambda _\mathrm{max}\) and \(\lambda _\mathrm{min}\) are the largest and smallest eigenvalues of \({\varSigma }\). Indeed, one finds that the worst-case IAT \(\tau \) for the scheme in (2) over observables of the form \(v^\text { T} x\) is \(\tau = \kappa -1\) (see “Appendix 1”). In this formula, the eigenvalue \(\lambda _\mathrm{min}\) arises due to the discretization stability constraint on the stepsize parameter \({\delta t}\) and \(\lambda _\mathrm{max}\) appears because the direction of the corresponding eigenvector is slowest to relax for the continuous-time process. The presence of \(\lambda _\mathrm{min}\) in this formula indicates that analysis of the continuous-time scheme (1) (i.e. neglect of the discretization stability constraint) can be misleading when considering the effects of poor conditioning on sampling efficiency. Since the central limit theorem suggests that the error after *N* steps of the scheme is roughly proportional to \( \sqrt{\tau /N}\), the cost to achieve a fixed accuracy is again roughly proportional to \(\kappa \).
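The \(\kappa -1\) scaling can be checked with a few lines of arithmetic. The sketch below assumes that the scheme in (2) is the Euler–Maruyama discretization of overdamped Langevin dynamics, \(x_{n+1} = x_n + {\delta t}\,\nabla \log \pi (x_n) + \sqrt{2{\delta t}}\,R_n\); that assumption is ours. For a Gaussian target, each eigen-direction of \({\varSigma }\) is then an AR(1) process with coefficient \(a = 1 - {\delta t}/\lambda \), whose IAT is \((1+a)/(1-a)\). The function names are hypothetical.

```python
import numpy as np

def ar1_iat(a):
    """Integrated autocorrelation time 1 + 2*sum_k a**k of an AR(1) process."""
    return (1.0 + a) / (1.0 - a)

def worst_case_iat(eigenvalues, dt):
    """Largest IAT over eigen-directions for the assumed Euler-Maruyama scheme.

    Each direction with covariance eigenvalue lam evolves as
    x_{n+1} = (1 - dt/lam) x_n + sqrt(2 dt) R_n  (assumption, see lead-in).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    a = 1.0 - dt / lam
    if np.any(np.abs(a) >= 1.0):
        raise ValueError("stepsize violates the stability constraint dt < 2*lambda_min")
    return float(np.max(ar1_iat(a)))

lam = [1.0, 10.0, 100.0]          # eigenvalues of Sigma, kappa = 100
kappa = max(lam) / min(lam)
dt = 2.0 * min(lam) * 0.999       # just inside the stability limit
tau = worst_case_iat(lam, dt)     # approaches kappa - 1 as dt -> 2*lambda_min
```

Reducing `dt` below the stability limit only increases `tau`, which is why \(\lambda _\mathrm{min}\) enters the worst-case estimate.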

*A* or the vector *v*. As the performance is independent of the choice of *A* and *v*, we can assume that *A* or *v* is chosen to improve the conditioning of the problem.

Due to the presence of the divergence term in the continuous dynamics, discretization will require evaluation of first-, second- and third-order derivatives of \(\log (\pi (x))\), making it prohibitively expensive for many models. To avoid this difficulty, one can estimate the divergence term using an extra evaluation of the Hessian (see “Appendix 6”), or omit the divergence term and rely on a Metropolization step to ensure correct sampling. Regardless of how this term is handled, the system (3), unlike (2), is based on multiplicative noise (where the magnitude of the noise process depends upon the state of the system) which is known to introduce complexity (and reduce accuracy) in numerical discretization (Milstein and Tretyakov 2004).

*S* does not depend on position, the scheme can be expected to perform poorly on problems for which the conditioning is dramatically different in different regions of space (e.g. the Hessian has high condition number and its eigenvectors are strongly position dependent), see Fig. 1. These observations suggest a choice of *S* corresponding to a notion of local covariance.

While a notion of local covariance will be central to the schemes we eventually introduce, we choose to incorporate that information not through *S* in (1), but through the skew-symmetric matrix *J* in that equation. In the remainder of this section, we discuss how the choices of *S* described so far, and the corresponding properties of (3), have analogues in choices of *J* and a family of so-called underdamped Langevin schemes that we next introduce (Pavliotis 2014; Leimkuhler and Matthews 2015).

*q*. For the distribution \(\varphi (p)\) we will follow common practice and use \(\varphi (p) \propto \exp (-\Vert p\Vert ^2/2)\). With this extension of the space, we recover the standard underdamped form of Langevin dynamics using *J* and *S* as follows:

*B* to linear in \(D\). As described in “Appendix 2”, schemes of the form in (7) can also be used to generate proposals in a Metropolis–Hastings framework to strictly enforce a condition that, like detailed balance, guarantees that \(\pi \) is exactly preserved.

Suppose that, when applied to sampling the density \(\pi _{A,v}\), an underdamped Langevin scheme of the form in (7) generates a sequence \((q^{(n)}, p^{(n)})\). The scheme will be referred to as affine invariant if the transformed sequence \((A q^{(n)}+v, p^{(n)})\) has the same distribution as the sequence generated by the method when applied to sample \(\pi \). As for (3), one can demonstrate that the choices \(B(q)B^\text { T}(q) = - (\nabla ^2 \log (\pi (q)))^{-1} \) and \(B(q)B^\text { T}(q) = {\varSigma }\) yield affine invariant sampling schemes (see “Appendix 5” for details). Recall that the choice of \(S(x)=-(\nabla ^2 \log (\pi (x)) )^{-1}\) in (3) also gave an affine invariant scheme, but that there the *S* matrix appears multiplying the noise (making it multiplicative).
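The affine invariance property can be illustrated pathwise for a Gaussian target and a constant *B*. The sketch below is an illustration under stated assumptions, not the discretization (7) itself: it uses a BAOAB-style splitting of \(\dot{q} = Bp\), \(\dot{p} = B^\text { T}\nabla \log \pi (q) - \gamma p + \sqrt{2\gamma }\,\eta (t)\), for which the divergence term vanishes because *B* does not depend on *q*. The function `langevin_path`, the friction `gamma` and the stepsize `h` are hypothetical names and values. Driving both runs with the same noise, the run on \(N(v, AA^\text { T})\) with \(B=A\) should reproduce the affine image of the run on \(N(0, I)\) with \(B=I\).

```python
import numpy as np

def langevin_path(grad_log_pi, B, q0, p0, h, gamma, noise):
    """Positions generated by a BAOAB-style splitting with constant preconditioner B."""
    c = np.exp(-gamma * h)              # exact Ornstein-Uhlenbeck damping over one step
    s = np.sqrt(1.0 - c * c)
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    path = [q.copy()]
    for xi in noise:
        p = p + 0.5 * h * B.T @ grad_log_pi(q)   # half kick
        q = q + 0.5 * h * B @ p                  # half drift
        p = c * p + s * xi                       # friction and noise
        q = q + 0.5 * h * B @ p                  # half drift
        p = p + 0.5 * h * B.T @ grad_log_pi(q)   # half kick
        path.append(q.copy())
    return np.array(path)

rng = np.random.default_rng(0)
D = 3
noise = rng.standard_normal((200, D))
# Invertible A: triangular with a nonzero diagonal.
A = 3.0 * np.eye(D) + 0.5 * np.tril(rng.standard_normal((D, D)), -1)
v = rng.standard_normal(D)
Sigma_inv = np.linalg.inv(A @ A.T)
y0, p0 = rng.standard_normal(D), rng.standard_normal(D)

# Reference run: standard Gaussian target, no preconditioning.
ref = langevin_path(lambda q: -q, np.eye(D), y0, p0, 0.1, 1.0, noise)
# Preconditioned run: target N(v, A A^T) with B = A, started at the mapped point.
pre = langevin_path(lambda q: -Sigma_inv @ (q - v), A, A @ y0 + v, p0, 0.1, 1.0, noise)

max_dev = np.max(np.abs(pre - (ref @ A.T + v)))   # zero up to round-off
```

The momenta are identical in the two runs, which mirrors the definition above in which only the positions are transformed.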

Before proceeding to the important issue of selecting a practically useful choice of *B*, we observe the following important properties of our formulation: (i) the stochastic dynamical system (6) exactly preserves the target distribution (see Ma et al. 2015) and thus, if discretization error is well controlled, Metropolis correction is not necessarily needed for the computation, and (ii) the formulation, with appropriate choice of *B*, is affine invariant, even under discretization (see “Appendix 5”), a property which ensures the stability of the method under change of coordinates. By contrast, we emphasize that schemes that modify *S* (instead of *J*) in (5) or that are based on a *q*-dependent normal distribution \(\varphi \) in (4) (e.g. within HMC as in (Girolami and Calderhead 2011a)), cannot be made affine invariant in the same sense, though they can be made to satisfy an alternative notion of affine invariance (see “Appendix 5”).

With the general stochastic quasi-Newton form in (7) as a template, one may consider many possible choices of *B*. Just as in optimization, in MCMC the question is not whether one should precondition, but rather how can one precondition in an affordable and effective way. Unfortunately, practical and effective quasi-Newton approaches for optimization do not have direct analogues in the sampling context, leaving a substantial gap between un-preconditioned methods and often impractical preconditioning approaches. In the next section, we suggest an alternative strategy to fill this gap: using multiple copies of a simulation to incorporate local scaling information in the *B* matrix in (7).

## 3 Ensemble quasi-Newton (EQN) schemes

*L* walkers (independent copies evolving under the same dynamics) with state \(x_i=(q_i,p_i)^\text { T}\), where subscripts now indicate the walker index. Each walker has position \(q_i\) and momentum \(p_i\) for \(i=1,\cdots ,L\), and we define the vectors \(Q=(q_1, q_2, \ldots , q_L)^\text { T}\in \mathcal {D}^L\) and \(P=(p_1,p_2,\ldots ,p_L)^\text { T} \in \mathbb {R}^{DL}\). We seek to sample the product measure \(\bar{\pi }\) whose marginals give copies of the distribution of interest \(\pi \):

*B*(*q*) preconditioning matrix in order to scale the dynamics based upon information from the other walkers. This preconditioning enters into the dynamics but not the invariant distribution which remains \(\bar{\pi }\). A popular alternative preconditioning strategy is to modify the mass matrix, i.e. the covariance of the Gaussian distribution \(\varphi \) in (4) (see e.g. Girolami and Calderhead 2011a or “Appendix 5”). In our context of ensemble-based schemes, this strategy would introduce substantial (and costly) communication between walkers at each evolution step.

Using *L* walkers, the global state \(x=(Q,P)\) consists of \(2DL\) total variables and *B*(*Q*) is a \(DL \times DL\) matrix. We will use \(B(Q)=\mathrm {diag}(B_1(Q),B_2(Q),\ldots ,B_L(Q))\) with each \(B_i(Q)\in \mathbb {R}^{D\times D}\) so that the position and momentum \((q_i,p_i)\) of walker *i* evolve according to (7) with *B*(*q*) replaced by \(B_i(Q)\). Note that the divergence and gradient terms in the equation for each walker are taken with respect to the \(q_i\) variable.

Within this quasi-Newton framework, there are many potential choices for the \(B_i\) matrix, with \(B_i=I_D\) reducing to the simulation of *L* independent copies of underdamped Langevin dynamics. Before exploring the possibilities, we remark that, in order to exploit parallelism, we will divide our *L* walkers into several groups of equal size in an approach similar to the emcee package (Foreman-Mackey et al. 2013). Walkers in the same group *g*(*i*) as walker *i* will *not* appear in \(B_i\) so that the walkers in any single group can be advanced in parallel independently. The fact that \(B_i\) is independent of walkers in the same group as walker *i* is vital when we introduce the Metropolis step to exactly preserve the target distribution (see “Appendix 2”).

We set \( Q_{[i]} = \{q_j\,|\,g(j)\ne g(i)\} \) and let *K* be the common size of these sets. For example, if we have 16 cores available we may wish to use ten groups of 16 walkers (so \(L=160\) and \(K=144\)). If walker *j* is designated as belonging to group 1, it evolves under the dynamics given in equation (7) but the set \(Q_{[j]}\) only includes walkers in groups \(2,\ldots ,10\). We may then iterate over the groups of walkers sequentially, moving all the walkers in a particular group in parallel with the others.
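The bookkeeping in this example can be sketched as follows. The array `groups` plays the role of \(g(i)\), and the helper `other_group_indices` (a hypothetical name) returns the indices of the walkers that make up \(Q_{[i]}\).

```python
import numpy as np

L, G = 160, 10                               # ten groups of 16 walkers
groups = np.repeat(np.arange(G), L // G)     # groups[i] is g(i)

def other_group_indices(i, groups):
    """Indices j with g(j) != g(i), i.e. the walkers contributing to Q_[i]."""
    return np.flatnonzero(groups != groups[i])

idx = other_group_indices(0, groups)
K = idx.size                                 # common size of the sets Q_[i]
```

With these values, `K` equals \(L - L/G = 144\), matching the count quoted above.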

*A* and vector *v*, generates a sequence of vectors \((q_1^{(n)},\dots , q_L^{(n)}, p_1^{(n)},\dots ,p_L^{(n)})\) with the property that the transformed sequence \((A q_1^{(n)} +v, \dots , Aq_L^{(n)}+v, p_1^{(n)},\dots , p_L^{(n)})\) has exactly the same distribution as the sequence generated by the ensemble scheme applied to \(\bar{\pi }\) (see “Appendix 5”). Just as choosing *B* as the square root of global covariance of \(\pi \) in (6) yields an affine invariant scheme, choosing the \(B_i\) as the square root of the ensemble covariance yields an affine invariant ensemble scheme. This affine invariance property suggests that ensemble schemes with \(B_i\) chosen as in (8) should perform well when the covariance of \(\pi \) has a large condition number. A related choice in the context of an overdamped formulation appears in Greengard (2015) and is shown to be affine invariant. An ensemble version of the HMC scheme using a mass matrix inspired by the BFGS optimization scheme appears in Zhang and Sutton (2011), though the relationship between that mass matrix and an approximation of the Hessian of \(\log (\pi )\) or its inverse seems unclear because the method does not evaluate the derivative of \(\log (\pi )\) at nearby points.
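A minimal sketch of the ensemble-covariance choice is given below. It assumes that \(B_i B_i^\text { T}\) is the sample covariance of the walkers in \(Q_{[i]}\) and that the square root is taken as a Cholesky factor; the precise definition in (8) may differ in normalization or regularization. The helper name `ensemble_B` is hypothetical. The final lines check the property behind affine invariance: mapping every walker through \(q \mapsto Aq+v\) maps \(B_iB_i^\text { T}\) to \(A B_i B_i^\text { T} A^\text { T}\).

```python
import numpy as np

def ensemble_B(Q_others):
    """Cholesky factor of the sample covariance of the walkers in Q_[i].

    Q_others has shape (K, D): one row per walker outside the group of walker i.
    """
    C = np.cov(Q_others, rowvar=False)       # D x D sample covariance
    return np.linalg.cholesky(C)

rng = np.random.default_rng(1)
K, D = 144, 4
scales = np.array([1.0, 0.1, 10.0, 3.0])     # poorly scaled synthetic ensemble
Q_others = rng.standard_normal((K, D)) * scales
B = ensemble_B(Q_others)

# Affine map of the ensemble, with an invertible (triangular) A.
A = 2.0 * np.eye(D) + 0.3 * np.tril(rng.standard_normal((D, D)), -1)
v = rng.standard_normal(D)
B_mapped = ensemble_B(Q_others @ A.T + v)

lhs = B_mapped @ B_mapped.T
rhs = A @ B @ B.T @ A.T
```

Note that `B_mapped` itself need not equal `A @ B`, since the Cholesky factor of a product is not the product of factors; only the products \(B_iB_i^\text { T}\) transform covariantly.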

*Q* in (11) is essential for preserving the validity of the scheme. Choosing \(\lambda =0\) reduces (11) to (10), whereas a large value of \(\lambda \) gives more refined estimation of the local scaling properties of the system. The divergence term in (7) can be computed explicitly by computing partial derivatives of \(B_i(q)\), making use of the formula for the derivative of the square root of a matrix: \(\partial _i M(x) = M \Phi (M^{-1} (\partial _i(MM^T) ) M^{-T})\), where \(\Phi (M)=\text {lower}(M) + \text {diag}(M)/2\). Note that the matrices \(B_i B_i^\text { T}\) for \(B_i\) in (11) are sums of the identity and *L* rank one matrices so that all manipulations involving \(B_i\) can be accomplished in linear cost in the dimension \(D\). In “Appendix 2”, we detail a Metropolis–Hastings step that can be implemented (if needed) to correct any introduced bias. Because our ensemble scheme preserves \(\pi \) exactly in the limit of small \({\delta t}\), one can also use the scheme without any Metropolis–Hastings step, improving the prospects for it to scale to very high dimension. Omission of the Metropolis–Hastings step for Langevin type methods is common practice in molecular dynamics MCMC simulations (see Leimkuhler and Matthews 2015) and has been considered in the context of computational statistics in (Dalalyan 2016; Durmus and Moulines 2016; Welling and Teh 2011).
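The derivative formula for the matrix square root can be checked numerically. The sketch below assumes that the square root is the lower-triangular Cholesky factor and that \(\text {lower}(\cdot )\) denotes the strictly lower-triangular part; both are our reading of the formula. The names `phi` and `cholesky_derivative` are hypothetical. The result is compared with a central finite difference along a symmetric perturbation direction `E`.

```python
import numpy as np

def phi(X):
    """Strictly lower-triangular part of X plus half of its diagonal."""
    return np.tril(X, -1) + 0.5 * np.diag(np.diag(X))

def cholesky_derivative(M, dA):
    """Derivative of the lower Cholesky factor M of A = M M^T in direction dA.

    dA is the (symmetric) derivative of A; the result is M Phi(M^{-1} dA M^{-T}).
    """
    X = np.linalg.solve(M, dA)          # M^{-1} dA
    X = np.linalg.solve(M, X.T).T       # M^{-1} dA M^{-T}
    return M @ phi(X)

rng = np.random.default_rng(2)
D = 5
G = rng.standard_normal((D, D))
A0 = G @ G.T + D * np.eye(D)            # symmetric positive definite
E = rng.standard_normal((D, D))
E = E + E.T                             # symmetric perturbation direction

M0 = np.linalg.cholesky(A0)
dM = cholesky_derivative(M0, E)

h = 1e-6                                # central finite-difference check
dM_fd = (np.linalg.cholesky(A0 + h * E) - np.linalg.cholesky(A0 - h * E)) / (2 * h)
fd_error = np.max(np.abs(dM - dM_fd))
```

The identity \(\partial (MM^\text { T}) = (\partial M) M^\text { T} + M (\partial M)^\text { T}\) gives a second, exact consistency check on `dM`.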

*L* walkers into *G* groups, where walker *w* is in group number *g*(*w*). We also choose the number of steps to take between parallel communication, \(T\le N\), and initialize the momentum vector for each walker *w* so \(p_w\sim N(0,I_D)\).

Typically it is most efficient to choose the size of each group to be a multiple of the number of available cores, in order to make the **parfor** loop efficient. The *step* function uses one new evaluation of the force \(\nabla \log (\pi )\) each time it is called, as well as a new evaluation of the *B* matrix and its derivative. We can minimize parallel communication by setting *T* large to infrequently broadcast the new walker data.
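The loop structure described here can be sketched in serial form as follows. This is a schematic reading of the description above rather than the published algorithm: the function `run_ensemble` and the signature of the `step` callable are hypothetical, the inner loop over walkers stands in for the **parfor** loop, and the snapshot `others` stands in for the data broadcast every *T* steps.

```python
import numpy as np

def run_ensemble(q, p, groups, step, n_steps, T):
    """Advance all walkers n_steps times, refreshing neighbour data every T steps.

    q, p   : arrays of shape (L, D) holding positions and momenta.
    groups : integer array of shape (L,), groups[w] = g(w).
    step   : callable (q_w, p_w, others) -> (q_w, p_w); `others` holds the
             positions of walkers outside the group of w (hypothetical signature).
    """
    q, p = q.copy(), p.copy()
    for start in range(0, n_steps, T):
        n_inner = min(T, n_steps - start)
        for g in np.unique(groups):              # groups are processed sequentially
            in_g = groups == g
            others = q[~in_g].copy()             # snapshot sent to this group's workers
            for w in np.flatnonzero(in_g):       # stands in for the parfor loop
                for _ in range(n_inner):
                    q[w], p[w] = step(q[w], p[w], others)
    return q, p

# Toy usage: a placeholder step that shifts the position by one unit.
seen_sizes = []
def toy_step(q_w, p_w, others):
    seen_sizes.append(others.shape[0])
    return q_w + 1.0, p_w

groups = np.array([0, 0, 1, 1, 2, 2])
q_out, p_out = run_ensemble(np.zeros((6, 2)), np.zeros((6, 2)), groups, toy_step, 5, 2)
```

Because `others` never contains walkers from the group being moved, the walkers within a group do not interact during the inner loop and could be advanced on separate cores.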

## 4 Numerical tests

We consider two numerical experiments to demonstrate the potential improvements that this method offers. A Python package with example implementations of the code is available at Matthews (2016).

### 4.1 Gaussian mixture model

*y* to a univariate mixture model as the sum of three Gaussian distributions. The state vector is described by the means, precisions and weights of the three Gaussian distributions, denoted \(\mu _i\), \(\lambda _i\) and \(z_i\), respectively. Due to the sum of the weights equalling unity, this gives us eight variables describing the mixture model. We also include a hyperparameter \(\beta \) that describes the rate parameter in the prior distribution on the precisions, giving \(D=9\) for the state overall. A full description of the problem is available in “Appendix 3”.

We consider the Hidalgo stamps benchmark dataset, studied in (Izenman and Sommer 1988), as the data *y* with 485 datapoints. This example is well suited to the local covariance approach we present above, due to the invariance of the likelihood under a permutation of components (the label-switching problem). Thus, the system admits sets of \(3!=6\) equivalent modes, see Fig. 2, each with a local scaling matrix that has the same eigenvalues with permuted eigenvectors.

Though strictly speaking the problem is multimodal, the high barriers between modes make hopping between the basins extremely unlikely (we did not observe any switching in any simulations). Thus, this problem effectively tests the exploration rate within one well, with the symmetry between the modes guaranteeing the same challenges in each basin. The walkers may initialize in the neighbourhood of different local modes so that using a “global” preconditioning strategy would be sub-optimal. The best preconditioning matrix for the current position of a walker depends on which mode is closest to the walker. Instead, we use the covariance information from proximal walkers as in (11) to determine the optimal scaling.

Computed autocorrelation times for slow variables, with the variable with the slowest motion marked in bold for each method

Scheme | \(\mathrm {min}(z)\) | \(\mathrm {max}(\lambda )\) | \(\mathrm {min}(\mu )\) | \(\beta \) |
---|---|---|---|---|
HMC | 21495 | | 27452 | 7148 |
Langevin dynamics | 6825 | | 8384 | 4641 |
Ensemble Q-N | 69 | 83 | 98 | |

We consider all three methods as equivalent in cost, as they require the same number of evaluations of \(\nabla \log (\pi )\) per step, and scale similarly with the size of the data vector *y*. Comparing the slowest motions of the system, the EQN scheme is about 100 times more efficient than Langevin dynamics and 350 times more efficient than HMC. We found that removing the divergence term in the EQN scheme had no significant impact on the results.

### 4.2 Log Gaussian Cox model

*X* from given observation data *Y*.

We make use of the RMHMC Matlab code template in our experiments (Girolami and Calderhead 2011b). In the model, we discretize the unit square into a \(32\times 32\) grid, with the observed intensity in each cell denoted \(Y_{i,j}\) and Gaussian field \(X_{i,j}\). We use two hyperparameters \(\sigma ^2\) and \(\beta \) to govern the priors, making the dimensionality of the problem \(D= 32^2+2=1026\). Full details of the model are provided in “Appendix 4”.

As the evaluation of the derivative of the likelihood is significantly cheaper with respect to the latent *x* variables (tests showed computing the derivatives with respect to the hyperparameters to be about one hundred times slower), we employ a partial resampling strategy to first sample the latent variables using multiple steps and then perform one iteration for the hyperparameter distribution.

We generate synthetic test data *Y*, plotted in Fig. 3, and compare the HMC and Langevin dynamics schemes to EQN (using 160 walkers) and the RMHMC scheme (Girolami and Calderhead 2011a). We additionally compare the results using the Langevin dynamics and EQN scheme without Metropolization, as the dynamics themselves sample \(\pi \), and the Metropolis step only serves to remove discretization error (which is dominated by the sampling error in this example). RMHMC uses Hessian information to obtain scaling data for the distribution. This gives it a significant increase in cost, but improves the rate at which the sampler decorrelates. For the model, the RMHMC scheme requires approximately 2.2 s per step, whereas the other schemes require approximately 0.35 s per step.

*x* variables. The efficiency is also shown, calculated as the reciprocal of the product of the wall time required per step and the autocorrelation time for the slowest hyperparameter (then normalized with respect to the HMC result). The slowest hyperparameter is compared instead of the slowest component of *x* because evolving the *x* dynamics requires less computation, hence it is trivial to reduce the autocorrelation time of *x* without significantly impacting the wall time.

Maximum autocorrelation times for each variable using each scheme

Scheme | \(x\) | \(\sigma ^2\) | \(\beta \) | Efficiency |
---|---|---|---|---|
HMC | 800.7 | 1041.6 | 1318.7 | 1.0 |
RMHMC | 2158.9 | 34.0 | 1502.0 | 0.15 |
LD | 405.1 | 140.6 | 435.3 | 3.5 |
\(\cdots \) (no Metropolis) | 81.6 | 20.5 | 136. | 11.2 |
EQN | 71.9 | 49.2 | 239.5 | 5.4 |
\(\cdots \) (no Metropolis) | 64.4 | 8.8 | 47.8 | 26.8 |
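As a rough check on the efficiency column, the sketch below assumes that efficiency is inversely proportional to the product of wall time per step and the autocorrelation time of the slowest hyperparameter, normalized so that HMC has efficiency 1; this is the reading consistent with the tabulated values. The function name `relative_efficiency` is hypothetical, and the timings are the approximate values quoted in the text, so the table is reproduced only approximately.

```python
def relative_efficiency(time_per_step, iat, ref_time, ref_iat):
    """Efficiency relative to a reference scheme, taken as 1 / (time_per_step * iat)."""
    return (ref_time * ref_iat) / (time_per_step * iat)

# HMC is the reference: about 0.35 s per step, slowest hyperparameter IAT 1318.7.
hmc_time, hmc_iat = 0.35, 1318.7

eff_hmc = relative_efficiency(hmc_time, hmc_iat, hmc_time, hmc_iat)
# RMHMC: about 2.2 s per step, slowest hyperparameter IAT 1502.0.
eff_rmhmc = relative_efficiency(2.2, 1502.0, hmc_time, hmc_iat)
# EQN with Metropolis step: about 0.35 s per step, slowest hyperparameter IAT 239.5.
eff_eqn = relative_efficiency(0.35, 239.5, hmc_time, hmc_iat)
```

With these inputs `eff_rmhmc` comes out near 0.14 and `eff_eqn` near 5.5, close to the tabulated 0.15 and 5.4; the residual differences are consistent with the per-step timings being rounded.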

In the results, the EQN scheme significantly outperforms the other methods, with the slowest motion of the system (the \(\beta \) hyperparameter) decorrelating more rapidly than with the HMC or Langevin schemes for approximately the same cost. The RMHMC scheme requires significant extra computation, making it much less efficient than the standard HMC scheme in this example.

## 5 Conclusion

We have presented a sampling algorithm that utilizes information from an ensemble of walkers to make more efficient moves through space, by discretizing a continuous ergodic quasi-Newton dynamics sampling the target distribution \(\pi (x)\). The information from the other walkers can be introduced in several ways, and we give two examples using either local or global covariance information. The two forms of the \(B_i\) preconditioning matrix are then tested on benchmark test cases, where we see significant improvement compared to standard schemes.

The EQN scheme is cheap to implement, requiring no extra evaluations of \(\nabla \log \pi (x)\) compared to schemes like MALA, and needing no higher derivative or memory terms. The scheme is also easily parallelizable, with communication between walkers being required infrequently. The dynamics (6) are novel in their approach to the introduction of the scaling information, and we build on previous work using walkers running in parallel to provide a cheap alternative to Hessian data.

The full capabilities of the EQN method, in the context of complex data science challenges, remain to be explored. It is likely that more sophisticated choices of \(B_i\) are merited for particular types of applications. The propagation of an ensemble of walkers also suggests natural extensions of the method to sensitivity analysis and to estimation of the sampling error in the MCMC scheme. Also left to be explored is the estimation of the convergence rate as a function of the number of walkers, which may be possible for simplified model problems.

## Notes

### Acknowledgements

The authors would like to thank Aaron Dinner, Andrew Duncan and Mark Girolami for many useful discussions and comments. We also would like to thank the anonymous referees for many helpful suggestions.

## References

- Andrés Christen, J., Fox, C., et al.: A general purpose sampling algorithm for continuous distributions (the t-walk). Bayesian Anal. **5**(2), 263–281 (2010)
- Bennett, C.H.: Mass tensor molecular dynamics. J. Comput. Phys. **19**(3), 267–279 (1975)
- Bou-Rabee, N., Vanden-Eijnden, E.: Pathwise accuracy and ergodicity of metropolized integrators for SDEs. Commun. Pure Appl. Math. **63**(5), 655–696 (2010)
- Bou-Rabee, N., Donev, A., Vanden-Eijnden, E.: Metropolis integration schemes for self-adjoint diffusions. Multiscale Model. Simul. **12**(2), 781–831 (2014). doi: 10.1137/130937470
- Cappé, O., Guillin, A., Marin, J.M., Robert, C.P.: Population Monte Carlo. J. Comput. Gr. Stat. **13**(4), 907–929 (2004). doi: 10.1198/106186004X12803
- Chopin, N., Lelièvre, T., Stoltz, G.: Free energy methods for Bayesian inference: efficient exploration of univariate Gaussian mixture posteriors. Stat. Comput. **22**(4), 897–916 (2012)
- Christensen, O.F., Roberts, G.O., Rosenthal, J.S.: Scaling limits for the transient phase of local Metropolis–Hastings algorithms. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **67**(2), 253–268 (2005)
- Dalalyan, A.S.: Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Stat. Methodol.) (2016). doi: 10.1111/rssb.12183
- Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B **195**(2), 216–222 (1987). doi: 10.1016/0370-2693(87)91197-X
- Duncan, A.B., Lelièvre, T., Pavliotis, G.A.: Variance reduction using nonreversible Langevin samplers. J. Stat. Phys. **163**(3), 457–491 (2016). doi: 10.1007/s10955-016-1491-2
- Durmus, A., Moulines, E.: High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm. https://hal.inria.fr/TELECOM-PARISTECH/hal-01304430v2 (2016)
- Foreman-Mackey, D., Hogg, D.W., Lang, D., Goodman, J.: emcee: the MCMC hammer. Publ. Astron. Soc. Pac. **125**(925), 306 (2013)
- Gilks, W.R., Roberts, G.O., George, E.I.: Adaptive direction sampling. J. R. Stat. Soc. Ser. D (Stat.) **43**(1), 179–189 (1994)
- Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **73**(2), 123–214 (2011a). doi: 10.1111/j.1467-9868.2010.00765.x
- Girolami, M., Calderhead, B.: Matlab code for the RMHMC scheme. http://www.ucl.ac.uk/statistics/research/rmhmc (2011b). [Online; accessed 01-Dec-2015]
- Goodman, J.: ACOR package. http://www.math.nyu.edu/faculty/goodman/software/ (2009). [Online; accessed 01-Dec-2015]
- Goodman, J., Sokal, A.D.: Multigrid Monte-Carlo method-conceptual foundations. Phys. Rev. D **40**(6), 2035–2071 (1989)
- Goodman, J., Weare, J.: Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci. **5**(1), 65–80 (2010)
- Greengard, P.: An ensemblized Metropolized Langevin sampler. Master’s thesis, NYU (2015)
- Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli **7**(2), 223–242 (2001)
- Hairer, M., Weare, J.: Improved diffusion Monte Carlo. Commun. Pure Appl. Math. **67**, 1995–2021 (2014)
- Hammersley, J.M., Morton, K.W.: Poor man’s Monte Carlo. J. R. Stat. Soc. B **16**(1), 23–38 (1954)
- Hwang, C.-R., Hwang-Ma, S.-Y., Sheu, S.-J.: Accelerating Gaussian diffusions. Ann. Appl. Probab. **3**(3), 897–913 (1993)
- Hwang, C.-R., Hwang-Ma, S.-Y., Sheu, S.-J., et al.: Accelerating diffusions. Ann. Appl. Probab. **15**(2), 1433–1444 (2005)
- Iba, Y.: Population Monte Carlo algorithms. Trans. Jpn. Soc. Artif. Intell. **16**(2), 279–286 (2001). doi: 10.1527/tjsai.16.279
- Izenman, A.J., Sommer, C.J.: Philatelic mixtures and multimodal densities. J. Am. Stat. Assoc. **83**(404), 941–953 (1988)
- Jasra, A., Stephens, D.A., Holmes, C.C.: On population-based simulation for static inference. Stat. Comput. **17**(3), 263–279 (2007). doi: 10.1007/s11222-007-9028-9
- Leimkuhler, B., Matthews, C.: Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Interdisciplinary Applied Mathematics. Springer International Publishing, New York (2015)
- Leimkuhler, B., Matthews, C., Stoltz, G.: The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics. IMA J. Numer. Anal. (2015). doi: 10.1093/imanum/dru056
- Liu, J.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2002)
- Ma, Y.A., Chen, T., Fox, E.: A complete recipe for stochastic gradient MCMC. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2917–2925. Curran Associates, Inc., New York (2015)
- Martin, J., Wilcox, L.C., Burstedde, C., Ghattas, O.: A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. **34**(3), A1460–A1487 (2012)
- Matthews, C.: Ensemble Quasi-Newton python package. http://bitbucket.org/c_matthews/ensembleqn (2016). [Online; accessed 01-Jul-2016]
- Milstein, G., Tretyakov, M.: Stochastic Numerics for Mathematical Physics. Springer, New York (2004)
- Monnahan, C.C., Thorson, J.T., Branch, T.A.: Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods Ecol. Evol. (2016). doi: 10.1111/2041-210X.12681
- Neal, R.M., et al.: MCMC using Hamiltonian dynamics. Handb. Markov Chain Monte Carlo **2**, 113–162 (2011)
- Pavliotis, G.A.: Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Texts in Applied Mathematics. Springer, New York (2014)
- Rey-Bellet, L., Spiliopoulos, K.: Irreversible Langevin samplers and variance reduction: a large deviations approach. Nonlinearity **28**(7), 2081 (2015)
- Roberts, G.O., Rosenthal, J.S.: Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Prob. **44**(2), 458–475 (2007)
- Roberts, G.O., Tweedie, R.L.: Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika **83**(1), 95 (1996). doi: 10.1093/biomet/83.1.95
- Rosenbluth, M.N., Rosenbluth, A.W.: Monte Carlo calculation of the average extension of molecular chains. J. Chem. Phys. **23**(2), 356–359 (1955)
- Rossky, P.J., Doll, J.D., Friedman, H.L.: Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. **69**(10), 4628–4633 (1978). doi: 10.1063/1.436415
- Sun, W., Yuan, Y.X.: Optimization Theory and Methods: Nonlinear Programming. Springer Optimization and Its Applications. Springer, USA (2006)
- ter Braak, C.J.F.: A Markov chain Monte Carlo version of the genetic algorithm differential evolution: easy Bayesian computing for real parameter spaces. Stat. Comput. **16**(3), 239–249 (2006)
- Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688 (2011)
- Zhang, Y., Sutton, C.A.: Quasi-Newton methods for Markov chain Monte Carlo. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 2393–2401. Curran Associates, Inc., New York. http://papers.nips.cc/paper/4464-quasi-newton-methods-for-markovchain-monte-carlo.pdf (2011)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.