Parallelizing MCMC Sampling via Space Partitioning

Efficient sampling of many-dimensional and multimodal density functions is a task of great interest in many research fields. We describe an algorithm that allows parallelizing inherently serial Markov chain Monte Carlo (MCMC) sampling by partitioning the space of the function parameters into multiple subspaces and sampling each of them independently. The samples of the different subspaces are then reweighted by their integral values and stitched back together. This approach allows reducing sampling wall-clock time by parallel operation. It also improves sampling of multimodal target densities and results in less correlated samples. Finally, the approach yields an estimate of the integral of the target density function.


Introduction
Markov chain Monte Carlo (MCMC) is a technique that allows generating samples with a distribution proportional to a given target density function. This technique is widely used in Bayesian statistics, statistical mechanics, computational biology, and many other fields of research. One of the major strengths of this technique is that it can converge to the target density even if the target function is highly multidimensional and multimodal. A major difficulty is that convergence is reached only asymptotically, and approaching the stationary distribution can require a very large number of sampling steps.
A Markov chain, by definition, consists of a series of consecutive steps that move a point or a set of points across the parameter space, which is an inherently serial process. A proposed displacement, by definition of the Markov process, does not depend on the history that led to the current location. Determining whether to accept a displacement involves evaluating the target density at the proposed location. For many real-world applications, evaluation of a target density can be very computationally costly, and there is usually a limit to how far a single target evaluation can be parallelized efficiently; this can make MCMC sampling very costly. A further complication stems from the fact that a large number of burn-in steps (the steps necessary for the MCMC to reach the stationary distribution) need to be performed for each MCMC chain before representative samples can be generated. The burn-in duration can even exceed the sampling time, especially for target densities that have a complex shape. While separate MCMC chains can be run independently and in parallel, simply increasing their number while producing fewer samples from each chain is therefore not an effective parallelization strategy, as the length of the burn-in process for each chain would not change.
Significant research has been conducted to enhance the efficiency of MCMC methods. The developments in this field can be divided into several categories [1]. The first is based on exploiting the geometry of the target density function. Hamiltonian Monte Carlo (HMC) [2] belongs to this category; it introduces an auxiliary variable, called momentum, which is refreshed by using the gradient of the density function. Symplectic integrators of different orders of precision have been developed to approximate the Hamiltonian equations [3,4]. HMC provides less correlated samples than the Metropolis-Hastings algorithm; however, the gradient of the density is not always readily available or cannot be computed in reasonable time. The second approach to accelerating MCMC is based on improving the proposal function. Techniques such as simulated tempering [5,6], adaptive MCMC [7], and multi-proposal MCMC [8,9] are available and have been shown to be effective for many applications [10,11,12,13,14]. The third approach is based on breaking initially complicated problems into simpler pieces. For example, separate MCMC chains explore the parameter space in parallel and the resulting samples are merged together [15,16]. As discussed earlier, this approach does not simplify convergence of the chains to the stationary distribution.
In this paper, we present an approach that improves MCMC sampling by partitioning the parameter space of the target density function into multiple subspaces and separately sampling the different subspaces. An effective partitioning can thereby change the task from sampling from a complicated target distribution to sampling from many, simpler target distributions. Any MCMC sampling algorithm can be used to generate samples in the subspaces, and all subspaces can be sampled independently and in parallel. This allows for massively parallel and distributed execution on multiple processors of multiple computer systems. After sampling, the integrated density of each subspace is calculated and the samples for each subspace are re-weighted correspondingly.
Our approach results in increased MCMC sampling efficiency, yielding samples with reduced correlations. It also reduces the required MCMC burn-in time significantly, since the possibly multiple modes of the full target density will ideally lie in separate subspaces; each chain will then only have to sample a unimodal density. In combination, these benefits result in a shorter total sampling time for each MCMC chain (including burn-in), which makes this approach of running many chains distributed over many subspaces efficient, whereas running many chains over the full density is not (due to the constant burn-in overhead).
We start with a brief review of Bayesian data analysis, as this is the primary application that we aim for. In section 3, the general idea of the algorithm and our implementation of it are described. In section 4, the performance of the algorithm is demonstrated using a mixture of four multivariate normal distributions as the target density function. Finally, section 5 summarizes the developments presented in this paper and discusses further directions.

MCMC for Bayesian Data Analysis
For a given model M, parameters λ, and data D, Bayes' theorem reads

P(λ|D, M) = P(D|λ, M) P_0(λ|M) / P(D|M),        (1)

where P(D|λ, M) denotes the likelihood that is used to update the prior probability density P_0(λ|M) of λ to the posterior probability density P(λ|D, M). The denominator is usually called the 'evidence' or 'marginal likelihood' and is given by the law of total probability,

P(D|M) = ∫ P(D|λ, M) P_0(λ|M) dλ.        (2)

MCMC methods do not require knowledge of the normalization constant P(D|M) to generate samples with the correct distribution. However, we require the normalizing constant in each subspace in order to reproduce the correct target distribution by patching together samples from different subspaces. Our approach thereby relies on being able to accurately calculate P_i(D|M), where i labels one of the subspaces. A variety of techniques exist that allow estimating the evidence of models; an overview can be found in [18]. These typically require a re-sampling of the target probability density function once modes have been identified, and only a few of the methods can be used in a post-processing step based on existing samples. We use the recently published AHMI algorithm [19] to evaluate the integrals in each subspace directly from the samples. This provides the user with the correctly weighted samples and allows for a simple calculation of the evidence for the full target distribution. This is relevant for evaluating, amongst other things, the Bayes factor used in model comparisons.
Sampling Parallelization via Space Partitioning

Overview
We consider generating samples according to a target density function f(λ), where λ ∈ Ω ⊂ R^m and Ω is the support of the function. To illustrate our method, we will use as an example the sum of four bivariate normal distributions with λ = (λ_1, λ_2):

f(λ) = ∑_{i=1}^{4} a_i N(λ | µ_i, Σ_i),        (3)

where a_1 = a_2 = 0.48, a_3 = a_4 = 0.02, µ_i = (±3.5, ±3.5), Σ_1 = Σ_2 = (0.33, 0.17; 0.17, 0.33), and Σ_3 = Σ_4 = (0.019, −0.003; −0.003, 0.017). Each bivariate normal distribution is individually normalized. Two of the components, in the upper-right and lower-left quadrants, have large weights (0.48) and the other two, in the remaining quadrants, have small weights (0.02). The covariances are small compared to the separations of the modes, making this a challenging target distribution for many MCMC algorithms. Probability contours of this test function are shown in Fig. 1-a.
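For reference, the sketch below expresses this example density in Julia, the language of our implementation. The function and variable names (mixture_density, weights, means, covs) and the use of the Distributions.jl package are illustrative choices and not part of the algorithm itself.

```julia
# Sketch of the bivariate example density of Eq. 3: a weighted sum of four
# individually normalized bivariate normal distributions.
using Distributions

weights = [0.48, 0.48, 0.02, 0.02]
means   = [[3.5, 3.5], [-3.5, -3.5], [3.5, -3.5], [-3.5, 3.5]]
covs    = [[0.33 0.17; 0.17 0.33], [0.33 0.17; 0.17 0.33],
           [0.019 -0.003; -0.003 0.017], [0.019 -0.003; -0.003 0.017]]
components = [MvNormal(means[i], covs[i]) for i in 1:4]

# f(λ) for a point λ = (λ1, λ2)
mixture_density(λ) = sum(weights[i] * pdf(components[i], λ) for i in 1:4)

mixture_density([3.5, 3.5])   # density close to one of the heavy modes
```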
Our approach consists of the following five steps:

1. Generate a set of N_exp exploration samples {λ*_i}, i = 1..N_exp, in Ω, distributed amongst N_chains chains, where N_exp is a small number compared to the desired number of final MCMC samples and N_chains is the number of chains. The chains should have different (possibly randomly chosen) starting points. The samples are used to find the region(s) of the parameter space with a high density; the MCMC chains are not required to converge. An initial sampling of our example function with N_exp = 500, generated using N_chains = 25 with 20 samples per chain, is shown in Fig. 1-a.

2. Partition the parameter space into N_sp mutually exclusive subspaces {ω_k}, k = 1..N_sp, in such a way that ∪_k ω_k = Ω and ω_k ∩ ω_m = ø if k ≠ m (see also Fig. 1-b). While in general the boundaries of the subspaces could have arbitrary shapes, in the following 'N space partitions' will refer to N cuts along different, single parameter axes. This splits the parameter space into N + 1 rectangular subspaces.

3. Generate samples {λ_k} ∈ ω_k in each subspace k with a distribution proportional to f(λ), using a sampling algorithm of choice (see Fig. 1-c). Note that each sampler has to perform its burn-in cycle only in the reduced subspace ω_k, which significantly reduces tuning time.

4. Determine the integrated density of the target distribution in each subspace by computing I_k = ∫_{ω_k} f(λ) dλ and assign to the samples of subspace k a weight proportional to I_k / N_k, where N_k is the number of samples generated in ω_k, so that the total weight of each subspace is proportional to its integral.

5. Stitch the now weighted samples together, resulting in the final sampling distribution (see Fig. 1-d).
There are many ways of implementing the described idea based on the choice of samplers, integrators, and space partitioning strategies. In the following, we describe our implementation, which is also made available in the Bayesian Analysis Toolkit package [20].

Exploration samples
Exploration samples play an important role in this algorithm, since the parameter space is partitioned based on them. If the exploration samples represent the structure of the target density closely enough - for example, indicating the presence of multiple modes by clusters of spatially neighboring points - then the space partitioning algorithm can capture these features and generate partitions in such a way as to split those clusters. This simplifies the target density in each subspace and thus allows for much faster burn-in and tuning procedures. In our implementation, we generate exploration samples by running a large number of MCMC chains, where each chain generates a few hundred samples. There is no tuning or convergence requirement for these chains, but a small set of samples is initially used to set the parameters of the proposal functions for each chain. Some knowledge of the form of the target distribution is useful in determining how many chains and how many samples will be necessary. While the morphology of the resulting sample clouds should resemble that of the target density as closely as possible, this initial exploration should be fast compared to the subsequent sampling time in the partitioned space.
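As an illustration of this step, the following sketch (assuming the mixture_density function from the earlier example) draws exploration samples from many short random-walk Metropolis-Hastings chains. The proposal scale, chain count, and starting-point range are placeholder values, not tuned settings from our implementation.

```julia
# Sketch: exploration samples from many short, unconverged MH chains.
function short_mh_chain(logf, x0, nsteps; step = 0.5)
    samples = Vector{typeof(x0)}()
    x, logfx = x0, logf(x0)
    for _ in 1:nsteps
        y = x .+ step .* randn(length(x))       # symmetric random-walk proposal
        logfy = logf(y)
        if log(rand()) < logfy - logfx          # Metropolis acceptance rule
            x, logfx = y, logfy
        end
        push!(samples, copy(x))
    end
    return samples
end

logf(λ) = log(mixture_density(λ))
nchains, nsteps = 25, 20                        # as in the example of Fig. 1-a
starts = [8 .* rand(2) .- 4 for _ in 1:nchains] # random starting points in [-4, 4]^2
exploration = reduce(vcat, [short_mh_chain(logf, s, nsteps) for s in starts])
```

No burn-in or convergence checks are applied here; the resulting cloud of points only needs to reflect the gross structure of the target.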

Space Partitioning
Given the discussed exploration samples, we partition the parameter space into rectangular subspaces in such a way as to split clusters of spatially neighboring samples. To do so, a binary tree is used where each node is determined by a cut that is orthogonal to one of the parameter axes. For the sake of illustration, we consider a one-dimensional problem with exploration samples {λ*} (see the upper histogram in Fig. 2). The position of the cut perpendicular to the λ axis is denoted as a, and it is selected by finding the minimum of the following cost function:

W(a, λ*) = √( ∑_{λ*_i < a} (λ*_i − λ̄_{<a})² ) + √( ∑_{λ*_i ≥ a} (λ*_i − λ̄_{≥a})² ),        (4)

where λ̄_{<a} and λ̄_{≥a} denote the means of the samples on either side of the cut. This process is then repeated iteratively, resulting in the desired number of partitions. The blue lines in Fig. 2 demonstrate how this cost function depends on the cut position for 3 partitioning steps. The partitioning procedure is stopped when further partitioning results in only a minimal change in the cost function or when a maximum number of subspaces is reached. The evolution of the cost function for our example is also shown in Fig. 2.
The partitioning procedure is analogous in higher-dimensional spaces. If, for instance, the sample vectors are M-dimensional, then we evaluate Eq. 4 for every dimension, which results in proposed cut positions a_i with corresponding cost values W_i(a_i, λ*) for dimensions i = 1, ..., M. The cut along the dimension with the minimum cost value is accepted. Additionally, if preliminary knowledge about the structure of the target density is available, the user can specify manually along which parameters the partitioning of the parameter space should be performed.
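The cut selection described above can be sketched as follows. The code assumes the exploration samples from the previous sketch and the spread-based cost of Eq. 4; the function names (cut_cost, best_cut) and the scan over a fixed grid of candidate positions are illustrative simplifications.

```julia
# Sketch: choose one axis-aligned cut by minimizing the cost of Eq. 4
# over candidate positions in every dimension.
function cut_cost(samples, dim, a)
    left  = [s[dim] for s in samples if s[dim] <  a]
    right = [s[dim] for s in samples if s[dim] >= a]
    (isempty(left) || isempty(right)) && return Inf
    spread(v) = sqrt(sum((v .- sum(v) / length(v)) .^ 2))   # √Σ(λ* − mean)²
    return spread(left) + spread(right)
end

function best_cut(samples; ncandidates = 50)
    M = length(first(samples))
    best = (cost = Inf, dim = 0, pos = NaN)
    for dim in 1:M
        vals = [s[dim] for s in samples]
        for a in range(minimum(vals), maximum(vals); length = ncandidates)
            c = cut_cost(samples, dim, a)
            c < best.cost && (best = (cost = c, dim = dim, pos = a))
        end
    end
    return best
end

best_cut(exploration)   # e.g. a cut near zero, separating the two heavy modes
```

Repeating this procedure on the resulting sub-samples builds the binary tree of rectangular subspaces.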

Sampling
Sampling in the subspaces is performed independently and does not require communication between the MCMC processes. It can therefore be trivially divided into tasks and executed in parallel on multiple processors using distributed computing. In the following, we define a 'worker' as a computing unit (this can be a node of a cluster, a networked machine, or a single machine) that consists of multiple CPU cores and is used to perform one task, i.e., to sample the target density in one subspace. All the cores that belong to one worker are called threads and are used to run multiple MCMC chains in parallel within one subspace. Running multiple chains within one subspace is advantageous for determining convergence of the MCMC process. Our implementation allows running subspaces on multiple remote hosts using Julia's support for compute clusters; MPI or TCP/IP protocols can be used for communication between workers. By default, the Metropolis-Hastings algorithm is used to generate MCMC samples in each subspace; however, any other sampling algorithm can be used.
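The sketch below illustrates how the work can be distributed with Julia's Distributed standard library, running one task per subspace via pmap. The Subspace type, the sample_subspace helper, and the use of a plain random-walk Metropolis-Hastings kernel restricted to the subspace are assumptions made for this example; any sampler can be substituted.

```julia
# Sketch: one parallel sampling task per subspace using Distributed.
using Distributed
addprocs(4)                         # e.g. one local worker process per subspace

@everywhere begin
    # A rectangular subspace described by axis-aligned bounds (illustrative).
    struct Subspace
        lo::Vector{Float64}
        hi::Vector{Float64}
    end
    inside(λ, sp::Subspace) = all(sp.lo .<= λ) && all(λ .< sp.hi)

    # Random-walk MH restricted to one subspace: proposals leaving ω_k are
    # rejected, so the chain targets f(λ) truncated to the subspace.
    # The starting point x0 must lie inside sp.
    function sample_subspace(logf, sp::Subspace, x0, nsteps; step = 0.3)
        out = Vector{typeof(x0)}()
        x, logfx = x0, logf(x0)
        for _ in 1:nsteps
            y = x .+ step .* randn(length(x))
            if inside(y, sp)
                logfy = logf(y)
                if log(rand()) < logfy - logfx
                    x, logfx = y, logfy
                end
            end
            push!(out, copy(x))
        end
        return out
    end
end

# `logf`, `subspaces` and `starts` must also be defined on all workers
# (e.g. via @everywhere); each subspace is then sampled as an independent task:
# results = pmap(k -> sample_subspace(logf, subspaces[k], starts[k], 10_000),
#                1:length(subspaces))
```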

Reweighting
Samples that originate from different subspaces have different, and a priori unknown, normalizations with respect to each other. In order to correctly stitch them together, a weight proportional to the integral of the target density within the subspace needs to be applied. Given that samples are drawn from the target function, λ_k ∼ f(λ) in each subspace k, we compute the integrals

I_k = ∫_{ω_k} f(λ) dλ.        (5)

As noted earlier, we use the Adaptive Harmonic Mean Integration (AHMI) algorithm for this purpose. Given samples drawn according to a density proportional to the function, AHMI estimates the integral of the function and the uncertainty of that estimate, acting on the samples in a post-processing step with no need to re-evaluate the target density. The AHMI algorithm in its current form has been used to accurately integrate functions in up to 20 dimensions.
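A minimal sketch of the reweighting step is given below. It assumes the Subspace type and per-subspace sample arrays from the previous sketch, and it uses a naive uniform Monte Carlo estimate of I_k purely as a stand-in for AHMI, which is what we use in practice.

```julia
# Sketch: per-subspace weights from the integrals I_k of Eq. 5.
# A naive uniform Monte Carlo estimate over the rectangular subspace is used
# here only for illustration; our implementation obtains I_k from AHMI.
function naive_integral(f, sp::Subspace; npoints = 100_000)
    vol = prod(sp.hi .- sp.lo)
    acc = 0.0
    for _ in 1:npoints
        λ = sp.lo .+ rand(length(sp.lo)) .* (sp.hi .- sp.lo)   # uniform point in ω_k
        acc += f(λ)
    end
    return vol * acc / npoints
end

# With `results[k]` the samples of subspace k and `subspaces` its bounds:
# I  = [naive_integral(mixture_density, sp) for sp in subspaces]
# wk = [I[k] / length(results[k]) for k in eachindex(results)]   # weight per sample of ω_k
# weighted = [(λ, wk[k]) for k in eachindex(results) for λ in results[k]]
```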

Final sample
Once the weighting is performed, the samples from the different subspaces are concatenated and returned to the user. The total integral of the function f(λ) is then estimated by summing the subspace integrals, I = ∑_{k=1}^{N_sp} I_k.
Our implementation was used to generate the results shown in Fig. 1.

Performance
In this section, we evaluate the performance of our algorithm on a more complicated test density function. The function was chosen in such a way as to (a) have a known analytic integral, (b) allow generating independent and identically distributed (i.i.d) samples, and (c) have multiple modes in many dimensions and thus be challenging for a classical Metropolis-Hastings algorithm. With this aim, we have chosen a mixture of four multivariate normal distributions in 9-dimensional space:

f(λ) = ∑_{i=1}^{4} a_i N(λ | µ_i, Σ_i),        (6)

where all a_i = 1/4 and the µ_i and Σ_i are randomly assigned mean vectors and covariance matrices. The full parameter set used for this performance test is given in the Appendix. Figure 3 illustrates one- and two-dimensional distributions of 10^5 i.i.d samples drawn from this density. There are two primary points that we demonstrate in this section: first, the ability to reduce the wall-clock time spent on sampling by efficiently utilizing computational resources; second, the ability to improve the quality of the samples as the number of space partitions is increased. The performance measurements were set up as follows:

• We use a varying number of subspaces S = (1, 2, 4, 8, 16, 32). Sampling and integration in different subspaces are executed in parallel, using 1 worker per subspace with 10 CPU cores per worker. All the CPU cores available to a worker are used for multithreaded chain execution.
• In addition, we also vary the wall-clock time that workers can spend on generating samples, considering time intervals of 3, 7, 11, and 15 seconds.
• For every combination of space partitions and wall-clock times, we repeat the sampling process 3 times to evaluate statistical fluctuations.
The overall number of MCMC runs is 72. An example run is: 8 subspaces and 8 workers with 10 CPU cores each are used to sample 10 chains for 11 seconds of wall-clock time, after which samples are integrated and returned; sampling with this setting is repeated 3 times.

Sampling Rate
A summary of the benchmark runs is presented in Fig. 4. We used the MPCDF HPC system DRACO with Intel 'Haswell' Xeon E5-2698 processors (880 nodes with 32 cores @ 2.3 GHz each) to perform the parallel MCMC executions. When sampling with space partitions, each subspace requires a different amount of time for sampling and integration, depending on the complexity of the underlying density region; the time of the slowest one is reported in Fig. 4. It can be seen that, by changing the number of subspaces from 1 to 32 (and the total number of CPU cores from 10 to 320), the number of generated samples increases by almost two orders of magnitude while the wall-clock time remains constant.
Figure 4 can be rearranged into a slightly different form in order to demonstrate the sampling rate. We define N_0 as the number of samples that the sampler with no space partitions has generated during the time interval ∆t_0 = t_stop − t_start (t_stop is the wall-clock time when integration has finished and t_start is the wall-clock time when sampling in a subspace has started). We further denote by N_k the total number of samples from the run with k subspaces, and by ∆t_k the corresponding wall-clock time, taken as that of the slowest subspace. The sampling rate relative to the run without partitioning is defined as

r = (N_k / ∆t_k) / (N_0 / ∆t_0).        (7)

In addition to the wall-clock time, we measure the CPU time spent on sampling and integration in each subspace using the CPUTime.jl package. We denote it by τ_i, where i is the subspace index, and τ_0 is the CPU time when no space partitions are used. The per-chain sampling rate is defined analogously using CPU time,

r_chain = (N_k / ∑_i τ_i) / (N_0 / τ_0).        (8)

Eq. 7 and Eq. 8 do not include the time spent on the generation of exploration samples and on the construction of the partition tree. The time spent on the generation of exploration samples depends primarily on the complexity of a likelihood evaluation; for our problem, it is equal to 4 seconds. The time required to generate the space partition tree depends primarily on the number of exploration samples; it is about 2 seconds for our problem.
The sampling rate and the per-chain sampling rate versus the number of space partitions are presented in Fig. 5. The figure shows that both rates improve with the number of partitions. The improvement in the per-chain sampling rate indicates that, by partitioning the parameter space, we simplify the target density function, resulting in faster tuning and convergence (tuning and convergence take place in every subspace). While an improvement in the sampling rate is expected from the scaling of the number of CPU cores alone, its faster-than-linear behavior can be explained by the superposition of this linear scaling and the improved per-chain sampling rate.

Density Integration and Effective Sample Size
Another important characteristic to track is the integral estimate of the target density function. If, for example, the samples do not correctly represent the target function, then the integral will deviate from the truth. By partitioning the parameter space we simplify the tasks for both the sampler and the integrator; the complicated problem has been split into a number of simpler ones. This results in better integral estimates, as can be seen in Fig. 4 (middle subplot).
Samples that originate from one MCMC chain are correlated. The degree of sample correlation depends on the acceptance probability, the number of chains, the complexity of the target density function, etc. The effective number of samples can be estimated as

N_eff = N / τ,        (9)

where N is the number of samples and τ is the integrated autocorrelation time, estimated via the normalized autocorrelation function ρ_k(t):

τ_k = 1 + 2 ∑_{t=1}^{∞} ρ_k(t),   ρ_k(t) = ( ∑_{i=1}^{N−t} (λ_{k,i} − λ̄_k)(λ_{k,i+t} − λ̄_k) ) / ( ∑_{i=1}^{N} (λ_{k,i} − λ̄_k)² ),        (10)

where k refers to one of the M dimensions of the multivariate sample λ_i = (λ_{1,i}, ..., λ_{M,i}) and λ̄_k is the average of the k-th component over all N samples. In order to compute the sum in Eq. 10, we use a heuristic cut-off given by Geyer's initial monotone sequence estimator [21]. This technique allows us to calculate an effective sample size for each dimension, N_eff,k = N / τ_k. As shown in Fig. 4 (right subplot), the effective number of samples increases with the number of space partitions. It can also be seen that in our example there is no further increase in the fraction of effective samples when the number of subspaces exceeds 8.
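A sketch of this estimate for a single dimension is shown below. The empirical autocorrelation follows Eq. 10, while Geyer's initial positive sequence is used here as a simple stand-in for the monotone-sequence cut-off described above.

```julia
# Sketch: effective sample size for one dimension of a chain (Eq. 9 and Eq. 10),
# truncating the autocorrelation sum at the first non-positive pair of lags.
function autocorrelation(x, t)
    N = length(x)
    xbar = sum(x) / N
    num = sum((x[i] - xbar) * (x[i + t] - xbar) for i in 1:N - t)
    den = sum((xi - xbar)^2 for xi in x)
    return num / den
end

function effective_sample_size(x)
    N = length(x)
    τ = 1.0
    m = 0
    while 2m + 1 < N
        Γ = autocorrelation(x, 2m) + autocorrelation(x, 2m + 1)   # pair of lags
        Γ <= 0 && break                       # heuristic cut-off of the sum
        τ = (m == 0 ? 2Γ - 1 : τ + 2Γ)        # builds τ = 1 + 2 Σ ρ(t) pairwise
        m += 1
    end
    return N / τ
end

# effective_sample_size([λ[k] for λ in samples])   # repeat for each dimension k
```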

Summary of Scaling Performance
A summary of the performance enhancement from partitioning and sampling in parallel is presented in Fig. 6. The accuracy of the integrals of the function for different numbers of subspaces as a function of sampling wall-clock time is shown in the left panel. There, we see that even with the longest running times tested, the integral estimate without partitioning deviates considerably from the correct value. Good results are seen already for the shortest running times with 4 subspaces, and running on 32 subspaces gives excellent results for all running times tested.
The right panel in Fig. 6 shows the average number of effective samples for different combinations of numbers of subspaces and running times. We find a dramatic increase in the number of effective samples: the factor achieved in the effective number of samples is an order of magnitude larger than the increase in the computing resources (number of processors). This is due to the much simpler forms of the distributions sampled in the subspaces.
Both of these results show that a strong scaling of the performance is achieved using the partitioning scheme that we have outlined.

Sampling Accuracy
Since our algorithm requires the weighting of samples from different subspaces and stitching them together, we test whether this results in a smooth posterior approximating the true target density function. For comparison, we also approximate the true density by directly generating i.i.d samples. In the following, we will use the two-sample Kolmogorov-Smirnov test [22] and a two-sample classifier test [23] as a quantitative assessment of how close our MCMC samples are to the i.i.d ones.

Kolmogorov-Smirnov Test
The two-sample Kolmogorov-Smirnov test is used to check whether two one-dimensional marginalized samples come from the same distribution. P-values from this test for every marginal and for different numbers of space partitions are shown in Fig. 7.
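For one marginal, the two-sample Kolmogorov-Smirnov statistic and its large-sample p-value can be sketched as below. The asymptotic series is the standard Kolmogorov approximation; in practice a library implementation of the test can be used instead.

```julia
# Sketch: two-sample KS statistic and asymptotic p-value for one marginal,
# comparing (equal-weight) MCMC samples against i.i.d samples.
function ks_two_sample(x, y)
    xs, ys = sort(x), sort(y)
    n, m = length(xs), length(ys)
    i = j = 0
    D = 0.0
    while i < n && j < m                       # walk both sorted samples
        xs[i + 1] <= ys[j + 1] ? (i += 1) : (j += 1)
        D = max(D, abs(i / n - j / m))         # largest gap between the two ECDFs
    end
    λ = sqrt(n * m / (n + m)) * D
    λ < 0.4 && return D, 1.0                   # asymptotic series unreliable here
    p = 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * λ^2) for k in 1:100)
    return D, clamp(p, 0.0, 1.0)
end
```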

Classification Test
A second approach to determining whether two samples are similar to one another uses a binary classifier that aims to distinguish them. A training dataset is constructed by pairing MCMC and i.i.d samples with opposite labels. If the samples are indistinguishable, the classification accuracy on the test dataset should be close to that obtained from a random guess. To train a classifier, we use a simple neural network model with two dense layers of sizes 9 × 20 and 20 × 2, and the sigmoid activation function. To construct the training and testing datasets, we generate MCMC and i.i.d samples with equal weights. By default, i.i.d samples come with weight one. For our weighted MCMC samples, however, we generate 3 · 10^4 samples with unit weights by using the ordered resampling implemented in [20]. In total, 4 · 10^4 samples are used for training and 2 · 10^4 for testing, with an equal fraction of MCMC and i.i.d samples. Training is performed individually for every MCMC run described in Fig. 4 and the results are presented in Fig. 8. The left subplot in Fig. 8 shows the modified receiver operating characteristic (ROC), where the vertical axis is the difference between the true positive rate (TPR) and the false positive rate (FPR) for the different MCMC runs. If the classifier cannot distinguish the two samples, the line will fluctuate around zero. The right subplot in Fig. 8 shows the integral under the ROC curve (expected to be close to 0.5 for indistinguishable distributions) versus the number of MCMC samples (before resampling was performed). It can be seen that, even though the training datasets consist of the same number of MCMC samples, there is a difference in their distinguishability. Samples obtained from the runs with a large number of space partitions have ROC curve integrals much closer to 0.5. It was not possible to detect this difference in the 1d Kolmogorov-Smirnov tests.

Conclusions
We have presented an approach to both improve and accelerate MCMC sampling by partitioning the parameter space of the target density function into multiple subspaces and sampling independently in each subspace. These subspaces can be sampled in parallel and the resulting samples are then stitched together with appropriate weights. The scheme relies on a good space partitioning, which we achieve using a binary partitioning algorithm, and on a good integrator for determining the weights assigned to the samples in the different subspaces. The integrations in our examples were performed using the AHMI algorithm [19]. The parallelized MCMC sampling algorithm described in this paper has been implemented in the Bayesian Analysis Toolkit [20].
We have benchmarked this technique by evaluating the quality of samples and the sampling rate for a mixture of four multivariate normal distributions in 9-dimensional space. We demonstrate that the space partitioning allows us to obtain a 50-fold increase in the sampling rate while increasing the number of CPUs by a factor of 32. This increase is a superposition of two effects: a linear scaling with the number of CPU cores, and a CPU-time reduction due to the simplification of the target density function. In addition to the increase in the sampling rate, sampling with space partitioning also resulted in an increased quality of the MCMC samples by reducing their correlations. This was evidenced in particular by more accurate integral values of the target density.
We have evaluated the correctness of the resulting sampling distributions by comparing the MCMC samples with i.i.d samples using a two-sample Kolmogorov-Smirnov test and a two-sample classifier test. Both show that increasing the number of space partitions leads to a better agreement between MCMC and i.i.d samples.
Finally, this approach provides the user with the normalization constant of the target density function - the Bayesian evidence is provided at no extra cost.

A Appendix
Covariance matrices used in Eq. 6 are diagonal with the size 9 × 9 and are equal to

Figure 1 :
Figure 1: Color maps displaying the steps in the partitioned sampling approach. (1) The 500 samples from the exploration step for the target density defined in Eq. 3. The red dashed lines show contours of the true density. (2) The result of partitioning the parameter space into 30 subspaces. The black lines show the boundaries of the subspaces. (3) The 10^4 samples in each subspace from the individual MCMC chains. (4) The weighted samples displayed in the full space.

Figure 2 :
Figure 2: Illustration of the space partitioning algorithm using 5 · 10^3 one-dimensional exploration samples λ* with the distribution shown in the upper left histogram. The red dashed lines indicate the first three cut positions a. The blue lines show the value of W(a, λ*) as a function of the cut position for 3 iterations of space partitioning. The gray dashed lines show the value of the cost function at its minimum for each iteration. The bottom right subplot illustrates the dependence of W on the number of cuts.

Figure 3 :
Figure 3: One- and two-dimensional distributions of the density function given by Eq. 6. Histograms are constructed using 10^5 i.i.d samples.

Figure 4 :
Figure 4: Summary of the benchmark runs for the target density function given by Eq. 6. The different colors represent runs with different numbers of partitions; the number of subspaces is denoted by S. The vertical axis is common to all subplots and gives the ratio of the total number of samples generated per single run to the number of samples generated when no space partitioning is performed (N_ref = 3.3 · 10^4). Left subplot: the horizontal axis shows the ratio of the time spent on sampling and integration to the time that a single worker spent when no space partitioning is performed (t_ref = 14.5 s). The lines are linear fits to the measurements. Middle subplot: the horizontal axis shows the ratio of the integral to the true value; error bars are obtained from the integration algorithm. Right subplot: the horizontal axis shows the ratio of the effective number of samples (separately for each dimension) to the total number of samples. The effective number of samples is estimated per dimension and the error bars represent the standard deviation across dimensions. Dashed colored lines represent the average fraction over the runs with the same number of space partitions.

Figure 5 :
Figure 5: The figure illustrates the sampling rate (upper subplot) and the per-chain sampling rate (lower subplot) versus the number of space partitions. The gray lines represent the average over 3 runs.

Figure 6 :
Figure 6: The ratio of the integral to the true value (left panel) and the average number of effective MCMC samples (right panel) for different numbers of subspaces and sampling times. In the right panel, the values are referenced to the average effective number of samples generated for one subspace and a running time of 3 seconds: N_ref = 267.

Figure 7 :
Figure 7: Histograms of the p-values from the two-sample Kolmogorov-Smirnov test used to verify whether the MCMC and i.i.d samples come from the same distribution. Different colors represent different numbers of space partitions. One set of MCMC samples results in 9 p-values, one for each dimension.

Figure 8 :
Figure 8: The figure illustrates the results of the classifier two-sample test performed to distinguish i.i.d samples from MCMC samples with space partitioning. The left subplot shows a modified receiver operating characteristic (ROC), where TPR stands for true positive rate and FPR for false positive rate. Different lines correspond to different numbers of subspaces (S). The right subplot shows the area under the ROC curve versus the number of samples.