Statistics and Computing

Volume 26, Issue 1–2, pp 383–392

A statistical test for Nested Sampling algorithms


Abstract

Nested sampling is an iterative integration procedure that shrinks the prior volume towards higher likelihoods by removing one “live” point at a time. A replacement point is drawn uniformly from the prior above an ever-increasing likelihood threshold. Thus, the problem of drawing from a space above a certain likelihood value arises naturally in nested sampling, making algorithms that solve this problem a key ingredient of the nested sampling framework. If the drawn points are distributed uniformly, the removal of a point shrinks the volume in a well-understood way, and the integration of nested sampling is unbiased. In this work, I develop a statistical test to check whether this is the case. This “Shrinkage Test” is useful for verifying nested sampling algorithms in a controlled environment. I apply the Shrinkage Test to a test problem and show that some existing algorithms fail to pass it due to over-optimisation. I then demonstrate that a simple algorithm can be constructed which is robust against this type of problem. This RADFRIENDS algorithm is, however, inefficient in comparison with MULTINEST.

Keywords

Nested sampling · MCMC · Bayesian inference · Evidence · Test · Marginal likelihood

1 Introduction to Nested Sampling

For Bayesian model comparison, the key quantity of interest is the marginal likelihood,
$$\begin{aligned} Z=\int \mathcal{L}(\theta )\cdot p(\theta )\, d\theta . \end{aligned}$$
It is the integral of the likelihood function \(\mathcal{L}\) over a parameter space whose measure is given by the prior. The nested sampling integration framework (Skilling 2004) computes this integral. The strength of nested sampling lies not only in high-dimensional integration: peculiar and multi-modal likelihood function shapes, which pose difficulties for other approaches, can also be handled readily. Nested sampling integrates by tracking how the part of the prior volume that lies above a likelihood threshold shrinks. As with the layers of a Mayan pyramid, the reduction in area at each step, multiplied by the current step height, approximates the total volume inside by summation, regardless of the shape of each layer. The novelty lies in how the shrinking of the prior volume is tracked.

For mathematical simplicity, I will consider the unit hypercube as the (initial) prior volume. Other priors can be mapped using the inverse of the cumulative prior distribution, allowing broad applicability in practice.

For a one-dimensional analogy of the prior shrinkage method of nested sampling, consider the unit interval as the prior volume. If the interval is populated randomly and uniformly by \(N\) points, then the space \(S\) below the lowest point is given by order statistics via the Beta distribution: \(S\sim \mathrm {Beta}(1,\,N)\), i.e. \(p(S)=N\cdot (1-S)^{N-1}\), with the expectation value \(\left\langle S\right\rangle =\left( N+1\right) ^{-1}\).1

If the interval above this lowest point is again filled with \(N\) uniformly distributed points, we are in the same situation as at the start, with the prior volume shrinking at each step by \(\left( N+1\right) ^{-1}\), until it is \(\left( 1-\frac{1}{N+1}\right) {}^{k}\) after \(k\) steps. In this fashion, the size of the prior volume is known on average. For multi-dimensional applicability, what is missing is a unique and sensible definition of the ordering. Nested sampling employs the likelihood function values for this ordering.
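This order-statistics argument can be checked numerically; the following short sketch (not from the paper) simulates the removed space below the lowest of \(N\) uniform points:

```python
import numpy as np

# Monte Carlo check of the 1-d shrinkage argument (illustrative sketch):
# the space below the lowest of N uniform points should follow Beta(1, N),
# with mean 1/(N+1).
rng = np.random.default_rng(42)
N = 400
S = rng.uniform(size=(100000, N)).min(axis=1)  # lowest point in each trial

print(S.mean(), 1.0 / (N + 1))  # both close to ~0.0025
```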

To summarise, the integral \(Z\) is computed by
  1. Randomly drawing \(N\) points from the parameter space. Set \(k=0\).
  2. Identifying the point of lowest likelihood as \(\mathcal{L}_{k}\) and adding its contribution (prior shrinkage volume at this step, times \(\mathcal{L}_{k}\)) to \(Z\):
     $$\begin{aligned} Z\approx \sum _{k=1}^{\infty }\left( 1-\frac{1}{N+1}\right) ^{k-1}\times \frac{1}{N}\times \mathcal{L}_{k} \end{aligned}$$
  3. Replacing this point by a randomly drawn point subject to having a higher likelihood value than \(\mathcal{L}_{k}\). Increment \(k\).
Steps 2 and 3 are repeated. This sum can be bounded by a statistical uncertainty at every iteration step and converges (see Evans 2007; Chopin and Robert 2007; Skilling 2009; Chopin and Robert 2010), so that the iteration can be stopped when the desired accuracy is reached. If the likelihood is defined via slow-to-compute numerical models, as is often the case in the physical sciences, this poses an additional constraint on the number of likelihood evaluations.
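For concreteness, the iteration above can be sketched in a few lines of Python (a minimal illustration, not the reference implementation; draw_prior and draw_constrained are assumed helper functions for steps 1 and 3, and the termination criterion is omitted):

```python
import numpy as np

def nested_sampling(loglike, draw_prior, draw_constrained, N=400, niter=10000):
    """Minimal sketch of the iteration above. draw_prior() returns a uniform
    prior sample; draw_constrained(Lmin, live) returns a prior sample with
    loglike > Lmin (step 3, the hard part addressed in the rest of the paper)."""
    live = [draw_prior() for _ in range(N)]                 # step 1
    logL = np.array([loglike(p) for p in live])
    Z = 0.0
    shrink = 1.0 - 1.0 / (N + 1.0)                          # per-step volume factor
    for k in range(1, niter + 1):
        i = int(np.argmin(logL))                            # step 2: lowest likelihood
        Z += shrink**(k - 1) / N * np.exp(logL[i])          # add its contribution
        live[i] = draw_constrained(logL[i], live)           # step 3: replace it
        logL[i] = loglike(live[i])
    return Z
```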

Nested sampling hinges (step 3) on a method to randomly draw points which exceed a minimal likelihood value. This is known as sampling under a constrained prior, or constrained sampling for short here. This matter is not trivial: with peculiar shapes of the likelihood function, multi-modality or increased dimensionality, the volume of interest is tiny, and difficult to identify and navigate. I explore approaches and sources of error in the following section.

2 Constrained sampling

Constrained sampling, i.e. drawing from the prior but above a likelihood threshold, has been solved in two ways, which I call local steps and region sampling. Both employ the fact that the \(N\) “live” points already lie inside the relevant sub-volume, and only another point with such properties has to be found. Here, I discuss the potential flaws of each method.

The first method, local steps, starts a random walk from such a point. After a number of Metropolis steps, in which points with lower likelihood than required are not visited, a useful, independent prior sample is obtained. This is only the case if enough steps are made, such that the random walk can reach all of the relevant volume. But if the local proposal distribution is concentrated, and few steps are made, only the neighbouring volume of the start point is sampled. A test for detecting such a condition would be to observe the distance between the end point and the existing live points. In a limited number of geometrically simple problems, the distribution of the distance to the nearest neighbour (under uniform sampling) is known, so that a constrained sampling algorithm can be checked for correctness on such a constructed problem. An additional limitation is that distance metrics become less useful in higher dimensions. In practice, I have found that such a test is less sensitive than the one presented below.

Examples of this constrained sampling approach are Markov Chain Monte Carlo (MCMC) with a Gaussian proposal, Hamiltonian Constrained Nested Sampling and its special approximating case Galilean Nested Sampling, and Slice sampling (see Skilling 2004; Betancourt 2011; Skilling 2012; Aitken and Akman 2013, respectively).

The second method for solving constrained sampling, region sampling, is to guess where the permitted region lies, and draw from the prior directly. Such a guess is augmented by the live points, which trace out the likelihood constraint contour. The most well-known algorithm for such an approach is MULTINEST (Feroz and Hobson 2008; Feroz et al. 2009, 2013). Using a clustering algorithm, MULTINEST encapsulates the live points in a number of hyperellipses, and draws only inside these regions. Aside from a long list of successful applications of the MULTINEST algorithm in particle physics, cosmology and astronomy, a single problematic case has been discovered in Beaujean and Caldwell (2013) and analysed in Feroz et al. (2013). Under this perhaps pathological, but physics-motivated likelihood definition, the MULTINEST algorithm consistently gives incorrect results. What then can be sources of such a problem?

When constructing the sampling region, two errors can be made. First, the sampling region may contain space that falls below the likelihood threshold. This results in sampled points that are not useful and have to be rejected, increasing the number of likelihood function evaluations. In high-dimensional problems, volumes grow quickly, such that the fraction of useless points can become prohibitive. In practice, the MULTINEST algorithm becomes inefficient beyond \(\sim 20\) dimensions (Feroz and Hobson 2008). However, contrary to the “local steps” method above, the points obtained are guaranteed to be drawn uniformly from the sampling region by construction.

The second and more severe type of error is the inadvertent exclusion of relevant volume from the constructed sampling region. This under-estimation of the prior space can lead to biased likelihood draws, either to higher or lower values, depending on the problematic situation. To avoid this problem, the sampling region is typically expanded by a constant growth factor. But can such an algorithmic problem be detected, at least in constructed test problems? I present a statistical test, the Shrinkage Test.

3 The shrinkage test

The shrinkage of the prior volume in nested sampling is known: \(1/N\) of the volume is supposed to be removed. If the shrinkage is accelerated by inadvertently missing a sampling region, this is no longer true.

Let us thus construct test problems where the likelihood constraint contour is known for each removed point, as well as the volume contained. If we compute the ratio of volumes at each step, we can compare it to the expectation of
$$\begin{aligned} \left\langle t_{i}\right\rangle =\left\langle \frac{V_{i+1}}{V_{i}}\right\rangle =\frac{N}{N+1}. \end{aligned}$$
Any test problem can be used where the size of the constraint region, \(V_{i}\), can be computed for the current likelihood value. For instance, for a Gaussian likelihood, the geometric volume formula of an ellipsoid is applicable. But the simplest test problem is one where at each likelihood value the contour is a hyper-rectangle. This is the case for the “hyper-pyramid” likelihood function,
$$\begin{aligned} \ln L=-\left( \sup _{i}\left| \frac{x_{i}-\frac{1}{2}}{\sigma _{i}}\right| \right) ^{1/s}. \end{aligned}$$
Here, \(s\) controls the slope of the likelihood and \(\sigma _{i}\) defines the scale in each dimension. In this problem, the contours are given directly by \(L\): with \(r_{0}=(-\ln L)^{s}=\sup _{i}\left| \frac{x_{i}-\frac{1}{2}}{\sigma _{i}}\right| \), the contour spans \(x_{i}\in [\frac{1}{2}-r_{0}\sigma _{i},\,\frac{1}{2}+r_{0}\sigma _{i}]\) in each dimension. The corresponding volume is that of a hyper-rectangle, \(V=(2\cdot r_{0})^{d}\times \prod _{i}\sigma _{i}\).
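A direct transcription of this test likelihood and its contour volume might look as follows (an illustrative sketch; valid as long as the contour lies inside the unit hypercube):

```python
import numpy as np

def hyper_pyramid_loglike(x, sigma, s=100.0):
    """ln L = -(sup_i |(x_i - 1/2) / sigma_i|)^(1/s) on the unit hypercube."""
    sigma = np.asarray(sigma)
    r0 = np.max(np.abs((np.asarray(x) - 0.5) / sigma))
    return -r0 ** (1.0 / s)

def contour_volume(logL, sigma, s=100.0):
    """Volume enclosed by the contour at logL: a hyper-rectangle of half-width
    r0 * sigma_i in each dimension, i.e. V = (2*r0)^d * prod(sigma)."""
    sigma = np.asarray(sigma)
    r0 = (-logL) ** s
    return (2.0 * r0) ** len(sigma) * np.prod(sigma)
```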

The distribution of the volume shrinkage \(t_{i}=\frac{V_{i+1}}{V_{i}}\) is given by \(p(t;\, N)\propto t^{N-1}\), i.e. a Beta distribution with shape parameters \(\alpha =N\) and \(\beta =1\). Its cumulative distribution is thus simply \(t^{N}\). This function is cornered at \(t\approx 1\) for reasonable values of \(N\) (\(\sim 400\)). For nicer visualisation, let us consider the border that is being cut away: \(S=1-t^{1/d}\). The expected cumulative distribution of \(S\) is then \(p(<S)=1-(1-S)^{d\cdot N}\).

To test conformity with uniform sampling, the constrained sampling algorithm is applied for many iterations (e.g. 10,000). Using the sequence of removed points, the removed border \(S\) is computed and compared to the expected cumulative distribution. The frequency of deviations between the theoretical and obtained distributions can be assessed visually. As the number of samples can be made large, discrepancies become clear. For a quantitative distance measure, e.g. the Kolmogorov-Smirnov (KS) test can be applied.
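A sketch of this procedure, assuming the contour volumes of the removed points have been recorded, using scipy's one-sample KS test against the expected cumulative distribution:

```python
import numpy as np
from scipy.stats import kstest

def shrinkage_test(volumes, N, d):
    """Shrinkage Test sketch: volumes[i] is the contour volume V_i enclosed by the
    likelihood of the i-th removed point. The shrinkage border S = 1 - (V_{i+1}/V_i)^(1/d)
    is compared against its expected CDF, p(<S) = 1 - (1 - S)^(d*N)."""
    V = np.asarray(volumes, dtype=float)
    t = V[1:] / V[:-1]                            # per-iteration shrinkage ratio
    S = 1.0 - t ** (1.0 / d)                      # border cut away per iteration
    expected_cdf = lambda s: 1.0 - (1.0 - s) ** (d * N)
    return kstest(S, expected_cdf)                # KS statistic and p-value

# e.g. shrinkage_test(volumes, N=400, d=20)
```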

When applying the test in this work, I will use \(s=100\) and \(\sigma _{i}=1\) (hyper-cube contours). However, this test can simulate a wide variety of shapes including problems with multiple scales (e.g. with \(\sigma _{i}=10^{-3i/d}\)), or Gaussian likelihoods where the contours are hyper-ellipses. The case of multiple modes can also be considered. It should be stressed that the dimensionality of the test can be chosen, and varied to analyse the algorithm of interest.

4 Application of the shrinkage test

Let us now verify whether the MULTINEST algorithm, with commonly used parameters, passes the Shrinkage Test. Other algorithms are considered later in Sect. 7. I use version 3.4 of the MULTINEST library (Feroz and Hobson 2008; Feroz et al. 2009). I set the sampling efficiency to \(30\,\%\) and the maximum number of modes to 100. I use two configurations, with 400 and 1,000 live points, and without importance nested sampling (see Feroz et al. 2013).

The Shrinkage Test is applied using the hyper-pyramid likelihood (see the previous section). I consider 2, 7 and 20 dimensions, and run nested sampling with a tiny tolerance to avoid premature termination. In each of the first 10,000 iterations the newly sampled point is stored. Using a number of such sequences, I compute the empirical distribution of the shrinkage \(S\) and plot it against the theoretical distribution. This is shown in Fig. 1 for the 20-dimensional case. I find that in 2 dimensions the distributions match, but in 7 and 20 dimensions the shrinkage \(S\) tends to lie at higher values. This indicates that too much space is being cut away. This test thus shows, by the discrepancy between the theoretical and actual shrinkage of the prior volume, that the MULTINEST algorithm under-estimates the volume for this test problem and samples from a smaller region. We have thus identified a potential source of error relevant also for the problem of Beaujean and Caldwell (2013).
Fig. 1

Shrinkage test results. The MULTINEST algorithm running in 20 dimensions is analysed. The panels show the distribution of the shrinkage border (histogram in the top panel, cumulative distribution in the bottom panel). The observed distribution (black) is shifted to higher values compared to the theoretical distribution (red). This indicates that too much space is being cut away. The vertical lines indicate the means of the distributions. (Color figure online)

5 Robustness against accelerated shrinking

Can we then devise a rejection algorithm that does not suffer from the problem of shrinking too quickly? Here I present an approach that gives some correctness guarantees, but does not emphasise efficiency, particularly in high dimensions. I again exploit the live points, but also use the property that they are already uniformly distributed. The next point ought to be in their neighbourhood too, where the neighbourhood is defined as having at most distance \(R\) to a live point (this defines the sampling region). In particular, the method should be robust in the sense that every live point could be sampled if it were not already known. An initial idea is to leave each point out in turn, compute the distance to its nearest neighbour, and use the maximum of this quantity as \(R\). Such a jackknife scheme is quite robust, as all points are then closer than \(R\) to a live point. However, had the point providing the maximum \(R\) not been in the sample, it could not be obtained. I thus go further and employ a bootstrapping-like method, which I describe now in detail.

6 The RADFRIENDS algorithm

The RadFriends constrained sampling algorithm has to sample a new live point subject to the constraint that it has a higher likelihood value than \(L_{\text {min}}\). It proceeds as laid out in the draw_constrained procedure in Listing 1. The compute_R procedure computes the aforementioned \(R\), the largest distance to a neighbour. Here, a bootstrap-like procedure is employed to generate a conservative estimate of \(R\) by repeatedly leaving points out and ensuring that they could still be sampled. This distance \(R\) is then used to define the region around the live points to sample from.
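The bootstrap idea behind compute_R might be sketched as follows (an illustration of the description above, not the published listing; the supremum-norm variant would swap the distance function):

```python
import numpy as np

def compute_R(live_points, nbootstraps=50, rng=None):
    """Conservative neighbourhood radius R: in each round, draw a bootstrap
    sample of the live points and require that every point left out of that
    sample lies within R of some selected point. R is the maximum such
    distance over all rounds."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(live_points)
    n = len(pts)
    R = 0.0
    for _ in range(nbootstraps):
        chosen = np.unique(rng.integers(n, size=n))      # bootstrap sample indices
        left_out = np.setdiff1d(np.arange(n), chosen)
        if len(left_out) == 0:
            continue
        # Euclidean (L2) distance from each left-out point to its nearest
        # selected point; SupFriends would use the supremum norm instead.
        diff = pts[left_out][:, None, :] - pts[chosen][None, :, :]
        dists = np.linalg.norm(diff, axis=2)
        R = max(R, float(dists.min(axis=1).max()))
    return R
```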

The sampling procedure draw_near can then be implemented in two ways, which are equivalent with regard to the number of likelihood evaluations and the properties of the generated samples. Both are shown in Algorithm 2. The simpler method is to sample a random point from the prior and check whether it is within distance \(R\) of at least one live point. If not, the procedure is repeated. The second method is to choose a random live point and to generate a random point that fulfils the distance criterion by construction (see caption of Algorithm 2). The so-generated point must then be accepted only with probability \(1/m\), where \(m\) is the number of live points within distance \(R\), to avoid a preference for clustered regions. The second method is more efficient than the first if the remaining volume is small, as otherwise many points are rejected.
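The first, simpler variant could look like this (an illustrative sketch, assuming a unit-hypercube prior and the compute_R sketch above):

```python
import numpy as np

def draw_constrained(loglike, Lmin, live_points, R, rng=None):
    """Rejection variant of draw_near: draw uniformly from the unit-cube prior,
    keep only points within distance R of at least one live point, and return
    the first such point whose likelihood exceeds Lmin."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(live_points)
    d = pts.shape[1]
    while True:
        u = rng.uniform(size=d)                            # uniform prior draw
        if np.linalg.norm(pts - u, axis=1).min() <= R:     # inside the sampling region?
            if loglike(u) > Lmin:                          # likelihood constraint
                return u
```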

The remaining choice is which norm to use to define the distance. Here I consider the Euclidean (\(L_{2}\)) norm \(\left\| x\right\| \), and the supremum (\(L_{\infty }\)) norm \(\sup \left| x\right| \) (see Listing 2). I term the variant of RadFriends that uses the supremum norm SupFriends.

6.1 Analysis of the emergent properties

Figure 2 illustrates the behaviour of the constructed sampling region for live points sampled from various likelihood contours (green), one per column. The algorithm adapts its sampling region (red and orange contours for the Euclidean and supremum norm, respectively) to the existing points. Increasing the number of live points tightens the sampling region. It can also be observed that when one live point is far away from the others, the sampling region is large; when they are close together, it tightens.
Fig. 2

Examples of the sampling regions for the RADFRIENDS algorithm, after employing the compute_distance procedure. The blue crosses indicate the live points used for each test case, which are drawn uniformly from the (in practice unknown) likelihood constraint region (green circular lines). The sampling region used by draw_constrained is shown for a Euclidean norm (red line) and a supremum norm (orange). From top to bottom, the number of live points has been increased (50, 100, 200 samples). A general trend of narrowing can be observed. These examples highlight how the algorithm adapts to the peculiar shape of the region of interest (e.g. second and right-most panel), and can handle multiple modes (third to sixth panel) without any assumption on the shape. (Color figure online)

One curious choice in the algorithm is the number of bootstrap iterations (given as 50). It was chosen as follows. The probability of not using a specific live point in a given bootstrap iteration is
$$\begin{aligned} p_{1}=\left( 1-\frac{1}{N}\right) ^{N}\approx 37\% \,\,\hbox {for } N>50. \end{aligned}$$
The probability of having used one particular point in every one of the \(m\) iterations, i.e. never having left it out, is
$$\begin{aligned} p_{L}=\left( 1-p_{1}\right) ^{m}. \end{aligned}$$
The probability that any of the \(N\) points was used in every iteration is at most \(N\) times higher. Here I neglect the correction for this occurring for more than one point, which leads to the upper bound
$$\begin{aligned} p_{L,all}<\left( 1-p_{1}\right) ^{m}\times N. \end{aligned}$$
This event should be rare, such that it is not expected more than once in a whole nested sampling run of e.g. \(10^{6}\) iterations. For \(N=100,\,1000,\,10000\), \(p_{L,all}\) reaches the value \(10^{-6}\) at
$$\begin{aligned} m=\frac{\ln \, p_{L,all}-\ln \, N}{\ln \,(1-p_{1})}=39.8,\,44.9,\,49.8. \end{aligned}$$
Thus, the conservative choice of 50 iterations is justified.
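These numbers can be reproduced with a few lines (a sketch using \(p_{1}\approx 37\,\%\) as above):

```python
import numpy as np

# Number of bootstrap rounds m needed so that the probability of never
# leaving out some live point, (1 - p1)^m * N, drops below ~1e-6 per run.
p1 = 0.37
for N in (100, 1000, 10000):
    m = (np.log(1e-6) - np.log(N)) / np.log(1.0 - p1)
    print(N, m)   # roughly 40, 45 and 50, in line with the values quoted above
```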

Figure 2 already demonstrates that this algorithm can handle multiple modes out of the box, as clustering of points is an emergent feature. This yields efficient sampling if and only if the region in between the modes is excluded. When is this the case? Consider a small cluster with \(k\) points, well separated from the other live points. It will be treated as a separate cluster if at least one of its members is selected in every bootstrapping round. Leaving out all \(k\) points simultaneously has probability \(p_{k,all}=p_{1}^{k}\times m\). For \(m=50\) and \(k=10,\,20,\,40\), this probability is \(p_{k,all}=0.5,\,0.005,\,5\times 10^{-7}\). In words, one can expect efficient sampling of the sub-cluster if it contains more than 20 points. However, this means that for a problem with e.g. 20 well-separated modes, \(20\times 40=800\) live points are needed to safely avoid inefficient sampling between the modes.

7 Shrinkage test results

Now it is interesting to see whether the RADFRIENDS algorithm passes the Shrinkage Test constructed in Sect. 3. Additionally, I report the performance of a number of other algorithms: plain rejection sampling, MULTINEST, MULTINEST with importance nested sampling, and MCMC. For constrained sampling using MCMC, I employ a symmetric Gaussian proposal distribution with initial standard deviation \(0.1\) and test 10, 20 and 50 proposal steps. As the scales shrink, an adaptive rule has to be used for the scale of the proposal distribution. I use the update recipe described in Sivia and Skilling (2006),
$$\begin{aligned} \sigma '=\sigma \cdot \exp \left( {\left\{ \begin{array}{ll} +1/n_{\text {accepts}} &{} \text {if }n_{\text {accepts}}>n_{\text {rejects}}\\ -1/n_{\text {rejects}} &{} \text {if }n_{\text {accepts}}<n_{\text {rejects}} \end{array}\right. }\right) \end{aligned}$$
For comparison, I use another MCMC algorithm with a fixed Gaussian proposal distribution of standard deviation \(10^{-5}\), but 200 steps.
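In code, the adaptation rule above might read (an illustrative sketch):

```python
import math

def adapt_scale(sigma, n_accepts, n_rejects):
    """Sivia & Skilling (2006) update of the Gaussian proposal width: grow it
    when more steps were accepted than rejected, shrink it otherwise."""
    if n_accepts > n_rejects:
        return sigma * math.exp(1.0 / n_accepts)
    if n_accepts < n_rejects:
        return sigma * math.exp(-1.0 / n_rejects)
    return sigma
```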
The results are listed in Table 1. The MCMC algorithm with a tiny, fixed proposal (“mcmc-gauss-scale-5”) fails the Shrinkage Test, as expected: it samples too close to the existing live points (where it starts), and thus the shrinkage is incorrect. In contrast, the MCMC proposal with an adaptive rule successfully passes the distance distribution test. For the 7 and 20-dimensional cases, the p-values of either test attain low values when only 10 or 20 steps are used. Although p-values can be cumbersome to interpret, it is sensible to use at least 50 MCMC steps in the exploration, which yields an efficiency of 2 %.
Table 1

Results of the shrinkage test using the hyper-pyramid likelihood function. The p value of the KS test indicates the expected frequency of the result (values below 0.05 are indicated with a star)

Algorithm            Dim   p_shrinkage   Iterations   Evaluations   Efficiency
-------------------  ---   -----------   ----------   -----------   ----------
Rejection              2        0.7324       32,000    71,092,909        0.05%
Multinest              2       *0.0474       80,000       256,411       31.20%
Radfriends             2        0.9105       80,000       132,026       60.59%
Supfriends             2        0.5321       80,000       131,505       60.83%
Mcmc-gauss-50-adapt    2        0.1961       80,000     4,000,000        2.00%
Mcmc-gauss-20-adapt    2        0.1566       80,000     1,600,000        5.00%
Mcmc-gauss-10-adapt    2        0.0732       80,000       800,000       10.00%
Mcmc-gauss-scale-5     2       *0.0000       80,000    16,000,000        0.50%
Rejection              7        0.5707       32,000    74,035,891        0.04%
Multinest              7       *0.0000       80,000       393,575       20.33%
Radfriends             7        0.2651       80,000     2,711,519        2.95%
Supfriends             7        0.0965       80,000     3,483,200        2.30%
Mcmc-gauss-50-adapt    7        0.3643       80,000     4,000,000        2.00%
Mcmc-gauss-20-adapt    7       *0.0273       80,000     1,600,000        5.00%
Mcmc-gauss-10-adapt    7       *0.0000       80,000       800,000       10.00%
Mcmc-gauss-scale-5     7       *0.0000       80,000    16,000,000        0.50%
Rejection             20        0.5183       32,000    65,401,030        0.05%
Multinest             20       *0.0000       32,000       499,209        6.41%
Radfriends            20        0.2954       32,000    26,129,495        0.12%
Supfriends            20        0.6573       32,000    39,067,739        0.08%
Mcmc-gauss-50-adapt   20        0.8785       32,000     1,600,000        2.00%
Mcmc-gauss-20-adapt   20        0.4475       32,000       640,000        5.00%
Mcmc-gauss-10-adapt   20       *0.0000       32,000       320,000       10.00%

In each algorithm, 400 live points were used. The rejection sampling is run for fewer iterations as its efficiency drops rapidly. For MCMC exploration, the number in the algorithm name indicates the number of proposal steps used (10, 20 or 50)

In 7 and 20 dimensions, the shrinkage distribution of the MULTINEST algorithm shows deviations, as remarked before and shown in Fig. 1. For comparison, the rejection sampling and the RADFRIENDS/SUPFRIENDS algorithms (SUPFRIENDS shown in Fig. 3) yield the correct distribution.
Fig. 3

Same as Figure 1 but for the SupFriends algorithm in 20 dimensions. Here, the distributions are in agreement

Table 1 also shows that the MULTINEST algorithm is highly efficient. In typical applications, the MULTINEST algorithm uses one to two orders of magnitude fewer likelihood evaluations than the RADFRIENDS/SUPFRIENDS algorithms.

8 Test problems

In this section, I analyse the correctness and efficiency of the RADFRIENDS algorithm numerically. A number of common integration test problems have been verified; for brevity, only two are presented here, which best expose the advantages and disadvantages. For comparison, I include results from MULTINEST with and without Importance Nested Sampling (Feroz et al. 2013). I run each algorithm 10 times and record the average integral value \(\hat{Z}\), the actual variance of this estimator, \(A^{2}\), and the average statistical uncertainty reported, \(C\).

8.1 Eggbox problem

The eggbox problem is adapted from Feroz et al. (2009). It is only two-dimensional, but contains 18 distinct peaks, posing extreme multi-modality. The likelihood, visualised in Fig. 4 (left panel), can be defined on a unit square as
$$\begin{aligned} \ln \, L=\left( 2+\cos (5\pi \cdot x_{1})\cdot \cos (5\pi \cdot x_{2})\right) ^{5} \end{aligned}$$
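In code, this reads (a direct transcription of the definition above, on the unit square):

```python
import numpy as np

def eggbox_loglike(x):
    """Eggbox test likelihood on the unit square:
    ln L = (2 + cos(5*pi*x1) * cos(5*pi*x2))**5."""
    return (2.0 + np.cos(5.0 * np.pi * x[0]) * np.cos(5.0 * np.pi * x[1])) ** 5
```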
Fig. 4

Visualisation of the considered problems in the first two coordinates, using arbitrarily chosen contours (blue lowest, red highest). Both the eggbox problem (left panel) and the LogGamma problem (right panel) show multi-modality. For the latter, the contours are asymmetric. In higher dimensions, the LogGamma problem is extended with independent Normal and LogGamma distributions in alternation (Color figure online)

Results are shown in Fig. 5. Both MULTINEST and RADFRIENDS integrate this problem successfully. As anticipated in Sect. 6.1, RADFRIENDS can separate out the modes when a higher number of live points is used, making it more efficient. MULTINEST uses the lowest number of likelihood evaluations.
Fig. 5

Performance results for the eggbox problem. Each algorithm is listed with the mean \(\ln \, Z\) indicated as a point. For the uncertainties, the uncertainty of the estimate computed by the algorithm is shown (black error bars). Grey error bars show the estimators’ actual scatter around the true value \(\ln \, Z_{true}=235.88\) (vertical red dashed line). For each algorithm the total number of likelihood function evaluations is listed. Here, all algorithms give the correct answer. The RadFriends algorithm with 400 live points yields a much lower efficiency than when using 1,000 live points. This is due to the many modes not being separated (see Sect. 6.1). The most efficient algorithm is MultiNest with 400 live points (1–2 orders of magnitude faster than RadFriends). (Color figure online)

8.2 LogGamma problem

This problem is adapted from Beaujean and Caldwell (2013) and acknowledged to be problematic by the MULTINEST authors (Feroz et al. 2013). A combination of LogGamma and Gaussian distributions is considered, defining the likelihood \(L\) as
$$\begin{aligned} g_{a}&\sim \mathrm {LogGamma}\left( 1,\,\frac{1}{3},\,\frac{1}{30}\right) \\ g_{b}&\sim \mathrm {LogGamma}\left( 1,\,\frac{2}{3},\,\frac{1}{30}\right) \\ n_{c}&\sim \mathrm {Normal}\left( \frac{1}{3},\,\frac{1}{30}\right) \\ n_{d}&\sim \mathrm {Normal}\left( \frac{2}{3},\,\frac{1}{30}\right) \\ d_{i}&\sim \mathrm {LogGamma}\left( 1,\,\frac{2}{3},\,\frac{1}{30}\right) \,\,\,\ \hbox {if}\,\,\, 3\le i\le \frac{d+2}{2}\\ d_{i}&\sim \mathrm {Normal}\left( \frac{2}{3},\,\frac{1}{30}\right) \quad \hbox {if}\,\,\,\frac{d+2}{2}<i\\ L_{1}&= \frac{1}{2}\left( g_{a}(x_{1})+g_{b}(x_{1})\right) \\ L_{2}&= \frac{1}{2}\left( n_{c}(x_{2})+n_{d}(x_{2})\right) \\ L&= L_{1}\times L_{2}\times \prod _{i=3}^{d}d_{i}(x_{i}). \end{aligned}$$
The dimensionality of the problem is denoted by \(d\). We consider the cases \(d=2\) and \(d=10\) here. This problem combines well-separated peaks with asymmetric, heavy-tailed distributions, as shown in Fig. 4. The true integral value is \(\ln \, Z_{\text {true}}=0\).
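A sketch of this test likelihood using scipy's distributions (following the definition above; stats.loggamma takes a shape parameter, here 1, plus location and scale):

```python
import numpy as np
from scipy import stats

def make_loggamma_loglike(d=10):
    """LogGamma test problem in d dimensions (sketch of the definition above).
    Each factor is a normalised density, so the true evidence is ln Z = 0."""
    ga = stats.loggamma(1.0, loc=1.0 / 3, scale=1.0 / 30)
    gb = stats.loggamma(1.0, loc=2.0 / 3, scale=1.0 / 30)
    nc = stats.norm(loc=1.0 / 3, scale=1.0 / 30)
    nd = stats.norm(loc=2.0 / 3, scale=1.0 / 30)
    # dimensions 3..d: LogGamma for 3 <= i <= (d+2)/2, Normal otherwise
    rest = [gb if 3 <= i <= (d + 2) / 2 else nd for i in range(3, d + 1)]

    def loglike(x):
        L1 = 0.5 * (ga.pdf(x[0]) + gb.pdf(x[0]))
        L2 = 0.5 * (nc.pdf(x[1]) + nd.pdf(x[1]))
        L_rest = np.prod([dist.pdf(xi) for dist, xi in zip(rest, x[2:d])])
        return np.log(L1 * L2 * L_rest)

    return loglike
```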
The results are shown in Fig. 6 for the two-dimensional case and in Fig. 7 for ten dimensions. The two-dimensional problem is solved correctly (i.e. within the uncertainties) by all algorithms. However, the Importance Nested Sampling of MULTINEST claims a higher accuracy (by a factor of \(\sim 5\)) than actually achieved. This effect has been noted before in Feroz et al. (2013).
Fig. 6

Performance results for the LogGamma problem in 2 dimensions. Each algorithm is listed with the mean \(\ln \, Z\) indicated as a point. For the uncertainties, the uncertainty of the estimate computed by the algorithm is shown (black error bars). Grey error bars show the estimators’ actual scatter around the true value \(\ln \, Z_{true}=0\) (vertical red dashed line). For each algorithm the total number of likelihood function evaluations are listed. All algorithms give correct results. However, MULTINEST with Importance Nested Sampling claims a much smaller uncertainty than actually achieved, excluding the true value. (Color figure online)

Fig. 7

As Fig. 6, but in 10 dimensions. Here, MULTINEST overestimates the evidence. Enabling Importance Nested Sampling reduces the over-estimate. The RadFriends/SupFriends algorithms yield the correct results and do not show any bias. Using the supremum norm requires an order of magnitude more evaluations

The 10-dimensional problem demonstrates what happens when the algorithms begin to break down. Without Importance Nested Sampling, the computation terminates, but the resulting integral value is over-estimated. With Importance Nested Sampling enabled, MULTINEST mitigates the over-estimation to a sufficient degree. Both RADFRIENDS and SUPFRIENDS compute the evidence correctly, which shows that this problem can be solved by standard nested sampling. SUPFRIENDS requires an order of magnitude more evaluations than RADFRIENDS, which indicates that the choice of norm has a strong influence for problems of higher dimensionality.

9 Conclusions

I have presented a brief overview of algorithms for sampling under a constrained prior, which are a key ingredient in nested sampling, and employed to compute integrals for high-dimensional model comparison. I have explored the sources of errors in such algorithms and devised a test to uncover sampling errors.

The Shrinkage Test uncovers algorithms that violate the expectation of nested sampling in how the prior volume shrinks. Such problematic algorithms accelerate the shrinking, leaving out relevant parameter space, which leads to incorrect computation of the integral.

Although the Shrinkage Test is limited to likelihood functions with geometrically simple, well-understood contours (such as Gaussian likelihoods, or the hyper-pyramid used here), it can be used to verify correctness on high-dimensional problems, multi-modal likelihoods, and shapes with multiple scale lengths. It is thus capable of simulating a wide range of situations that occur in practice.

I apply the Shrinkage Test to the popular MULTINEST algorithm and find that it fails in the 7 and 20-dimensional cases. This indicates that, in the studied case, relevant prior volume is left out. This type of error may also be the source of the failure to integrate the LogGamma problem correctly.

I then present an algorithm termed RADFRIENDS, which is constructed to be robust against this type of problem. Studying the properties, I find that RADFRIENDS
  1. passes the Shrinkage Test,
  2. solves the LogGamma problem and others correctly, and
  3. can handle multi-modal problems and peculiar shapes without tuning parameters or additional input information.
However, this algorithm is one to two orders of magnitude less efficient than MULTINEST in terms of the number of likelihood evaluations. It suffers from the curse of dimensionality and is thus not useful for \(>10\) dimensions, save for verifying test problems with fast-to-compute likelihoods. For low-dimensional problems, however, it can compete with MULTINEST.

The presented algorithm is simple to implement and can be understood analytically. It is thus proposed as a safe, easy-to-implement baseline algorithm for low-dimensional problems.

In a similar spirit, the method of Mukherjee et al. (2006) and the MULTINEST algorithm could be made more robust: we suggest leaving a fraction of the live points out when constructing the ellipsoids, which should then be expanded until the left-out live points are included. This can be repeated a few times to obtain a robust ellipsoid expansion factor on-line.

10 Future work

The region sampling type of constrained sampling algorithms, which constructs a sampling region from the live points, requires further study, especially in the high-dimensional regime. For instance, machine learning algorithms, such as Support Vector Machines, may be useful to learn the border between live points and already discarded points. Improvements and further studies of the simple RADFRIENDS algorithm are also left to future work. For example, applying Importance Nested Sampling (Cameron and Pettitt 2013) in RADFRIENDS is directly analogous to how it was developed for MULTINEST in Feroz et al. (2013). The study of the impact of the distance measure, and alternative norms may also be useful for higher dimensional problems.

The option of combining region sampling and local step methods into hybrid algorithms should be explored to combine their respective strengths. For instance, the permissible region from RADFRIENDS may be used to restrict the proposal distribution of Markov Chain Monte Carlo, or its hyper-spheres may be used as reflection surfaces for Galilean Monte Carlo. The scale of the region (\(R\)) can also be used to tune the step size. Such a RadFriends/MCMC hybrid method written in C, named UltraNest, is available at http://johannesbuchner.github.io/nested-sampling/UltraNest/. A framework for developing and testing nested sampling algorithms in Python is available at http://johannesbuchner.github.io/nested-sampling/, for which contributions are welcome. A reference implementation of RADFRIENDS can also be found there.

Footnotes

  1. Skilling (2004) uses the estimator \(\left\langle \ln S\right\rangle =-1/N\), which is better behaved at small \(N\). For this introduction the simpler, intuitive formula is sufficient.


Acknowledgments

I would like to thank Frederik Beaujean and Udo von Toussaint for reading the initial manuscript. I acknowledge funding through a doctoral stipend by the Max Planck Society. This manuscript has greatly benefited from the comments of the two anonymous referees, whom I would also like to thank. I acknowledge financial support through a Max Planck society stipend.

References

  1. Aitken, S., Akman, O.E.: Nested sampling for parameter inference in systems biology: application to an exemplar circadian model. BMC Syst. Biol. 7(1), 72 (2013)
  2. Beaujean, F., Caldwell, A.: Initializing adaptive importance sampling with Markov chains. ArXiv e-prints (2013)
  3. Betancourt, M.: Nested sampling with constrained Hamiltonian Monte Carlo. In: Mohammad-Djafari, A., Bercher, J.-F., Bessière, P. (eds.) American Institute of Physics Conference Series, vol. 1305, pp. 165–172 (2011)
  4. Cameron, E., Pettitt, A.: Recursive pathways to marginal likelihood estimation with prior-sensitivity analysis. ArXiv e-prints (2013)
  5. Chopin, N., Robert, C.: Comments on nested sampling by John Skilling. Bayesian Stat. 8, 491–524 (2007)
  6. Chopin, N., Robert, C.P.: Properties of nested sampling. Biometrika (2010)
  7. Evans, M.: Discussion of nested sampling for Bayesian computations by John Skilling. Bayesian Stat. 8, 491–524 (2007)
  8. Feroz, F., Hobson, M.P.: Multimodal nested sampling: an efficient and robust alternative to Markov Chain Monte Carlo methods for astronomical data analyses. MNRAS 384, 449–463 (2008)
  9. Feroz, F., Hobson, M.P., Bridges, M.: MULTINEST: an efficient and robust Bayesian inference tool for cosmology and particle physics. MNRAS 398, 1601–1614 (2009)
  10. Feroz, F., Hobson, M.P., Cameron, E., Pettitt, A.N.: Importance nested sampling and the MultiNest algorithm. ArXiv e-prints (2013)
  11. Mukherjee, P., Parkinson, D., Liddle, A.R.: A nested sampling algorithm for cosmological model selection. ApJ 638, L51–L54 (2006)
  12. Sivia, D., Skilling, J.: Data Analysis: A Bayesian Tutorial. Oxford Science Publications. Oxford University Press, Oxford (2006)
  13. Skilling, J.: Nested sampling. In: AIP Conference Proceedings, vol. 735, p. 395 (2004)
  14. Skilling, J.: Nested sampling's convergence. In: Bayesian Inference and Maximum Entropy Methods in Science and Engineering: The 29th International Workshop, AIP Conference Proceedings, vol. 1193, pp. 277–291. AIP Publishing, New York (2009)
  15. Skilling, J.: Bayesian computation in big spaces: nested sampling and Galilean Monte Carlo. AIP Conf. Proc. 1443(1), 145–156 (2012)

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Max Planck Institut für Extraterrestrische Physik, Garching, Germany
