# A statistical test for Nested Sampling algorithms


## Abstract

Nested sampling is an iterative integration procedure that shrinks the prior volume towards higher likelihoods by removing a “live” point at a time. A replacement point is drawn uniformly from the prior above an ever-increasing likelihood threshold. Thus, the problem of drawing from a space above a certain likelihood value arises naturally in nested sampling, making algorithms that solve this problem a key ingredient to the nested sampling framework. If the drawn points are distributed uniformly, the removal of a point shrinks the volume in a well-understood way, and the integration of nested sampling is unbiased. In this work, I develop a statistical test to check whether this is the case. This “Shrinkage Test” is useful to verify nested sampling algorithms in a controlled environment. I apply the shrinkage test to a test-problem, and show that some existing algorithms fail to pass it due to over-optimisation. I then demonstrate that a simple algorithm can be constructed which is robust against this type of problem. This RADFRIENDS algorithm is, however, inefficient in comparison to MULTINEST.

### Keywords

Nested sampling · MCMC · Bayesian inference · Evidence · Test · Marginal likelihood

## 1 Introduction to Nested Sampling

For mathematical simplicity, I will consider the unit hypercube as the (initial) prior volume. Other priors can be mapped using the inverse of the cumulative prior distribution, allowing broad applicability in practice.
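
As an illustration, such a mapping takes each unit-cube coordinate through the inverse cumulative distribution of its prior. A minimal Python sketch (the particular priors chosen here are my own illustrative assumption, not from the text):

```python
from statistics import NormalDist

def prior_transform(u):
    """Map a unit-hypercube point u to the physical parameter space.

    Illustrative assumption: parameter 0 has a Gaussian prior N(0, 1),
    parameter 1 a uniform prior on [-5, 5].
    """
    return [NormalDist(0.0, 1.0).inv_cdf(u[0]),
            -5.0 + 10.0 * u[1]]
```

Here `prior_transform([0.5, 0.5])` returns the prior medians, `[0.0, 0.0]`.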

For a one-dimensional analogy of the prior shrinkage method of nested sampling, consider the unit interval as the prior volume. If the interval is populated randomly and uniformly by \(N\) points, then the space \(S\) below the lowest point is given by the order statistics of the minimum of \(N\) draws via the beta distribution: \(S\sim \mathrm {Beta}(1,\,N)\), or \(p(S)=N\cdot (1-S)^{N-1}\), with the expectation value \(\left\langle S\right\rangle =\left( N+1\right) ^{-1}\).
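
This order-statistics result is easy to verify numerically; a small Monte Carlo sketch in Python (the values of \(N\) and the number of trials are illustrative):

```python
import random

def mean_lowest_of(N, trials=200000, seed=0):
    """Average position of the lowest of N uniform points on [0, 1].

    Order statistics predict an expectation of 1/(N+1) for this minimum.
    """
    rng = random.Random(seed)
    return sum(min(rng.random() for _ in range(N))
               for _ in range(trials)) / trials
```

For \(N=10\), this yields approximately \(1/11\approx 0.0909\).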

If the interval above this lowest point is again filled with \(N\) uniformly distributed points, we are in the same situation as at the start, with the prior volume shrinking at each step by \(\left( N+1\right) ^{-1}\), until it is \(\left( 1-\frac{1}{N+1}\right) {}^{k}\) after \(k\) steps. In this fashion, the size of the prior volume is known on average. For multi-dimensional applicability, what is missing is a unique and sensible definition of the ordering. Nested sampling employs the likelihood function values for this ordering. The procedure is then:

- 1.
Randomly draw \(N\) points from the parameter space. Set \(k=1\).

- 2.
Identify the point of lowest likelihood as \(\mathcal{L}_{k}\) and add its contribution (the prior volume removed at this step, times \(\mathcal{L}_{k}\)) to \(Z\):
$$\begin{aligned} Z\approx \sum _{k=1}^{\infty }\left( 1-\frac{1}{N+1}\right) {}^{k-1}\times \frac{1}{N}\times \mathcal{L}_{k} \end{aligned}$$

- 3.
Replace this point by a randomly drawn point subject to having a higher likelihood value than \(\mathcal{L}_{k}\). Increment \(k\) and repeat from step 2.
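
This loop can be sketched in Python for a one-dimensional toy problem with a known answer, \(Z=\int _{0}^{1}e^{-x}\mathrm {d}x=1-e^{-1}\approx 0.632\). The sketch below is my own illustration, not code from the paper; the constrained draw is exact here, because the permitted region is simply \((0,\,x_{\text{worst}})\):

```python
import math
import random

def nested_sampling_toy(N=100, n_iter=1000, seed=42):
    """Toy nested sampling run with likelihood L(x) = exp(-x), x ~ U(0,1)."""
    rng = random.Random(seed)
    live = [rng.random() for _ in range(N)]   # step 1: N prior draws
    Z, X = 0.0, 1.0                           # X: remaining prior volume
    t = N / (N + 1.0)                         # mean shrinkage factor per step
    for _ in range(n_iter):
        worst = max(live)                     # step 2: lowest L (largest x)
        L_k = math.exp(-worst)
        Z += (X - X * t) * L_k                # removed volume times L_k
        X *= t
        live.remove(worst)
        # step 3: draw uniformly from the prior above L_k,
        # which here is exactly uniform on (0, worst)
        live.append(rng.uniform(0.0, worst))
    # add the contribution of the remaining live points
    Z += X * sum(math.exp(-x) for x in live) / N
    return Z
```

With the defaults above, the estimate lands within a few per cent of the analytic value.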

Nested sampling hinges (step 3) on a method to randomly draw points which exceed a minimal likelihood value. This is known as sampling under a constrained prior, or constrained sampling for short here. This matter is not trivial: with peculiar shapes of the likelihood function, multi-modality or increased dimensionality, the volume of interest is tiny, and difficult to identify and navigate. I explore approaches and sources of errors in the following section.

## 2 Constrained sampling

Constrained sampling, i.e. drawing from the prior but above a likelihood threshold, has been solved in two ways, which I call *local steps* and *region sampling*. Both employ the fact that the \(N\) “live” points already lie inside the relevant sub-volume, and only another point with such properties has to be found. Here, I discuss the potential flaws of each method.

The first method, *local steps*, starts a random walk from such a point. After a number of Metropolis steps, in which points with lower likelihood than required are not visited, a usable independent prior sample is obtained. This is only the case if enough steps are made, such that the random walk can reach all of the relevant volume. If the local proposal distribution is concentrated and few steps are made, only the volume neighbouring the start point is sampled. A test for detecting such a condition would be to observe the distance between the end point and the existing live points. In a limited number of geometrically simple problems, the distribution of the distance to the nearest neighbour (under uniform sampling) is known, such that a constrained sampling algorithm can be checked for correctness on such a constructed problem. An additional limitation is that distance metrics become less useful in higher dimensions. In practice, I have found that such a test is less sensitive than the one presented below.

Examples of this constrained sampling approach are Markov Chain Monte Carlo (MCMC) with a Gaussian proposal, Hamiltonian Constrained Nested Sampling and its special approximating case Galilean Nested Sampling, and Slice sampling (see Skilling 2004; Betancourt 2011; Skilling 2012; Aitken and Akman 2013, respectively).

The second method for solving constrained sampling, *region sampling*, is to guess where the permitted region lies, and draw from the prior directly. Such a guess is augmented by the live points, which trace out the likelihood constraint contour. The most well-known algorithm for such an approach is MULTINEST (Feroz and Hobson 2008; Feroz et al. 2009, 2013). Using a clustering algorithm, MULTINEST encapsulates the live points in a number of hyperellipses, and draws only inside these regions. Aside from a long list of successful applications of the MULTINEST algorithm in particle physics, cosmology and astronomy, a single problematic case has been discovered in Beaujean and Caldwell (2013) and analysed in Feroz et al. (2013). Under this perhaps pathological, but physics-motivated likelihood definition, the MULTINEST algorithm consistently gives incorrect results. What then can be sources of such a problem?

When constructing the sampling region, two errors can be made. The sampling region may contain space that falls below the likelihood threshold. This results in sampled points that are not useful and have to be rejected. This rejection sampling affects the number of likelihood function evaluations. In high-dimensional problems, the spaces grow quickly, such that the fraction of useless points can become prohibitive. In practice, the MULTINEST algorithm works inefficiently beyond \(\sim 20\) dimensions (Feroz and Hobson 2008). However, contrary to the “local steps” method above, the points obtained are guaranteed to be drawn uniformly from the sampling region by construction.

The second and more severe type of error is the inadvertent exclusion of relevant volume from the constructed sampling region. This under-estimation of the prior space can lead to biased likelihood draws, either to higher or lower values, depending on the problematic situation. To avoid this problem, the sampling region is typically expanded by a constant growth factor. But can such an algorithmic problem be detected, at least in constructed test problems? I present a statistical test, the Shrinkage Test.

## 3 The shrinkage test

The shrinkage of the prior volume in nested sampling is known: \(1/N\) of the volume is supposed to be removed. If the shrinkage is accelerated by inadvertently missing a sampling region, this is no longer true.

The distribution of the volume shrinkage \(t_{i}=\frac{V_{i+1}}{V_{i}}\) is given by \(p(t;\, N)\sim t^{N-1}\), i.e. a beta distribution with the shape parameters \(\alpha =N\) and \(\beta =1\). Its cumulative distribution is thus simply \(t^{N}\). This function is cornered at \(t\approx 1\) for reasonable values of \(N\) (\(\sim 400\)). For nicer visualisation, let us consider the border that is being cut away in each of the \(d\) dimensions: \(S=1-t^{1/d}\). The expected cumulative distribution of \(S\) is then \(p(<S)=1-(1-S)^{d\cdot N}\).
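
The expected cumulative distribution follows in one step from the cumulative distribution \(t^{N}\) of the beta law, since \(S<s\) is equivalent to \(t>(1-s)^{d}\):

$$\begin{aligned} p(<S)=P\left( t>(1-S)^{d}\right) =1-\left[ (1-S)^{d}\right] ^{N}=1-(1-S)^{d\cdot N} \end{aligned}$$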

To test conformity with uniform sampling, the constrained sampling algorithm is applied for many iterations (e.g. 10,000). Using the sequence of removed points, the removed volume \(S\) is computed at each iteration and compared to the expected cumulative distribution. The frequency of deviations between the theoretical and the obtained distribution can be assessed visually. As the number of samples can be increased, discrepancies become arbitrarily clear. To quantify the distance, the Kolmogorov–Smirnov (KS) test can, for example, be applied.
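
As a sketch of the procedure, the following Python code simulates the test for an ideal uniform sampler, for which each shrinkage factor is an exact \(\mathrm{Beta}(N,1)\) draw; the KS comparison uses the asymptotic 5 % critical value rather than a library routine (the implementation details are my own, not the paper's):

```python
import math
import random

def shrinkage_test(n_iter=10000, N=400, d=7, seed=1):
    """Simulate the shrinkage test for an ideal uniform sampler.

    Each iteration the volume shrinks by t ~ Beta(N, 1); we record
    S = 1 - t**(1/d) and compare its empirical distribution against
    the expected CDF  P(<S) = 1 - (1 - S)**(d*N)  via a KS statistic.
    """
    rng = random.Random(seed)
    # Inverse-CDF sampling: if u ~ U(0,1), then u**(1/N) ~ Beta(N, 1).
    S = sorted(1.0 - (rng.random() ** (1.0 / N)) ** (1.0 / d)
               for _ in range(n_iter))
    cdfs = [1.0 - (1.0 - s) ** (d * N) for s in S]
    # Two-sided KS distance between empirical and expected CDF.
    D = max(max((i + 1) / n_iter - c, c - i / n_iter)
            for i, c in enumerate(cdfs))
    critical = 1.36 / math.sqrt(n_iter)   # asymptotic 5% level
    return D, D < critical
```

For a flawed sampler, the recorded \(S\) values would be systematically too large, and \(D\) would exceed the critical value.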

When applying the test in this work, I will use \(s=100\) and \(\sigma _{i}=1\) (hyper-cube contours). However, this test can simulate a wide variety of shapes including problems with multiple scales (e.g. with \(\sigma _{i}=10^{-3i/d}\)), or Gaussian likelihoods where the contours are hyper-ellipses. The case of multiple modes can also be considered. It should be stressed that the dimensionality of the test can be chosen, and varied to analyse the algorithm of interest.

## 4 Application of the shrinkage test

Let us now verify whether the MULTINEST algorithm, with commonly used parameters, passes the Shrinkage Test. Other algorithms are considered later in Sect. 7. I use version 3.4 of the MULTINEST library (Feroz and Hobson 2008; Feroz et al. 2009). I set the sampling efficiency to \(30\,\%\) and the maximum number of modes to 100. I use two configurations, with 400 and 1,000 live points, both without importance nested sampling (see Feroz et al. 2013).

## 5 Robustness against accelerated shrinking

*could* be sampled if it were not known. An initial idea is to leave each point out in turn, compute the distance to its nearest neighbour, and use the maximum of this quantity as \(R\). Such a jackknife scheme is quite robust, as all points are then closer than \(R\) to a live point. However, had the point donating the maximum \(R\) not been in the sample, it could not be obtained. I thus go further and employ a bootstrapping-like method, which I describe now in detail.

## 6 The RADFRIENDS algorithm

The RadFriends constrained sampling algorithm has to sample a new live point subject to the constraint that it has a higher likelihood value than \(L_{\text {min}}\). It proceeds as laid out in the `draw_constrained` procedure in Listing 1. The `compute_R` procedure computes the aforementioned \(R\), which is the largest distance to a neighbour. Here, a bootstrap-like procedure is employed to generate a conservative estimate of \(R\) by repeatedly leaving points out and ensuring they could be sampled. This distance \(R\) is then used to define the region around the live points to sample from.

The sampling procedure `draw_near` can then be implemented in two ways, which are equivalent with regard to the number of likelihood evaluations and the properties of the generated samples. Both are shown in Algorithm 2. The simpler method is to sample a random point from the prior and check whether it lies within distance \(R\) of at least one live point. If not, the procedure is repeated. The second method is to choose a random live point, and to generate a random point that fulfils the distance criterion by construction (see caption of Algorithm 2). The point so generated must only be accepted with probability \(1/m\), where \(m\) is the number of live points within distance \(R\), to avoid a preference for clustered regions. The second method is more efficient than the first when the remaining volume is small, as otherwise many points are rejected.
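
A minimal Python sketch of these two procedures follows. This is my own illustrative implementation, not the paper's Listings; it uses the simpler rejection-based variant of `draw_near`, which needs no \(1/m\) correction because the prior is sampled uniformly over the whole union region:

```python
import math
import random

def dist(a, b):
    """Euclidean (L2) distance; use max(|x_i - y_i|) for SupFriends."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def compute_R(live_points, n_bootstrap=50):
    """Conservative radius estimate via a bootstrap-like procedure.

    In each round, resample the live points with replacement; every point
    left out of the resample must remain reachable, so record its distance
    to the nearest resampled point and keep the overall maximum as R.
    """
    n = len(live_points)
    R = 0.0
    for _ in range(n_bootstrap):
        chosen = {random.randrange(n) for _ in range(n)}
        inside = [live_points[i] for i in chosen]
        for i in range(n):
            if i not in chosen:
                R = max(R, min(dist(live_points[i], q) for q in inside))
    return R

def draw_constrained(live_points, loglike, Lmin, R):
    """Draw a prior sample above Lmin (rejection variant of draw_near).

    Points are drawn uniformly from the unit-cube prior and accepted if
    they lie within R of some live point and exceed the threshold.
    """
    d = len(live_points[0])
    while True:
        u = [random.random() for _ in range(d)]
        if min(dist(u, p) for p in live_points) <= R and loglike(u) > Lmin:
            return u
```

Swapping `dist` for the supremum norm turns this sketch into the SupFriends variant.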

The remaining choice is which norm to use to define the distance. Here I consider the Euclidean (\(L_{2}\)) norm \(\left\| x\right\| \), and the supremum (\(L_{\infty }\)) norm \(\sup \left| x\right| \) (see Listing 2). I term the variant of RadFriends that uses the supremum norm SupFriends.

### 6.1 Analysis of the emergent properties

*every* of the \(m\) iterations, i.e. never having left it out, is

*any* of the \(N\) points in

*every* iteration, is \(N\) times higher. Here I neglect the correction for this being the case for more than one point, which leads to the upper bound

Figure 2 already demonstrates that this algorithm can immediately handle multiple modes, as clustering of points is an emergent feature. This yields efficient sampling iff the region in between is excluded. When is this the case? Consider a small cluster with \(k\) points, well separated from the other live points. It will be treated as a separate cluster if one of the members is always selected in the bootstrapping rounds. Leaving out all \(k\) points simultaneously has probability \(p_{k,all}=p_{1}^{k}\times m\). For \(m=50\), and \(k=10,\,20,\,40\), this probability is \(p_{k,all}=0.5,\,0.005,\,5\times 10^{-7}\). In words, one can expect efficient sampling of the sub-cluster if it contains more than 20 points. However, this means that for a problem with e.g. \(20\) well-separated modes, \(20\times 40=800\) live points are needed to safely avoid the inefficient sampling between the modes.
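
The quoted probabilities can be reproduced in a few lines if one takes \(p_{1}=1-e^{-1}\approx 0.632\), the per-round bootstrap inclusion probability of a point; note that this value of \(p_{1}\) is my reading inferred from the quoted results, not stated explicitly above:

```python
import math

m = 50                       # number of bootstrap rounds
p1 = 1.0 - math.exp(-1.0)    # assumed value of p_1 (inferred, ~0.632)
p_all = {k: p1 ** k * m for k in (10, 20, 40)}
# p_all is approximately {10: 0.51, 20: 0.005, 40: 5.4e-07}
```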

## 7 Shrinkage test results

**Table 1** Results of the shrinkage test using the hyper-pyramid likelihood function. The *p* value of the KS test indicates the expected frequency of the result (values below 0.05 are marked with a star)

| Algorithm | Dim | \(p_{shrinkage}\) | Iterations | Evaluations | Efficiency |
|---|---|---|---|---|---|
| Rejection | 2 | 0.7324 | 32,000 | 71,092,909 | 0.05% |
| Multinest | 2 | *0.0474 | 80,000 | 256,411 | 31.20% |
| Radfriends | 2 | 0.9105 | 80,000 | 132,026 | 60.59% |
| Supfriends | 2 | 0.5321 | 80,000 | 131,505 | 60.83% |
| Mcmc-gauss-50-adapt | 2 | 0.1961 | 80,000 | 4,000,000 | 2.00% |
| Mcmc-gauss-20-adapt | 2 | 0.1566 | 80,000 | 1,600,000 | 5.00% |
| Mcmc-gauss-10-adapt | 2 | 0.0732 | 80,000 | 800,000 | 10.00% |
| Mcmc-gauss-scale-5 | 2 | *0.0000 | 80,000 | 16,000,000 | 0.50% |
| Rejection | 7 | 0.5707 | 32,000 | 74,035,891 | 0.04% |
| Multinest | 7 | *0.0000 | 80,000 | 393,575 | 20.33% |
| Radfriends | 7 | 0.2651 | 80,000 | 2,711,519 | 2.95% |
| Supfriends | 7 | 0.0965 | 80,000 | 3,483,200 | 2.30% |
| Mcmc-gauss-50-adapt | 7 | 0.3643 | 80,000 | 4,000,000 | 2.00% |
| Mcmc-gauss-20-adapt | 7 | *0.0273 | 80,000 | 1,600,000 | 5.00% |
| Mcmc-gauss-10-adapt | 7 | *0.0000 | 80,000 | 800,000 | 10.00% |
| Mcmc-gauss-scale-5 | 7 | *0.0000 | 80,000 | 16,000,000 | 0.50% |
| Rejection | 20 | 0.5183 | 32,000 | 65,401,030 | 0.05% |
| Multinest | 20 | *0.0000 | 32,000 | 499,209 | 6.41% |
| Radfriends | 20 | 0.2954 | 32,000 | 26,129,495 | 0.12% |
| Supfriends | 20 | 0.6573 | 32,000 | 39,067,739 | 0.08% |
| Mcmc-gauss-50-adapt | 20 | 0.8785 | 32,000 | 1,600,000 | 2.00% |
| Mcmc-gauss-20-adapt | 20 | 0.4475 | 32,000 | 640,000 | 5.00% |
| Mcmc-gauss-10-adapt | 20 | *0.0000 | 32,000 | 320,000 | 10.00% |

Table 1 also shows that the MULTINEST algorithm is highly efficient. In typical applications, the MULTINEST algorithm uses one to two orders of magnitude fewer likelihood evaluations than the RADFRIENDS/SUPFRIENDS algorithms.

## 8 Test problems

In this section, I analyse the correctness and efficiency of the RADFRIENDS algorithm numerically. A number of common integration test problems have been verified; for brevity, only two are presented here, chosen because they expose the advantages and disadvantages best. For comparison, I include results from using MULTINEST with and without Importance Nested Sampling (Feroz et al. 2013). I run each algorithm 10 times, and record the average integral value, \(\hat{Z}\), the actual variance of this estimator, \(A^{2}\), and the average statistical uncertainty reported, \(C\).

### 8.1 Eggbox problem

### 8.2 LogGamma problem

The 10-dimensional problem demonstrates what happens when the algorithms begin to break down. Without Importance Nested Sampling, the computation terminates, but the integral value found is over-estimated. With Importance Nested Sampling enabled, MULTINEST mitigates the over-estimation to a sufficient degree. Both RADFRIENDS and SUPFRIENDS compute the evidence correctly, which shows that this problem can be solved by standard nested sampling. SUPFRIENDS requires one order of magnitude more evaluations than RADFRIENDS, which indicates that the choice of norm has a strong influence in problems of higher dimensionality.

## 9 Conclusions

I have presented a brief overview of algorithms for sampling under a constrained prior, which are a key ingredient in nested sampling, and employed to compute integrals for high-dimensional model comparison. I have explored the sources of errors in such algorithms and devised a test to uncover sampling errors.

The Shrinkage Test uncovers algorithms that violate the expectation of nested sampling in how the prior volume shrinks. Such problematic algorithms accelerate the shrinking, leaving out relevant parameter space, which leads to incorrect computation of the integral.

Although the Shrinkage Test is limited to likelihood functions with geometrically simple, well-understood contours (such as Gaussian likelihoods, or the hyper-pyramid used here), it can be used to verify correctness on high-dimensional problems, multi-modal likelihoods, and shapes with multiple scale lengths. Thus, it is capable of simulating a wide range of situations that occur in practice.

I apply the Shrinkage Test to the popular MULTINEST algorithm, and find that it fails in the 7- and 20-dimensional cases. This indicates that in the studied case, relevant prior volume is left out. This type of error may also be the source of the failure to integrate the LogGamma problem correctly. In contrast, the RADFRIENDS algorithm presented here

- 1.
passes the Shrinkage Test,

- 2.
solves the LogGamma problem and others correctly, and

- 3.
can handle multi-modal problems and peculiar shapes without tuning parameters or additional input information.

The presented algorithm is simple to implement, and can be understood analytically. It is thus proposed as a safe, easy-to-implement baseline algorithm for low-dimensional problems.

In a similar spirit, the method of Mukherjee et al. (2006) and the MULTINEST algorithm could be made more robust. I suggest leaving a fraction of the live points out when constructing the ellipsoids, and then expanding the ellipsoids to such a degree that the left-out live points are included. This can be repeated a few times to obtain a robust ellipsoid expansion factor on-line.

## 10 Future work

The *region sampling* type of constrained sampling algorithms, which constructs a sampling region from the live points, requires further study, especially in the high-dimensional regime. For instance, machine learning algorithms, such as Support Vector Machines, may be useful to learn the border between live points and already discarded points. Improvements and further studies of the simple RADFRIENDS algorithm are also left to future work. For example, applying Importance Nested Sampling (Cameron and Pettitt 2013) in RADFRIENDS is directly analogous to how it was developed for MULTINEST in Feroz et al. (2013). The study of the impact of the distance measure, and alternative norms may also be useful for higher dimensional problems.

The option of combining *region sampling* and *local step* methods into hybrid algorithms should be explored to combine their respective strengths. For instance, the permissible region from RADFRIENDS may be used to restrict the proposal distribution of Markov Chain Monte Carlo, or its hyper-spheres may be used as reflection surfaces for Galilean Monte Carlo. The scale of the region (\(R\)) can also be used to tune the step size. Such a RadFriends/MCMC hybrid method written in C, named UltraNest, is available at http://johannesbuchner.github.io/nested-sampling/UltraNest/. A framework for developing and testing nested sampling algorithms in Python is available at http://johannesbuchner.github.io/nested-sampling/, for which contributions are welcome. A reference implementation of RADFRIENDS can also be found there.

### Acknowledgments

I would like to thank Frederik Beaujean and Udo von Toussaint for reading the initial manuscript. This manuscript has also greatly benefited from the comments of the two anonymous referees, whom I would likewise like to thank. I acknowledge funding through a doctoral stipend by the Max Planck Society.

### References

- Aitken, S., Akman, O.E.: Nested sampling for parameter inference in systems biology: application to an exemplar circadian model. BMC Syst. Biol. **7**(1), 72 (2013)
- Beaujean, F., Caldwell, A.: Initializing adaptive importance sampling with Markov chains. ArXiv e-prints (2013)
- Betancourt, M.: Nested sampling with constrained Hamiltonian Monte Carlo. In: Mohammad-Djafari, A., Bercher, J.-F., Bessière, P. (eds.) American Institute of Physics Conference Series, vol. 1305, pp. 165–172 (2011)
- Cameron, E., Pettitt, A.: Recursive pathways to marginal likelihood estimation with prior-sensitivity analysis. ArXiv e-prints (2013)
- Chopin, N., Robert, C.: Comments on nested sampling by John Skilling. Bayesian Stat. **8**, 491–524 (2007)
- Chopin, N., Robert, C.P.: Properties of nested sampling. Biometrika (2010)
- Evans, M.: Discussion of nested sampling for Bayesian computations by John Skilling. Bayesian Stat. **8**, 491–524 (2007)
- Feroz, F., Hobson, M.P.: Multimodal nested sampling: an efficient and robust alternative to Markov Chain Monte Carlo methods for astronomical data analyses. MNRAS **384**, 449–463 (2008)
- Feroz, F., Hobson, M.P., Bridges, M.: MULTINEST: an efficient and robust Bayesian inference tool for cosmology and particle physics. MNRAS **398**, 1601–1614 (2009)
- Feroz, F., Hobson, M.P., Cameron, E., Pettitt, A.N.: Importance nested sampling and the MultiNest algorithm. ArXiv e-prints (2013)
- Mukherjee, P., Parkinson, D., Liddle, A.R.: A nested sampling algorithm for cosmological model selection. ApJ **638**, L51–L54 (2006)
- Sivia, D., Skilling, J.: Data Analysis: A Bayesian Tutorial. Oxford Science Publications. Oxford University Press, Oxford (2006)
- Skilling, J.: Nested sampling. In: AIP Conference Proceedings, vol. 735, p. 395 (2004)
- Skilling, J.: Nested sampling's convergence. In: Bayesian Inference and Maximum Entropy Methods in Science and Engineering: The 29th International Workshop, vol. 1193, pp. 277–291. AIP Publishing, New York (2009)
- Skilling, J.: Bayesian computation in big spaces: nested sampling and Galilean Monte Carlo. AIP Conf. Proc. **1443**(1), 145–156 (2012)