1 Introduction

Realistic simulations of scattering events at particle collider experiments play an indispensable role in the analysis and interpretation of actual measurement data, for example at the Large Hadron Collider (LHC) [1, 2]. A central component of such event simulations is the generation of hard scattering configurations according to a density given by the squared transition matrix element of the concrete process under consideration. This is needed both for the evaluation of corresponding cross sections and for the explicit generation of individual events that potentially get further processed, e.g. by attaching parton showers, invoking phenomenological models to account for the parton-to-hadron transition, and eventually, a detector simulation. Adequately addressing the physics needs of the LHC experiments requires the evaluation of a wide range of high-multiplicity hard processes that feature a highly non-trivial multimodal target density that is rather costly to evaluate. The structure of the target is thereby affected by the appearance of intermediate resonances, quantum interferences, the emission of soft and/or collinear massless gauge bosons, or non-trivial phase space constraints due to kinematic cuts on the final state particles. The dimensionality and complexity of the phase space sampling problem make the use of numerical methods, and in particular Monte Carlo techniques, indispensable for its solution.

The most widely used approach relies on adaptive multi-channel importance sampling, see for example [3,4,5,6,7]. However, to achieve good performance, detailed knowledge of the target distribution, i.e. the squared matrix element, is needed. To this end, information about the topology of scattering amplitudes contributing to the considered process is employed in the construction of individual channels. Alternatively, and also used in combination with importance sampling phase space maps, variants of the self-adaptive VEGAS algorithm [8] are routinely applied [9,10,11,12].

An alternative approach for sampling according to a desired probability density is offered by Markov Chain Monte Carlo (MCMC) algorithms. However, in the context of phase space sampling in high energy physics these techniques have attracted rather limited attention, see in particular [13, 14]. More recently a mixed kernel method combining multi-channel sampling and MCMC, dubbed \((\text {MC})^3\), has been presented [15]. A typical feature of such MCMC-based algorithms is the potential autocorrelation of events, which can affect their direct applicability in typical use case scenarios of event generators.

To meet the computing challenges posed by the upcoming and future LHC collider runs and the corresponding event simulation campaigns, improvements of the existing phase space sampling and event unweighting techniques will be crucial [16, 17]. This has sparked renewed interest in the subject, largely driven by applications of machine learning techniques, see for instance [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36].

In this article we explore an alternative direction. We study the application of Nested Sampling [37], as implemented in PolyChord [38], to phase space integration and event generation for high energy particle collisions. We assume no prior knowledge about the target and investigate the ability of the algorithm to adapt to the problem. Nested Sampling was originally proposed to perform Bayesian inference computations for high dimensional parameter spaces, while also providing the evidence integral, i.e. the integral of the likelihood over the prior density. This makes it ideally suited for our purpose. In Sect. 2 we introduce Nested Sampling as a method to perform cross section integrals and event generation, including a reliable uncertainty estimation. In Sect. 3 we apply the method to gluon scattering into 3-, 4- and 5-gluon final states as a benchmark for jet production at hadron colliders, thereby comparing results for total cross sections and differential distributions with established standard techniques. The important features of the algorithm when applied in the particle physics context are also evaluated in this section. In Sect. 4 we illustrate several avenues for future research, extending the work presented here. Finally, we present our conclusions in Sect. 5.

2 Nested sampling for event generation

The central task when exploring the phase space of scattering processes in particle physics is to compute the cross section integral, \(\sigma \). This requires the evaluation of the squared transition matrix element, \(|\mathcal {M} |^2\), integrated over the phase space volume, \(\varOmega \), where \(\varOmega \) is composed of all possible kinematic configurations, \(\Phi \), of the external particles. Up to some constant phase space factors this amounts to performing the integral,

$$\begin{aligned} \sigma = \int \limits _\varOmega d\Phi |\mathcal {M} |^2 (\Phi )\,. \end{aligned}$$
(1)

In practice, rather than sampling the physical phase space variables, i.e. the particles’ four-momenta, it is typical to integrate over configurations, \(\theta \in [0,1]^D\), from the D-dimensional unit hypercube. Some mapping, \(\Pi :[0,1]^D\rightarrow \varOmega \), is then employed to translate the sampled variables to the physical momenta. The mapping is defined as \(\Phi = \Pi (\theta )\), and the integral in Eq. (1) is written,

$$\begin{aligned} \sigma = \int \limits _{[0,1]^D} d\theta |\mathcal {M} |^2 (\Pi (\theta )) \mathcal {J}(\theta ) = \int \limits _{[0,1]^D} d\theta \mathcal {L}(\theta )\,. \end{aligned}$$
(2)

A Jacobian associated with the change of coordinates between \(\theta \) and \(\Phi \) has been introduced, \(\mathcal {J}\), and then absorbed into the definition of \(\mathcal {L}(\theta ) = |\mathcal {M} |^2(\Pi (\theta )) \mathcal {J}(\theta )\). With no general analytic solution to the sorts of scatterings considered at the high energy frontier, this integral must be estimated with numerical techniques. Numerical integration involves sampling from the \(|\mathcal {M} |^2\) distribution in a manner that gives a convergent estimate of the true integral when the samples are summed. As a byproduct, this set of samples can be used to estimate integrals of arbitrary sub-selections of the integrated phase space volume, decomposing the total cross section into differential cross section elements, \(d\sigma \). Additionally, these samples can be unweighted and used as pseudo-data to emulate the experimental observations of the collisions. The current state-of-the-art techniques for performing these tasks were briefly reviewed in Sect. 1.
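To make the integral in Eq. (2) concrete, the following minimal sketch estimates \(\sigma \) by drawing uniform points from the unit hypercube. The integrand here is a placeholder Gaussian bump standing in for \(|\mathcal {M}|^2(\Pi (\theta ))\mathcal {J}(\theta )\), not any actual matrix element; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

def integrand(theta):
    """Placeholder for L(theta) = |M|^2(Pi(theta)) * J(theta): a narrow Gaussian bump."""
    return np.exp(-0.5 * np.sum(((theta - 0.5) / 0.1) ** 2, axis=-1))

D, N = 5, 200_000                       # e.g. D = 3n - 4 for an n-gluon final state
theta = rng.random((N, D))              # uniform draws from the unit hypercube
values = integrand(theta)

sigma = values.mean()                   # crude MC estimate of the integral in Eq. (2)
sigma_err = values.std(ddof=1) / np.sqrt(N)
print(f"sigma = {sigma:.3e} +- {sigma_err:.1e}")
```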

Importance Sampling (IS) is a Monte Carlo technique used extensively in particle physics when one needs to draw samples from a distribution with a target probability density function, \(P(\Phi )\), that cannot be sampled directly. Importance Sampling approaches this problem by instead drawing from a known sampling distribution, \(Q(\Phi )\) (a number of standard texts for inference give a more thorough exposition of the general sampling theory used in this paper, see e.g. [39]). Samples drawn from Q are assigned a weight, \(w=P(\Phi )/Q(\Phi )\), adjusting the importance of each sampled point. The performance of IS rests heavily on how well the sampling distribution can be chosen to match the target, and adaptive schemes like VEGAS are employed to refine initial proposals. It is well established that as the dimensionality and complexity of the target increase, the task of constructing a viable sampling distribution becomes increasingly challenging.
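As a sketch of the weighting step just described, the importance weights are simply the ratio of the target to the proposal density at each drawn point. The one-dimensional densities below are toy choices for illustration, not a phase space map.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_P(x):
    """Unnormalised target density: a narrow peak the proposal does not match well."""
    return np.exp(-0.5 * ((x - 1.0) / 0.2) ** 2)

def proposal_Q_pdf(x):
    """Known sampling density Q: a broad unit Gaussian."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)          # draw from Q
w = target_P(x) / proposal_Q_pdf(x)             # importance weights w = P/Q

integral = w.mean()                             # estimates the normalisation of P
eff = w.sum() ** 2 / (w.size * (w ** 2).sum())  # effective sample fraction; -> 1 if Q matches P
print(f"integral ~ {integral:.3f}, effective sample fraction ~ {eff:.3f}")
```

The effective sample fraction quantifies how strongly the mismatch between Q and P degrades the sample, which is the practical issue adaptive schemes try to mitigate.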

Markov Chain based approaches fundamentally differ in that they employ a local sampling distribution and define an acceptance probability with which to accept new samples. Markov Chain Monte Carlo (MCMC) algorithms are widely used in Bayesian inference. Numerical Bayesian methods have to be able to iteratively refine the prior distribution to the posterior, even in cases where the two distributions are largely disparate, making stochastic MCMC refinement an indispensable tool in many cases. This is an important conceptual point; in the particle physics problems presented in this work we are sampling from exact, theoretically derived distributions. The lack of noise and the a priori well known structure make methods with deterministic proposal distributions such as IS initially more appealing; however, at some point increasing the complexity and dimensionality of the problem forces one to use stochastic methods. Lattice QCD calculations are a prominent example of an adjacent class of problems, sampling from theoretically derived distributions, that makes extensive use of MCMC approaches [40]. MCMC algorithms introduce an orthogonal set of challenges to IS: a local proposal is inherently simpler to construct, but exploration of multimodal target distributions and autocorrelation of samples become new challenges to address.

Nested Sampling (NS) is a well established algorithm for the numerical evaluation of high dimensional integrals [37]. NS differs from typical MCMC samplers as it is primarily an integration algorithm, and hence by construction has to overcome many of the difficulties MCMC samplers face in multimodal problems. A recent community review of its applications in the physical sciences and of the various implementations of the algorithm is given in [41].

Fig. 1

Schematic of live point evolution (blue dots) in Nested Sampling, over a two-dimensional function whose logarithm is the negative Himmelblau function (contours). Points are initially drawn from the unit hypercube (top panel). The points on the lowest contours are successively deleted, causing the live points to contract around the peak(s) of the function. After sufficient compression is achieved, the dead points (orange) may be weighted to compute the volume under the surface and samples from probability distributions derived from the function

At its core NS operates by maintaining a number, \(n_\text {live}\), of live point samples. This ensemble of live points is initially uniformly sampled from \(\theta \in [0,1]^D\) – distributed in the physical volume \(\varOmega \) according to the shape of the mapping \(\Pi \). These live points are sorted in order of \(\mathcal {L} (\theta )\) evaluated at the phase space point, and the point with the lowest \(\mathcal {L}\), \(\mathcal {L} _{\text {min}}\), in the population is identified. A replacement for this point is found by sampling uniformly under a hard constraint requiring \(\mathcal {L} >\mathcal {L} _{\text {min}} \). The volume enclosed by this next iteration of live points has contracted, and the procedure of identifying the lowest \(\mathcal {L}\) point and replacing it is repeated. An illustration of three different stages of this iterative compression on an example two-dimensional function is shown in Fig. 1. The example function used in this case has four identical local maxima to find; practical exploration and discovery of the modes is achieved by having a sufficient number (\(\mathcal {O}\left( 10\right) \)) of initial samples in the basin of attraction of each mode. This can either be achieved by brute-force sampling a large number of initial samples, or by picking an initial mapping distribution that better reflects the multi-modal structure. By continually sampling uniformly from a steadily compressing volume, NS can estimate the density of points, which is necessary for computing an integral as given by Eq. (1). Once the iterative procedure reaches a point where the live point ensemble occupies a predefined small fraction of the initial volume, \(T_C\), the algorithm terminates. The fraction \(T_C\) can be characterised as the termination criterion. The points discarded throughout the evolution are termed dead points; these can be joined with the remaining live points to form a representative sample of the function that can be used to estimate the integral or to provide a random sample of events.

Nested Sampling estimates the integral and generates (weighted) random samples by probabilistically estimating the volume of the shell between the two outermost points as approximately \(\frac{1}{n_\text {live}}\) of the current live volume. The volume \(X_j\) within the contour \(\mathcal {L}_j\) – defined by the point with \(\mathcal {L} _{\text {min}} \) – at iteration j may therefore be estimated as,

$$\begin{aligned} X_j&= \int _{\mathcal {L}(\theta )>\mathcal {L}_j}d\theta \quad \Rightarrow \quad X_0=1, \\ P(X_{j}|X_{j-1})&= \frac{X_j^{n_\text {live}-1}}{n_\text {live}\, X_{j-1}^{n_\text {live}}} \quad \Rightarrow \quad \log X_j \approx \frac{-j \pm \sqrt{j}}{n_\text {live}}\,. \end{aligned}$$

The cross section and probability weights can therefore be estimated as,

$$\begin{aligned} \sigma = \int d\theta \,\mathcal {L}(\theta ) = \int dX\, \mathcal {L}(X) \approx \sum _j \mathcal {L}_j\, \Delta X_j, \qquad w_j \approx \frac{\Delta X_j\,\mathcal {L}_j }{\sigma }. \end{aligned}$$
(3)

Importantly, for all of the above the approximation signs indicate errors in the procedure of probabilistic volume estimation, which are fully quantifiable.

The method to sample new live points under a hard constraint can be realised in multiple ways, and this is one of the key differences between the various implementations of NS. In this work we employ the PolyChord implementation of Nested Sampling [38], which uses slice sampling [42] MCMC steps to evolve the live points. In this sense, NS can be viewed as an ensemble of many short Markov Chains.
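The compression and weighting described above can be condensed into a short toy implementation. The sketch below, with names of our own choosing, uses naive rejection sampling for the constrained replacement step (where PolyChord instead uses slice-sampling MCMC) and a Gaussian stand-in for the target; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def loglike(theta):
    """Toy log-target on the unit square, standing in for log(|M|^2 J)."""
    return -0.5 * np.sum(((theta - 0.5) / 0.05) ** 2, axis=-1)

n_live, D, T_C = 200, 2, 1e-3
live = rng.random((n_live, D))
live_logL = loglike(live)

dead_L, dead_X = [], []
j, logX = 0, 0.0
while logX > np.log(T_C):                  # stop once the live volume fraction falls below T_C
    i_min = int(np.argmin(live_logL))
    logL_min = live_logL[i_min]
    j += 1
    logX = -j / n_live                      # expected log-volume after j compressions
    dead_L.append(np.exp(logL_min))
    dead_X.append(np.exp(logX))

    # Replace the deleted point by a uniform draw obeying the hard constraint L > L_min.
    # (Naive rejection here; PolyChord performs this step with slice sampling.)
    while True:
        cand = rng.random(D)
        if loglike(cand) > logL_min:
            break
    live[i_min], live_logL[i_min] = cand, loglike(cand)

X = np.array(dead_X)
dX = np.concatenate(([1.0], X[:-1])) - X   # shell volumes Delta X_j
sigma = np.sum(np.array(dead_L) * dX) + X[-1] * np.mean(np.exp(live_logL))
w = np.array(dead_L) * dX / sigma          # dead point weights, as in Eq. (3)
print(f"NS estimate: {sigma:.4f}  (analytic value ~ {2 * np.pi * 0.05**2:.4f})")
```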

Much of the development and usage of NS has focused on the problem of calculation of marginal likelihoods (or evidences) in Bayesian inference, particularly within the field of Cosmology [43,44,45,46,47,48]. We can define the Bayesian evidence, \(\mathcal {Z}\), analogously to the particle physics cross section, \(\sigma \). NS in this context evaluates the integral,

$$\begin{aligned} \mathcal {Z} = \int d\theta \mathcal {L} (\theta ) \pi (\theta )\,, \end{aligned}$$
(4)

where the likelihood function, \(\mathcal {L}\), plays a similar role to \(|\mathcal {M} |^2\). In the Bayesian inference context, the phase space over which we are integrating, \(\theta \), has a measure defined by the prior distribution, \(\pi (\theta )\), which without loss of generality under a suitable coordinate transformation can be taken to be uniform over the unit hypercube. Making the analogy between the evidence and the cross section explicit will allow us to apply some of the information theoretic metrics commonly used in Bayesian inference to the particle physics context [49], and provide terminology used throughout this work. Among a wide array of sampling methods for Bayesian inference, NS possesses some unique properties that enable it to successfully compute the high dimensional integral associated with Eq. (4). These properties also bear a striking similarity to the requirements one would like to have to explore particle physics phase spaces. These are briefly qualitatively described as follows:

  • NS is primarily a numerical integration method that produces posterior samples as a by-product. In this respect it is comfortably similar to Importance Sampling as the established tool in particle physics event generation. It might initially be tempting to approach the particle physics event generation task purely as a posterior sampling problem. Standard Markov Chain based sampling tools cannot generically give good estimates of the integral, so they are not suited to compute the cross section. Additionally, potential issues with coverage of the full phase space by the resulting event samples are addressed by default, since a convergent estimate of the integral over all of the phase space must be obtained.

  • NS naturally handles multimodal problems [45, 46]. The iterative compression can be augmented by inserting steps that cluster the live points periodically throughout the run. Defining subsets of live points and evolving them separately allows NS to naturally tune itself to the modality of unseen problems.

  • NS requires a construction that can handle sampling under a hard likelihood constraint in order to perform the compression of the volume throughout the run. Hard boundaries in the physics problem, such as un-physical or deliberately cut phase space regions, manifest themselves in the sampling space as a natural extension of these constraints.

  • NS is largely self tuning. Usage in Bayesian inference has found that NS can be applied to a broad range of problems with little optimisation of hyper-parameters necessary [50,51,52]. NS can adapt to different processes in particle physics without requiring any prior knowledge of the underlying process.

The challenge in presenting NS in this new context is to find a fair comparison of sampling performance between NS and IS. It is typical in phase space sampling to compare the difference between the target and the sampling distribution, as reducing the variation between these two distributions gives a clear performance metric for IS. For NS there is no such global sampling distribution; the closest analogue is the prior, which is iteratively refined with local proposals into an estimate of the target. In Sect. 2.1 we attempt to compare the sampling distribution between NS and IS using a toy problem; in the full physical gluon scattering example presented in Sect. 3 we instead focus directly on the properties of the estimated target distribution, as this is the most direct, equitable point of comparison.

2.1 Illustrative example

To demonstrate the capabilities of NS we apply the algorithm to an illustrative sampling problem in two dimensions. Further examples validating PolyChord on a number of challenging sampling toy problems are included in the original paper [38]; here we present a modified version of the Gaussian Shells scenario. An important distinction of the phase space use case not present in typical examples is the emphasis on calculating finely binned differential histograms of the total integral. As a comparison to NS, we sample the same problem with a method that is well known in high energy physics – adaptive Importance Sampling (IS), realised using the VEGAS algorithm.

For our toy example we introduce a “stop sign” target density, whose unnormalised distribution is defined by

$$\begin{aligned} f(x, y)&= \frac{1}{2\pi ^2} \frac{\Delta r}{\left( \sqrt{(x-x_0)^2+(y-y_0)^2}-r_0\right) ^2+(\Delta r)^2} \nonumber \\&\quad \cdot \frac{1}{\sqrt{(x-x_0)^2+(y-y_0)^2}} \nonumber \\&\quad + \frac{1}{2\pi r_0} \frac{\Delta r}{((y-y_0)-(x-x_0))^2+(\Delta r)^2}\nonumber \\&\quad \cdot \Theta \left( r_0 - \sqrt{(x-x_0)^2+(y-y_0)^2}\right) \,, \end{aligned}$$
(5)

where \(\Theta (x)\) is the Heaviside function. It is the sum of a ring and a line segment, both with a (truncated) Cauchy profile. The ring is centred at \((x_0, y_0) = (0.5, 0.5)\) and has a radius of \(r_0 = 0.4\). The line segment is located in the inner part of the ring and runs through the entire diameter. We set the width of the Cauchy profile to \(\Delta r = 0.002\). This distribution can be seen as an example of a target where it makes sense to tackle the sampling problem with a multi-channel distribution. One channel could be chosen to sample the ring in polar coordinates and one to sample the line segment in Cartesian coordinates. However, here we deliberately use VEGAS as a single channel in order to highlight the limitations of the algorithm. From the perspective of a single channel, there is no coordinate system to factorise the target distribution. That poses a serious problem for VEGAS, as it uses a factorised sampling distribution where the variables are sampled individually. Both algorithms are given zero prior knowledge of the target, thus starting with a uniform prior distribution.
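For reference, a direct transcription of Eq. (5) into code is given below; the grid evaluation at the end is merely illustrative, and the variable names are our own.

```python
import numpy as np

X0, Y0, R0, DR = 0.5, 0.5, 0.4, 0.002   # ring centre, radius and Cauchy width from the text

def stop_sign(x, y):
    """Unnormalised 'stop sign' target of Eq. (5): a Cauchy ring plus a Cauchy line segment."""
    r = np.sqrt((x - X0) ** 2 + (y - Y0) ** 2)
    ring = DR / (2.0 * np.pi ** 2) / ((r - R0) ** 2 + DR ** 2) / r
    line = DR / (2.0 * np.pi * R0) / (((y - Y0) - (x - X0)) ** 2 + DR ** 2)
    return ring + line * (r < R0)        # Heaviside step keeps the line inside the ring

# Quick evaluation on a 200x200 grid of the unit square (r = 0 is singular but not hit here).
xx, yy = np.meshgrid(np.linspace(0.0, 1.0, 200), np.linspace(0.0, 1.0, 200))
density = stop_sign(xx, yy)
```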

Fig. 2

A two-dimensional toy example: a Histogram of the target function along with the marginal sampling distributions of VEGAS and PolyChord. b Ratio of the target function and the probability density function of VEGAS. c Ratio of the target density to the sampling density of PolyChord

Our VEGAS grid has 200 bins per dimension. We train it over 10 iterations, in each of which we draw 30k points from the current mapping and adapt the grid to the data. The distribution defined by the resulting grid is then used for IS without further adaptation. This corresponds to the typical use in an event generator, where there is first an integration phase in which, among other things, VEGAS is adapted, followed by a non-adaptive event generation phase. We note that VEGAS gets an advantage in this example comparison as we do not include the target evaluations from the training in the counting. However, it should be borne in mind that in a realistic application with a large number of events to be generated, the cost of training is comparatively low. For NS we use PolyChord with a number of live points \(n_\text {live} = {1000}\) and a chain length \(n_\text {repeats} = 4\); more complete details of the PolyChord settings and their implications are given in Sect. 3.1. Figure 2a shows the bivariate target distribution along with the marginal x and y distributions of the target, VEGAS and PolyChord. For this plot (as well as for Fig. 2c) we merged 70 independent runs of PolyChord to get a better visual representation due to the larger sample size. It can be seen that both algorithms reproduce the marginal distributions reasonably well. There is some mismatch at the boundaries for VEGAS. This can be explained by the fact that VEGAS, as a variance-reduction method, focuses on the high-probability regions, where it places many bins, and uses only a few bins for the comparably flat low-probability regions. As a result, the bins next to the boundaries are very wide and overestimate the tails. PolyChord also oversamples the tails, reflecting the fact that in this example the prior is drastically different from the posterior, meaning the initial phase of prior sampling in PolyChord is very inefficient. In addition it puts too many points where the ring and the line segment join, which is where we find the highest values of the target function. This is not a generic feature of NS at the termination of the algorithm; rather it reflects the nature of having two intersecting, sweeping degenerate modes in the problem, a rather unlikely scenario in any physical integral.

Figure 2b shows the ratio between the target distribution and the sampling distribution of VEGAS, representing the IS weights. It can be seen that the marginals of the ratio are relatively flat, with values between 0.1 and 5.7. In two dimensions, however, the ratio spans several orders of magnitude. By comparing Fig. 2a and b, paying particular attention to the very similar ranges of function values, it can be deduced that VEGAS almost completely fails to learn the structure of the target. It tries to represent the peak structure of the ring and the line segment by an enclosing square with a nearly uniform probability distribution.

The same kind of plot is shown in Fig. 2c for the PolyChord data. NS does not strictly define a sampling distribution; however, a proxy for it can be visualised by plotting the density of posterior samples. Here the values of the ratio are much smaller, between \(1\times 10^{-2}\) and 7. PolyChord produces a flatter ratio function than VEGAS while not introducing additional artifacts that are not present in the original function. The smallest/largest values of the ratio are found in the same regions as the smallest/largest values of the target function, implying that PolyChord tends to overestimate the tails and to underestimate the peaks. This can be most clearly explained by examining the profile of where posterior mass is distributed throughout a run, an important diagnostic tool for NS runs [53]. It is shown in Fig. 3, where the algorithm runs from left to right: starting with the entire prior volume remaining enclosed by the live points, \(\log X=0\), and running to termination, when the live points contain a vanishingly small remaining prior volume. The posterior mass profile, shown in blue, is the analogue of the sampling density in VEGAS. To contextualise this against the target function, a profile of the log-likelihood of the lowest live point in the live point ensemble is similarly shown as a function of the remaining prior volume, X. Nested Sampling can be motivated as a likelihood scanner, sampling from monotonically increasing likelihood shells. These two profiles reveal several features of this problem. Firstly, a phase transition is visible in the posterior mass profile. This occurs when the degenerate peak of the ring structure is reached: the likelihood profile exhibits a plateau while the iterations kill off the degenerate points at the peak of the ring, before proceeding to scan up the remaining line segment feature. An effective second plateau is found when the peak of the line segment is reached, with a final small detail being the superposition of the ring likelihood on the line segment. Once the live points all occupy the extrema of the line segment, there is a sufficiently small prior volume remaining that the algorithm terminates. The majority of the posterior mass, and hence the sampling density, is distributed around the points where the two peaks are ascended. This reflects the stark contrast between the initial prior sampling density and the target: the samples are naturally distributed where the most information is needed to effectively compress the prior to the posterior.

Fig. 3

Likelihood (\(\log \mathcal {L} \)) and posterior mass (\(\mathcal {L} X\)) profiles for a run of PolyChord on the example target density. The x-axis tracks the prior volume remaining as the run progresses, with \(\log X=0\) corresponding to the start of the run, with the algorithm compressing the volume from left to right, where the run terminates

Table 1 Comparison of VEGAS and NS for the toy example in terms of the size of the event samples produced. \(N_\mathcal {L} \) gives the number of target evaluations, \(N_W\) the number of weighted events and \(N_\mathrm {equal}\) the derived number of equal-weight events. An MC slice sampling efficiency, \(\epsilon _{\text {ss}}\), is listed for NS. A total, \(\epsilon \), and an unweighting, \(\epsilon _{\text {uw}}\), efficiency are listed for both algorithms. We report the mean and standard deviation of ten independent runs of the respective algorithm

We compare the efficiencies of the two algorithms for the generation of equal-weight events in Table 1. It shows that PolyChord achieves an overall efficiency of \(\epsilon = {0.0113(90)}\), which is almost three times as high as the efficiency of VEGAS. While for VEGAS the overall efficiency \(\epsilon \) is identical to the unweighting efficiency \(\epsilon _{\text {uw}} \), determined by the ratio of the average event weight to the maximal weight in the sample, for PolyChord we also have to take the slice sampling efficiency \(\epsilon _{\text {ss}} \) into account, which results from the thinning of the Markov Chain in the slice sampling step. Here, the total efficiency \(\epsilon = \epsilon _{\text {ss}} \epsilon _{\text {uw}} \) is dominated by the slice sampling efficiency. We point out that it is in the nature of the NS algorithm that the sample size is not deterministic. However, the variance is not very large and it is easily possible to merge several NS runs to obtain a larger sample.

Table 2 shows the integral estimates along with the corresponding uncertainty measures. While the pure Monte Carlo errors are of the same size for both algorithms, there is an additional uncertainty for NS: it carries an uncertainty on the weights of the sampled points, listed as \(\Delta _{w}\). This arises because NS uses the volume enclosed by the live points at each iteration to estimate the volume of the likelihood shell. The variance in this volume estimate can be sampled, which is reflected as a sample of alternative weights for each dead point. Summing up these alternative weight samples gives a spread of predictions for the total integral estimate, and the standard deviation of these is quoted as \(\Delta _{w}\). This additional uncertainty compounds the familiar statistical uncertainty, listed as \(\Delta _\mathrm {MC}\) for all calculations. In Appendix A, we present the procedure needed to combine the two NS uncertainties into a total uncertainty, \(\Delta \sigma _\mathrm {tot}\), as naively adding them in quadrature would overestimate the true error.
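The bookkeeping can be sketched as follows, assuming the alternative weight histories are available as a single array with one row per history (a hypothetical input format). The split into \(\Delta _{w}\) and \(\Delta _\mathrm {MC}\) mirrors the quantities quoted in Table 2; the exact combination rule is the one of Appendix A, not a quadrature sum.

```python
import numpy as np

def ns_uncertainties(weight_histories):
    """weight_histories: hypothetical array of shape (n_histories, n_events), where each
    row is one sampled weight history for the same set of NS dead points."""
    sigma_per_history = weight_histories.sum(axis=1)    # one integral estimate per history
    sigma = sigma_per_history.mean()
    delta_w = sigma_per_history.std(ddof=1)             # spread of the weight histories

    w_mean = weight_histories.mean(axis=0)              # mean weight of each event
    n_eff = w_mean.sum() ** 2 / np.sum(w_mean ** 2)     # effective number of fills
    delta_mc = sigma / np.sqrt(n_eff)                   # naive statistical (counting) error
    # The two components overlap; combine them following Appendix A, not in quadrature.
    return sigma, delta_w, delta_mc
```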

Table 2 Comparison of integrals calculated in the toy example with VEGAS and NS, along with the respective uncertainties

3 Application to gluon scattering

As a first application and benchmark for the Nested Sampling algorithm, we consider partonic gluon scattering processes into three-, four- and five-gluon final states at a fixed centre-of-mass energy of \(\sqrt{s}=1\,\text {TeV}\). These channels have a complicated phase space structure that is similar to processes with quarks or jets, while the corresponding amplitude expressions are rather straightforward to generate. The fixed initial and final states allow us to focus on the underlying sampling problem. For regularisation we apply cuts to the invariant masses of all pairs of final state gluons such that \(m_{ij}>{30}\,\hbox {GeV}\), and on the transverse momenta of all final state gluons such that \(p_{\mathrm {T},i}>{30}\,\hbox {GeV}\). The renormalisation scale is fixed to \(\mu _R=\sqrt{s}\). The matrix elements are calculated using a custom interface between PolyChord and the matrix element generator Comix [5] within the Sherpa event generator framework [54]. Three established methods are used to provide benchmarks against which to compare NS. The principal comparison is drawn to the HAAG sampler, optimised for QCD antenna structures [55], illustrating the exploration of phase space with the best a priori knowledge of the underlying physics included. It uses a cut-off parameter of \(s_0={900}\,\hbox {GeV}^{2}\). Alongside this, two algorithms that input no prior knowledge of the phase space, i.e. the integrand, are used: adaptive importance sampling as realised in the VEGAS algorithm [8], and a flat uniform sampler realised using the RAMBO algorithm [56, 57]. VEGAS remaps the variables of the RAMBO parametrisation using 50, 70 and 200 bins per dimension for the three-, four- and five-gluon case, respectively. The grid is trained in 10 iterations using 100k training points each. Note that the dimensionality of the phase space for n-gluon production is \(D=3n-4\), where total four-momentum conservation and on-shell conditions for the external particles are implicit.

As a first attempt to establish NS in this context, we treat the task of estimating the total and differential cross sections of the three processes starting with no prior knowledge of the underlying phase space distribution. For the purposes of running PolyChord we provide the flat RAMBO sampler as the prior, and the likelihood function provided is the squared matrix element. In contrast to HAAG, PolyChord performs the integration without any decomposition into channels, removing the need for any multi-channel mapping. NS is a flexible procedure, and the objective of the algorithm can be modified to perform a variety of tasks; a recent example presented NS for the computation of small p-values in the particle physics context [58]. To establish NS for the task of phase space integration in this study, a standard usage of PolyChord is employed, mostly following the default values commonly used in Bayesian inference problems.

The discussion of the application of NS to gluon-scattering processes is split into four parts. Firstly, the hyperparameters and general setup of PolyChord are explained in Sect. 3.1. In Sect. 3.2 a first validation of NS performing the core tasks of (differential) cross-section estimation from weighted events – against the HAAG algorithm – is presented. In Sect. 3.3 further information is given to contextualise the computational efficiency of NS against the alternative established tools for these tasks. Finally, a consideration of unweighted event generation with NS is presented in Sect. 3.4.

3.1 PolyChord hyperparameters

Table 3 PolyChord hyperparameters used for this analysis, parameters not listed follow the PolyChord defaults

The hyperparameters chosen to steer PolyChord are listed in Table 3. These represent a typical set of choices for a high resolution run with the objective of producing a large number of posterior samples. The number of live points is one of the parameters that is most free to tune, being effectively the resolution of the algorithm. Typically \(n_\text {live}\) larger than \(\mathcal {O}\left( 1000\right) \) gives diminishing returns on accuracy; to provide some context for the choice made in this work, Bayesian inference usage in particle physics has previously employed \(n_\text {live} =4000\) [59]. The particular event generation use case, partitioning the integral into arbitrarily small divisions (differential cross sections), logically favours a large \(n_\text {live}\) (resolution). The number of repeats controls the length of the slice sampling chains; the value chosen is the recommended default for reliable posterior sampling, whereas \(n_\text {rep} =n_\text {dim} \times 5\) is recommended for evidence (total integral) estimation. As this study aims to cover both differential and total cross sections, the smaller value is favoured, since there is a strong limit on the overall efficiency imposed by how many samples are needed to decorrelate the Markov Chains.

Table 4 Comparison of integrals calculated for the three-, four- and five-gluon processes using HAAG, VEGAS, NS and RAMBO, along with the respective uncertainties

An important point to note is how PolyChord treats unphysical values of the phase space variables, e.g. if they fall outside the fiducial phase space defined by cuts on the particle momenta. This is not an explicit hyperparameter of PolyChord, but rather how the algorithm treats points with zero likelihood. In both the established approaches and in PolyChord the sampling is performed in the unit hypercube, which is then translated to the physical variables that can be evaluated for consistency and rejected if they are not physically valid. One of the strengths of NS is that the default behaviour is to consider points which return zero likelihood as being excluded at the prior level. During the initial prior sampling phase, unphysical points are set to log-zero and the sampling proceeds until \(n_\text {prior}\) initial physical samples have been obtained. Provided each connected physical region contains some live points after this initial phase, the iterative phase of MCMC sampling will explore up to the unphysical boundary. This necessitates a correction factor to be applied to the integral, derived as the ratio of the total number of initial prior samples to the number of physically valid prior samples. In practice the correction factor is found in the prior_info file written out by PolyChord. An uncertainty on this correction can be derived from order statistics [60]; however, it was found to be negligibly small for the purposes of this study, so it is not included.

Another notable hyperparameter choice is the value of \(n_\text {prior}\). The number of prior samples is an important hyperparameter that would typically be set to some larger multiple of \(n_\text {live}\) in a Bayesian inference context; \(n_\text {prior} =10\times n_\text {live} \) would be considered sensible for a broad range of tasks. For the purpose of generating weighted events, using a larger value would generally be advantageous; however, increasing \(n_\text {prior}\) strongly decreases the efficiency of generating unweighted events. As the goal is to construct a generator taking an uninformed prior all the way through to unweighted events, the default value listed is used. It is notable, however, that this is a particular feature of starting from an uninformed prior: if more knowledge were included in the prior, then a longer phase of prior sampling would become advantageous. The final parameter noted, the factor by which to boost posterior samples, has no effect on PolyChord at runtime. Setting it equal to the number of repeats simply writes out the maximum number of dead points, which is what is needed in this scenario. All plots and tables in the remainder of this section are composed from a single run of PolyChord with these settings, with the additional entries in Table 4 demonstrating a join of ten such runs.
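To make the settings of Table 3 concrete, the following sketch shows how such a run might be configured through PolyChord's Python wrapper, pypolychord. The likelihood below is a placeholder (it is not the Comix/Sherpa interface used in this work), the cut is a stand-in for the fiducial phase space cuts, and the attribute and output names follow the pypolychord documentation as we understand it.

```python
import numpy as np
import pypolychord
from pypolychord.settings import PolyChordSettings

n_dims, n_derived = 5, 0                     # D = 3n - 4, e.g. 5 for the three-gluon final state

settings = PolyChordSettings(n_dims, n_derived)
settings.nlive = 1000                        # resolution of the run
settings.num_repeats = 2 * n_dims            # slice-sampling chain length
settings.nprior = settings.nlive             # minimal prior-sampling phase
settings.boost_posterior = settings.num_repeats   # write out the maximum number of dead points
settings.file_root = "gg_3g"                 # placeholder output root
settings.read_resume = False

def loglikelihood(theta):
    """Placeholder for log(|M|^2(Pi(theta)) J(theta)). Points failing the (stand-in) cuts are
    flagged as unphysical by returning a value at or below settings.logzero."""
    if np.any(theta < 0.05):
        return settings.logzero, []
    return float(-0.5 * np.sum(((theta - 0.5) / 0.1) ** 2)), []

def prior(hypercube):
    return hypercube                         # uninformed, flat prior on the unit hypercube

output = pypolychord.run_polychord(loglikelihood, n_dims, n_derived, settings, prior)
print(output.logZ, output.logZerr)           # log-integral and its error; the unphysical-volume
                                             # correction factor is read from the prior_info file
```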

3.2 Exploration and integrals

Before examining the performance of NS in detail, it is first important to validate that the technique is capable of fully exploring particle physics phase spaces in the chosen examples. The key test is to check whether various differential cross sections calculated with NS are statistically consistent with the established techniques. To do this, a single NS and a single HAAG sample of weighted events are produced, using approximately similar levels of computational overhead (more detail on this is given in Sect. 3.3). Both sets of weighted events are analysed using the default MC_JETS Rivet routine [61]. Rivet produces binned differential cross sections as functions of various physical observables of the outgoing gluons. For each process, the total cross section for the NS sample is normalised to the HAAG sample, and a range of fine-grained differential cross sections is calculated using both algorithms covering the following observables: \(\eta _{i}\), \(y_i\), \(p_{\mathrm {T},i}\), \(\Delta \phi _{ij}\), \(m_{ij}\), \(\Delta R_{ij}\), \(\Delta \eta _{ij}\), where \(i\ne j\) label the final state jets, reconstructed using the anti-\(k_T\) algorithm [62] with a radius parameter of \(R=0.4\) and \(p_{\mathrm {T}}>30\,\mathrm {GeV}\). The normalised difference between the NS and HAAG differential cross sections in each bin can be computed as,

$$\begin{aligned} \chi = \frac{d\sigma _{\mathrm {HAAG}} - d\sigma _{\mathrm {NS}}}{ \sqrt{\Delta _{\mathrm {HAAG}}^2 + \Delta _{\mathrm {NS}}^2} } \,, \end{aligned}$$
(6)

in effect this is the difference between the two algorithms normalised by their combined standard deviation. By collecting this \(\chi \) deviation across all the available bins in each process, a test of whether the two algorithms are convergent within their quoted uncertainties can be performed. Since over 500 bins are populated and considered in each process, it is expected that these \(\chi \) deviations should occur at a rate that is approximately normally distributed. This indeed appears to hold, and the summed density estimates across all observables are shown in Fig. 4, alongside an overlaid normal distribution with mean zero and variance one, \(\mathcal{{N}}(0,1)\), to illustrate the expected outcome. Two example variables that were used to build this global deviation are also shown: the leading jet \(p_\mathrm {T}\) in Fig. 5a and \(\Delta R_{12}\), the distance of the two leading jets in the \((\eta ,\phi )\) plane, in Fig. 5b.
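The per-bin comparison of Eq. (6) amounts to the few lines below; the input arrays are hypothetical stand-ins for the binned Rivet output of the two samples, and the toy check at the end only illustrates the expected standard-normal behaviour for compatible inputs.

```python
import numpy as np

def chi_per_bin(dsig_haag, err_haag, dsig_ns, err_ns):
    """Normalised per-bin deviation of Eq. (6) between two binned differential cross sections."""
    return (dsig_haag - dsig_ns) / np.sqrt(err_haag ** 2 + err_ns ** 2)

# Toy check: two compatible pseudo-measurements of 500 bins should give chi ~ N(0, 1).
rng = np.random.default_rng(7)
truth = rng.uniform(0.5, 2.0, size=500)
a, b = rng.normal(truth, 0.05), rng.normal(truth, 0.05)
chi = chi_per_bin(a, np.full(500, 0.05), b, np.full(500, 0.05))
print(f"mean = {chi.mean():+.3f}, std = {chi.std():.3f}")   # expect ~0 and ~1
```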

Fig. 4

Global rate of occurrence of the per-bin deviation, \(\chi \), between HAAG and NS, for each considered scattering process. A normally distributed equivalent deviation rate is shown for comparison

Fig. 5

Two example physical differential observables computed with weighted events using the HAAG and NS algorithms. The top panels show the physical distributions, the middle panels display the relative component error sources, and the bottom panel displays the normalised deviation. The deviation plot has been normalised such that \(\chi =1\) corresponds to an expected \(1\sigma \) deviation of a Gaussian distribution. Note that for illustrative purposes the cross sections for the four- and five-gluon processes have been scaled by global factors

The composition of the quoted uncertainty for the two algorithms differs, demonstrating an important feature of an NS calculation. For HAAG, and IS in general, it is conventional to quote the uncertainty as the standard error from the effective number of fills in a bin. Nested Sampling, on the other hand, introduces an uncertainty on the weights used to fill the histograms themselves, effectively giving rise to multiple weight histories that must be sampled to derive the correct uncertainty on the NS calculation. Details of this calculation are given in Appendix A. In summary, the alternative weight histories give an overlapping measure of the statistical uncertainty, so this effect must be accounted for in situ alongside taking the standard deviation of the weight histories. To contextualise this, the middle panels in Fig. 5 show the correct combined uncertainty (using the recipe from Appendix A) as a grey band, against the bands derived from the standard error of each individual algorithm (henceforth \(\Delta _\mathrm {MC}\)) as dashed lines, and the complete NS error treatment as a dotted line. The standard error (dashed) NS band in these panels is a naive estimation of the full NS uncertainty (dotted); however, this illustrates an important point: at the level of fine-grained differential observables the NS uncertainty is dominated by statistics and is hence reducible, as one would expect, by repeated runs. Based on the example observables we can initially conclude that, whilst both algorithms appear compatible, when using weighted events NS generally has a larger uncertainty than HAAG across most of the range (given a roughly equivalent computational overhead). However, further inspection of the resulting unweighted event samples derived from these weighted samples in the remaining sections reveals a more competitive picture between the two algorithms.

The estimates of the total cross sections, derived from the sum of weighted samples and provided in Table 4, give an alternative validation that NS sufficiently explores the phase space, by checking that compatible estimates of the cross sections are produced by all the methods reviewed in this study. The central estimates of the total cross sections are generally consistent within the established error sources for all calculations considered. In this table the components of the error calculation for NS are listed separately: \(\Delta _{w}\) is the standard deviation resulting from the alternative weight histories and \(\Delta _\mathrm {MC}\) is the standard error naively taken from the mean of the alternative NS weights. In contrast to the differential observables, the naive counting uncertainty is small and so has a negligible effect at the level of total cross sections. In summary, for a total cross section the spread of alternative weight histories gives a rough estimate of the total error, whereas for a fine-grained differential cross section the standard error dominates. The way to correctly account for the effect of counting statistics within the weight histories is given in Appendix A.

Table 5 Comparison of the four algorithms for the three processes in terms of the size of the event samples produced. \(N_\mathcal {L} \) gives the number of matrix element evaluations, \(N_W\) the number of weighted events, \(N_{W,\mathrm {eff}}\) the effective number of weighted events and \(N_\mathrm {equal}\) the derived number of equal-weight events. An MC slice sampling efficiency, \(\epsilon _{\text {ss}}\), is listed for NS. A total, \(\epsilon \), and an unweighting, \(\epsilon _{\text {uw}}\), efficiency are listed for all algorithms

Repeated runs of NS will reduce these uncertainties. The anesthetic package [63] is used to analyse the NS runs throughout this paper, and contains a utility to join samples. Once samples are consistently joined into a larger sample, the uncertainties can be derived as already detailed. The result of joining 10 equivalent NS runs with the previously motivated hyperparameters is also listed in Table 4. Joining 10 runs affects \(\Delta \sigma _\mathrm {tot}\) for NS in two ways: reducing the spread of the weighted sums composing \(\Delta _{w}\) (i.e. reducing \(\Delta _\mathrm {MC}\)), and reducing the variance of the distribution of each weight itself (i.e. the part of \(\Delta _{w}\) that does not overlap with \(\Delta _\mathrm {MC}\)). The former is reduced by simply having a larger sample, increasing the number of effective fills by a factor of \(\sim \)10 in this case, while the latter is reduced due to the increased effective number of live points used for the volume estimation.
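In practice the joining step might look like the sketch below. The file roots are placeholders, and we assume that anesthetic's read_chains, merge_nested_samples and logZ utilities behave as described in its documentation; this is not a prescription from the original analysis.

```python
from anesthetic import read_chains
from anesthetic.samples import merge_nested_samples

# Read ten independent PolyChord runs (placeholder file roots) and join them consistently.
runs = [read_chains(f"chains/gg_3g_run{i}") for i in range(10)]
merged = merge_nested_samples(runs)

# Sampling the evidence over alternative weight histories gives the Delta_w spread;
# logZ(n) is assumed here to return n such samples.
logZ_samples = merged.logZ(1000)
print(logZ_samples.mean(), logZ_samples.std())
```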

3.3 Efficiency of event generation

An example particle physics workflow for this gluon scattering problem would be to take HAAG as an initial mapping of the phase space (effectively representing the best prior knowledge of the problem), and to use VEGAS to refine the proposal distribution in order to generate weighted events as efficiently as possible. Of the three existing tools presented in this study for comparison (HAAG, VEGAS and RAMBO), NS bears most similarity to VEGAS, in that both algorithms learn the structure of the target integrand. To this end an atypical usage of VEGAS is employed, testing how well VEGAS can learn a proposal distribution from an uninformed starting point (RAMBO). This is equivalent to how NS was employed, starting from an uninformed prior (RAMBO) and generating posterior samples via Nested Sampling. It was motivated so far that a roughly similar computational cost was used for the previous convergence checks, and that the hyperparameters of PolyChord were chosen to emphasise efficient generation of unweighted events. In what follows, we analyse this key issue of computational efficiency more precisely.

The statistics from a single run of the four algorithms for the three selected processes are listed in Table 5. NS is non-deterministic in terms of the number of matrix element evaluations (\(N_\mathcal {L} \)), instead terminating on a pre-determined convergence criterion for the integral. HAAG, VEGAS and RAMBO are each used to generate exactly 10M weighted events. The chosen PolyChord hyperparameters roughly align the NS method with the other three in terms of computational cost. One striking difference comes from the Markov Chain nature of NS. Default usage only retains a fraction of the total \(\mathcal {L}\) evaluations, inversely proportional to \(n_\text {rep}\). This results in a smaller number of retained weighted events, \(N_W\), than the number of \(\mathcal {L}\) evaluations, \(N_\mathcal {L} \), for NS. However, the retained weighted events by construction match the underlying distribution much more closely than those of the other methods, resulting in a higher unweighting efficiency, \(\epsilon _{\text {uw}}\), for the NS sample. Exact equal-weight unweighting can be achieved by accepting events with a probability proportional to the share of the sample weight they carry; this operation is performed for all samples of weighted events and the number of retained events is quoted as \(N_\mathrm {equal}\). NS as an unweighted event generator has some additional complexity due to the uncertainty in the weights themselves; this is given more attention in Sect. 3.4.
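The unweighting step and the efficiencies quoted in Table 5 can be summarised in a few lines. The weights below are synthetic stand-ins, and the relation \(N_\mathcal {L} \approx n_\text {rep}\times N_W\) used for NS is only an approximate restatement of the chain thinning described above.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep = 4

w = rng.pareto(3.0, size=1_000_000) + 1.0   # synthetic event weights (heavy-tailed stand-in)
keep = rng.random(w.size) < w / w.max()     # accept with probability w / w_max

N_W = w.size                                # retained weighted events
N_equal = int(keep.sum())                   # equal-weight events after compression
eps_uw = w.mean() / w.max()                 # unweighting efficiency
N_L = n_rep * N_W                           # approximate likelihood calls for NS (thinning)
eps_ss = N_W / N_L                          # slice sampling efficiency ~ 1 / n_rep
eps_total = N_equal / N_L                   # total efficiency, eps = eps_ss * eps_uw
print(f"eps_uw = {eps_uw:.3f}, eps_ss = {eps_ss:.3f}, eps = {eps_total:.4f}")
```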

Due to the differences in \(N_\mathcal {L} \) between NS and the other methods, it is most instructive to compare the total efficiency in producing unweighted events, \(\epsilon =N_\mathrm {equal}/N_\mathcal {L} \). RAMBO as the baseline illustrates the performance one would expect when inputting no prior knowledge and not adapting to any acquired knowledge; as such, RAMBO yields a tiny \(\epsilon \). HAAG represents the performance using the best state of prior knowledge but without any adaptation; in these tests this represents the best attainable \(\epsilon \). VEGAS and NS start from a similar point, both using RAMBO as an uninformed state of prior knowledge, but adapt to better approximate the phase space distribution as information is acquired. VEGAS starts with a higher efficiency than NS for the 3-gluon process, but its efficiency drops by approximately an order of magnitude as the dimensionality of phase space is increased to the 5-gluon process. NS maintains a consistent efficiency of approximately one percent, competitive with the consistent efficiency of approximately three percent obtained by HAAG.

As the key point of comparison for this issue is the efficiency, \(\epsilon \), it is highlighted with an additional visualisation in Fig. 6. The scaling behaviour of the efficiency of each algorithm as a function of the number of outgoing gluons (corresponding to an increase in phase space dimensionality) is plotted for NS, VEGAS and HAAG. From the same starting point, NS and VEGAS can both learn a representation of the phase space, and do so in a way that yields an efficiency comparable to the static best available prior knowledge in HAAG. As the dimensionality of the space increases it appears that VEGAS starts to suffer in how accurately it can learn the mapping, whereas NS is still able to learn the mapping in a consistently efficient manner.

Fig. 6

Visualisation of the efficiencies listed in Table 5

3.4 Unweighted event generation

The fact that NS leads to a set of alternative weight histories poses a technical challenge for operating it as a generator of unweighted events in the expected manner. Exact unweighting, compressing the weighted sample to strictly equally weighted events, leads to a different set of events being accepted for each weight history. Representative yields of unweighted events can be calculated, as shown in Table 5, using the mean weight for each event, but the resulting differential distributions will underestimate the uncertainty if it is quoted simply as the standard error in the bin, as described in Appendix A. The correct uncertainty recipe can be propagated through naively by separately unweighting each weight history; however, this requires saving as many event samples as there are weight variations. Partial unweighting is commonly used in HEP event generation to allow a slight deviation from strict unit weights, increasing efficiency in practical settings. A modification of the partial unweighting procedure could be used to propagate the spread of weights to variations around accepted, approximately unit-weight, events.

To conclude the exploration of the properties of NS as a generator for particle physics, a representative physical distribution calculated from a sample of exact unit-weight events is shown in Fig. 7. This sample is derived from the same weighted sample described in Table 5 and previously presented as a weighted event sample in Fig. 5a. The full set of NS variation weights is used to calculate the mean weight for each event, which is used to unweight the sample; for the chosen observable this is a very reasonable approximation, as the fine binning means the standard error is the dominant uncertainty. The range of the leading jet transverse momentum has been extended into the tail of this distribution by modifying the default Rivet routine. This distribution largely reflects the information about the total efficiency previously illustrated in Fig. 6, projected onto a familiar differential observable. The total efficiency, \(\epsilon \), was noted as being approximately one percent for NS, compared to approximately three percent for HAAG across all processes. If the total number of matrix element evaluations, \(N_\mathcal {L} \), were made equal across all algorithms and processes, the performance would be correspondingly consistent.

Fig. 7

The equivalent leading jet transverse momentum observable as calculated in Fig. 5a, using an exact unit weight compression of the same samples. A modified version of the default MC_JETS routine has been used to extend the \(p_\mathrm {T}\) range shown

4 Future research directions

Throughout Sect. 3, the performance of Nested Sampling in the context of particle physics phase space sampling and event generation was presented. A single choice of hyperparameters was made, effectively performing a single NS run as an entire end-to-end event generator: starting from zero knowledge of the phase space all the way through to generating unweighted events. Simplifying the potential options of NS to a single version of the algorithm was a deliberate choice to more clearly illustrate the performance of NS in this new context; using the same settings for multiple tasks gives multiple orthogonal views of how the algorithm performs. However, this was a limiting choice: NS has a number of variants and applications that could be tuned more effectively to a subset of the tasks presented. Some of the possible simple alterations – such as increasing \(n_\text {prior}\) to improve weighted event generation at the expense of unweighting efficiency – were already motivated in this paper. In this section we outline four broad topics that extend the workflow presented here, bringing together further ideas from the worlds of Nested Sampling and phase space exploration.

4.1 Physics challenges in event generation

The physical processes studied in this work, up to 5-gluon scattering, are representative of the complexity of phase space calculation needed for the current precision demands of the LHC experiment collaborations [64]. However, part of the motivation for this work, and indeed the broader increased interest in phase space integration methods, is the impending breaking point current pipelines face under the increased precision challenges of the HL-LHC programme. Firstly, we observe that the phase space dimensionality of the highest multiplicity process studied here is 11. In broader Bayesian inference terms this is rather small, with NS typically being used for problems of \(\mathcal {O}\left( 10\right) \) to \(\mathcal {O}\left( 100\right) \) dimensions, where it is uniquely able to perform numerical integration without approximation or strictly matching prior knowledge. The PolyChord implementation is styled as next-generation Nested Sampling, designed to have polynomial scaling with dimensionality and aiming for robust performance as inference is extended to \(\mathcal {O}\left( 100\right) \) dimensions. Earlier implementations of NS, such as MultiNest [46], whilst having worse dimensional scaling properties, may be a useful avenue of investigation for the lower dimensional problems considered in this paper.

This work validated NS in a context where current tools can still perform the required tasks, albeit at times at immense computational cost. Requirements from the HL-LHC strain the existing LHC event generation pipeline in many ways, and pushing the sampling problem to higher dimensions is no exception [2]. Importance Sampling becomes exponentially more sensitive to how closely the proposal distribution matches the target in higher dimensions, a clear challenge for particle physics in two directions: multileg processes rapidly increase the sampling dimension [65], and the corresponding radiative corrections (real and virtual) make it increasingly hard to provide an accurate proposal, e.g. through the sheer number of phase space channels needed and by having to probe deep into singular phase space regions [66]. We propose that NS is an excellent complement to further investigation on both these fronts. The robust dimensional scaling of NS illustrated against VEGAS in Fig. 6 demonstrates solid performance with increasing dimension, and the fact that this is attained while adhering to an uninformed prior is promising for scenarios where accurate proposals are harder to construct.

4.2 Using prior knowledge

Perhaps the most obvious choice that makes the application presented here stylised is in always starting from an uninformed prior state of knowledge. Using Eqs. (2) and (4), the cross section integral with a phase space mapping was motivated as being exactly the Bayesian evidence integral with a choice of prior. To that end there is no real distinction between taking the non-uniform HAAG distribution as the prior instead of the flat RAMBO density that was used in this study. In this respect NS could be styled as learning an additional compression to the posterior distribution, refining the static proposal distributions typically employed to initiate the generation of a phase space mapping (noting that this is precisely what VEGAS aims to do in this context).

Naively applying a non-flat mapping, however, exposes the conflicting aims at play in this set of problems: efficiently generating events from a strongly peaked distribution, and generating high statistics estimates of the tails of the same distribution. Taking a flat prior is well suited to the latter problem, whereas taking a HAAG prior is better suited to the former. One particular hyperparameter of PolyChord that was introduced can be tuned for this purpose: the number of prior samples, \(n_\text {prior}\). If future work is to use a non-flat, partially informed starting point, increasing \(n_\text {prior}\) well above the minimum (equal to the number of live points) used in this study would be needed. A more complete direction for further work would be to investigate the possibility of mixing multiple proposal distributions [67, 68].

As a demonstration, we again apply NS to the toy example of Sect. 2.1, this time using a non-uniform prior distribution. While a good prior would be an approximation of the target distribution, we purposely choose one that misses an important feature of the target, the straight line segment, which the sampler then still has to explore. Considering that in HEP applications the prior knowledge may be encoded in the mixture distributions of a multi-channel importance sampler, this is an extreme version of a realistic situation. As the number of channels typically grows dramatically with increasing final-state particle multiplicity, e.g. factorially when channels correspond to the topologies of contributing Feynman diagrams, one might choose to disable some sub-dominant channels in order to avoid a prohibitively large set of channels. However, this would lead to a mis-modelling of the target in certain phase-space regions.

Here we use only the ring part of the target, truncated on a circle that covers the unit hypercube, as our prior. Without an additional coordinate transformation this prior would not be of much use for Vegas, as the line part remains on the diagonal. To sample from the prior, we first transform to polar coordinates. We then sample the angle uniformly and the radial coordinate from a Cauchy distribution truncated to the interval \((0, 1/\sqrt{2}]\). In order to retain good coverage of the tails despite the strongly peaked prior, we increase \(n_\text {prior}\) to \(50\times n_\text {live}\). This results in a total efficiency of \(\epsilon = {0.037(4)}\), more than three times the value obtained with a uniform prior, cf. Table 1. While the unweighting efficiency reduces to \(\epsilon _{\text {uw}} = {0.17(2)}\), the slice sampling efficiency increases to \(\epsilon _{\text {ss}} = {0.216(7)}\). In Fig. 8 we show the ratio between the target function and the PolyChord sampling distribution. Compared to Fig. 2c, the ratio has a smaller range of values. Along the peak of the ring part of the target function, the ratio is approximately one. The largest values are found around the line segment, where PolyChord generates up to ten times fewer samples than required by the target distribution. We conclude that even with an intentionally poor prior distribution, PolyChord benefits from the prior knowledge in terms of efficiency and still correctly samples the target distribution, including the features absent from the prior.
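In pypolychord the prior is supplied as a transform from the unit hypercube to the physical parameters, so the construction above can be sketched as follows; the ring centre, ring radius and Cauchy width are hypothetical placeholders, since the actual toy-model parameters are fixed in Sect. 2.1:

import numpy as np

CENTRE = np.array([0.5, 0.5])   # assumed centre of the ring
R0, GAMMA = 0.3, 0.05           # placeholder ring radius and Cauchy width
R_MAX = 1.0 / np.sqrt(2.0)      # radius of a circle covering the unit square

def ring_prior(hypercube):
    """Map unit-hypercube coordinates to (x, y) under the ring prior."""
    u_theta, u_r = hypercube
    theta = 2.0 * np.pi * u_theta
    # inverse CDF of a Cauchy(R0, GAMMA) truncated to (0, R_MAX]
    cdf_lo = 0.5 + np.arctan((0.0 - R0) / GAMMA) / np.pi
    cdf_hi = 0.5 + np.arctan((R_MAX - R0) / GAMMA) / np.pi
    u = cdf_lo + u_r * (cdf_hi - cdf_lo)
    r = R0 + GAMMA * np.tan(np.pi * (u - 0.5))
    # note: large radii can map outside the unit square at some angles; such
    # points would have to be handled by the likelihood (e.g. zero weight)
    return CENTRE + r * np.array([np.cos(theta), np.sin(theta)])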

Fig. 8
The ratio of the target function of the two-dimensional toy example and the probability density function of PolyChord using a non-uniform prior distribution. Black histogram bins have not been filled by any data due to limited sample size

4.3 Dynamic nested sampling

In addition to using a more informed prior to initiate the Nested Sampling process, a previous NS run can be used to further tune the algorithm itself to a particular problem. This is an existing idea in the literature known as dynamic Nested Sampling [69]. Dynamic NS uses information about the likelihood shells acquired in a previous NS run to vary the number of live points dynamically throughout the run. This results in a more efficient allocation of the computation towards the core aim of compressing the prior to the posterior. We expect that this would only increase the efficiency of the unweighting process, as the density of weighted events would be trimmed to match the underlying phase space density even more closely. Dynamic Nested Sampling naturally combines with the proposal of using prior knowledge to form a more familiar generator chain, albeit one that is driven primarily by NS. This mirrors the current established usage of Vegas in this context: Vegas refines the initial mapping by a redistribution of the input variables, so that events can subsequently be generated more efficiently from the acquired mapping.
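Dynamic Nested Sampling is not part of PolyChord itself; as a purely illustrative sketch of the idea, the dynesty package (an independent implementation of dynamic NS, not used in this work) allows the dynamic live-point allocation to be steered towards posterior mass rather than evidence precision. The two-dimensional toy likelihood and flat prior transform below are placeholders:

import numpy as np
from dynesty import DynamicNestedSampler

ndim = 2

def loglike(theta):
    # placeholder: a single Gaussian peak standing in for the phase space weight
    return -0.5 * np.sum(((theta - 0.5) / 0.05) ** 2)

def prior_transform(u):
    # flat prior on the unit hypercube
    return u

sampler = DynamicNestedSampler(loglike, prior_transform, ndim)
# pfrac = 1.0 allocates the dynamic live points entirely to posterior mass,
# i.e. towards the region from which events are ultimately unweighted
sampler.run_nested(wt_kwargs={"pfrac": 1.0})
results = sampler.results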

4.4 Connection to modern machine learning techniques

There has been a great deal of recent activity coincident with this work, approaching similar sets of problems in particle physics event generation using modern Machine Learning (ML) techniques [70]. Much of this work is still exploratory in nature and covers such a broad range of activity that comprehensively reviewing the potential for combining ML and NS is beyond the scope of this work. It is, however, clear that there is strong potential to include NS in a pipeline that modern ML is already aiming to optimise. To that end, we identify a particular technique that has been studied previously in the particle physics context: using Normalising Flows to train phase space mappings [29,30,31]. In spirit, a flow-based approach, training an invertible probabilistic mapping between prior and posterior, bears a great deal of similarity to the core compression idea behind Nested Sampling. The potential in dovetailing Nested Sampling with a flow-based approach has been noted in the NS literature [71], further motivating the potential for synergy here.

The ability of NS to construct mappings of high-dimensional phase spaces without needing any strong prior knowledge makes it an ideal forward model with which to train a Normalising Flow. In effect this replaces the generator part of the process with an importance sampler, whilst still using NS to generate the mappings. This is particularly appealing in this context, as the computational overhead required to decorrelate the Markov chains imposes a harsh limit on the efficiency of a pure NS-based approach. Combining these techniques in this way could retain the desirable features of both and serve to mitigate the ever increasing computational demands of energy-frontier particle physics.
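One minimal way such a combination could look, sketched here with the nflows library and not drawn from any of the cited works, is to fit a flow to the weighted samples of a completed NS run by weighted maximum likelihood and then use the trained flow as an importance sampling proposal; the sample arrays, network sizes and training schedule are placeholders:

import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.autoregressive import MaskedAffineAutoregressiveTransform
from nflows.transforms.permutations import ReversePermutation

ndim = 11  # e.g. the 5-gluon phase space dimensionality

# weighted posterior samples (x_i, w_i), assumed to come from a previous NS run
x = torch.rand(10_000, ndim)   # placeholder for NS samples on the unit hypercube
w = torch.ones(10_000)         # placeholder for the corresponding NS weights

layers = []
for _ in range(4):
    layers.append(ReversePermutation(features=ndim))
    layers.append(MaskedAffineAutoregressiveTransform(features=ndim, hidden_features=64))
flow = Flow(CompositeTransform(layers), StandardNormal([ndim]))

optimiser = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(2000):
    optimiser.zero_grad()
    # weighted negative log-likelihood: the flow learns the NS-compressed density
    loss = -(w * flow.log_prob(inputs=x)).sum() / w.sum()
    loss.backward()
    optimiser.step()

# the trained flow then serves as an importance sampling proposal:
# proposals, log_q = flow.sample_and_log_prob(1000)

In a realistic application one would additionally map the bounded hypercube to an unbounded space (or use a flow with bounded support) before training; the sketch omits this for brevity.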

We close by noting that Normalising Flows have recently also attracted attention in the area of lattice field theory, see e.g. [72, 73], to address the sampling of multimodal target functions. We envisage that Nested Sampling could be applied in these settings as well.

5 Conclusions

The exploratory study presented here had two main aims: firstly, to introduce the technique of Nested Sampling, applied to a realistic problem, to researchers in the particle physics community; secondly, to provide a translation back to researchers working on Bayesian inference techniques, presenting an important and active set of problems in particle physics to which Nested Sampling could make a valuable contribution. The physical example presented used PolyChord to perform an end-to-end generation of events without any input prior knowledge. This is a stylised version of the event generator problem, intended to validate Nested Sampling in this new context and demonstrate some key features. For the considered multi-gluon production processes Nested Sampling was able to learn a mapping in an efficient manner that exhibits promising scaling properties with phase space dimension. We have outlined some potential future research directions, highlighting where the strengths of this approach could be most effective and how to embed Nested Sampling in a more complete event generator workflow. Along these lines, we envisage an implementation of the Nested Sampling technique for the Sherpa event generator framework [54], possibly also supporting operation on GPUs [74]. This will provide additional means to address the computing challenges for event generation posed by the upcoming LHC runs.