Accelerating HEP simulations with Neural Importance Sampling

Abstract: Many high-energy-physics (HEP) simulations for the LHC rely on Monte Carlo integration with importance sampling, typically by means of the VEGAS algorithm. However, complex high-precision calculations have become a challenge for this standard toolbox, as the approach suffers from poor performance in complex cases. As a result, there has been keen interest in HEP in modern machine learning to power adaptive sampling. While previous studies have shown the potential of normalizing-flow-powered neural importance sampling (NIS) over VEGAS, there remains a gap in accessible tools tailored to non-experts. In response, we introduce ZüNIS, a fully automated NIS library designed to bridge this divide, while at the same time providing the infrastructure to customise the algorithm for challenging tasks. After a general introduction to NIS, we first show how to extend its original formulation to reuse samples over multiple gradient steps while guaranteeing stable training, yielding a significant improvement for slow functions. Next, we introduce the structure of the library, which can be used by non-experts with minimal effort and is extensively documented, a prerequisite for becoming a mature tool for the wider HEP community. We present systematic benchmark results on both toy and physics examples, and stress the benefit of providing different survey strategies, which allow higher performance in challenging cases. We show that ZüNIS achieves high performance on a range of problems with limited fine-tuning.


Introduction
High-energy-physics (HEP) simulations are at the heart of the Large Hadron Collider (LHC) program for studying the fundamental laws of nature. Most HEP predictions are expressed as expectation values, evaluated numerically as Monte Carlo (MC) integrals. This permits both the integration of very complex functions and the reproduction of the data selection process used by experiments.
Most HEP simulation tools [1][2][3] perform MC integrals using importance sampling, which allows points to be sampled adaptively to speed up convergence while keeping samples independent and identically distributed. This is crucial for reproducing experimental analyses, which can only ingest uniformly-weighted data, typically produced by rejection sampling (see appendix A).
The most popular tool for optimizing importance sampling in HEP is by far the VEGAS algorithm [4], which fights the curse of dimensionality by assuming no correlations between the variables. While this is rarely the case in general, a good understanding of the integrand function can help significantly. Indeed, optimized parametrizations using multichannelling [5][6][7] have become bread-and-butter tools for HEP event generation simulators, with good success for leading-order (LO) calculations. However, as simulations become more complex, either through more complex final states or through higher orders in perturbation theory, performance degrades fast.
We suspect that the reasons for this are twofold. First, the lack of exchange between the two scientific communities is probably a cause and should not be neglected. Second, the impetus in the HEP community is to move toward more "black box" approaches where little is known of the structure of the integrand. This is one reason why approaches like PMC are less common, as they are sensitive to the initial proposal density [12,22]. As the main practical goal of this paper is to reduce the reliance of HEP simulations on the careful tuning of integrators, we will focus on comparing our work with the de facto HEP standard: the VEGAS algorithm. Alternative adaptive approaches also exist, such as Nested Sampling [23,24], as employed in the SHERPA event generator [25]; however, to maintain focus and because VEGAS is the most widespread tool, we concentrate our comparison on it. Our comparison is based on one of the most recent implementations [20], which makes use of stratified sampling to boost performance considerably.
There is much room for investing computational time into improving sampling [26]: modern HEP theoretical calculations are taking on epic proportions and can require hours for a single function evaluation [27]. Furthermore, unweighting samples can be extremely inefficient, with upwards of 90% of sampled points discarded [28]. More powerful importance sampling algorithms would therefore be a welcome improvement [29,30].
Several approaches have been explored to tackle the challenge of efficient sampling in high-energy-physics simulations. Initial efforts utilized classical neural networks for sampling, but often encountered prohibitive computational costs [31][32][33]. Another avenue of research has been the adoption of generative models, such as generative adversarial networks (GANs), which have shown promise in accelerating sampling significantly [34][35][36][37][38][39]. While such approaches do improve sampling speed by a large factor, they have major limitations. In particular, they offer no theoretical guarantee of providing a correct answer on average [40,41].
On the other hand, tools like SHERPA offer an unbiased NN-powered approach to unweighting [42]. It is worth noting that reweighting techniques, such as the recently developed Deeply Conditionalized Transporter Networks (DCTR) [43][44][45][46], have become standard in boosting the efficiency of generated samples. Despite these advancements, drawbacks persist. For instance, the quality of information in the reweighted samples is restricted to the training sample [40,41,47].
To avoid these disadvantages, our work exploits Neural Importance Sampling (NIS) [48,49], which relies on normalizing flows and has strong theoretical guarantees.
A number of works have been published on using NIS for LHC simulations [50][51][52][53][54][55], as well as closely related variations [40,56], but most studies have focused on preliminary investigations of performance rather than on the practical usability of the method. Indeed, training requires function evaluations, which are precisely what we are trying to minimize; data-efficient training is therefore an important but often under-appreciated concern. Similarly, stability of training without the need for fine-tuning is essential for non-expert users. Furthermore, few authors have provided easily usable open-source code, making the adoption of the technique in the HEP community difficult.
Our contributions to improving this situation can be summarized in three items:
• The introduction of a new adaptive training algorithm for NIS. This permits the re-use of sampled points over multiple gradient-descent steps, making NIS much more data-efficient while at the same time guaranteeing stable and reliable training.
• A comparative study on the behaviour of different loss functions and their impact on performance for different processes.
• The introduction of ZüNIS, a PyTorch-based library providing robust NIS tools usable by non-experts. It implements previously developed ideas as well as our new training procedure, and is extensively documented.

Importance sampling as an optimization problem
Importance sampling relies on the interpretation of integrals as expectation values. Indeed, let us consider an integral over a finite volume Ω:

I = ∫_Ω dx f(x).

Let p be a strictly positive probability distribution over Ω; we can re-express our integral as the expectation value

I = E[f(X)/p(X)],  X ∼ p,

whose Monte Carlo estimator over N points has mean I and standard deviation σ(f, p)/√N, where σ(f, p) is the standard deviation of f(X)/p(X) for X ∼ p. The problem statement of importance sampling is to find the probability distribution function p that minimizes the variance of this estimator for a given N. In the case of multichannelling [7,57], finding a single probability distribution function p is replaced by finding weights α_i and mappings g_i such that

p(x) = Σ_i α_i p_i(x)  and  Σ_i α_i = 1,

where each p_i is the distribution induced by the mapping g_i. The g_i are often chosen according to prior physics knowledge. As we are interested in a black-box approach, we do not use different channels but instead try to learn an optimal probability distribution function directly. In Neural Importance Sampling, we rely on normalizing flows to approximate the optimal distribution, which we can optimize using stochastic gradient descent. It has been shown that both approaches can be successfully combined [54,55]; nevertheless, in this work we focus on achieving improvements with a single channel, in order to minimize the required prior knowledge of the integrand.
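As a concrete illustration of the variance reduction at stake, the following sketch (plain NumPy; the Gaussian integrand and proposal are hypothetical choices for this example only, not taken from the paper) compares uniform sampling with importance sampling from a roughly matched proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A peaked integrand on [0, 1]; its exact integral is very close to 1.
    return np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2) / (0.05 * np.sqrt(2 * np.pi))

N = 100_000

# Uniform sampling: p(x) = 1 on [0, 1]
x_uni = rng.uniform(0.0, 1.0, N)
w_uni = f(x_uni)

# Importance sampling from a proposal roughly matched to f:
# p ~ N(0.5, 0.1) restricted to [0, 1] (truncation is at 5 sigma,
# so its effect on normalization is negligible for this illustration)
x_imp = rng.normal(0.5, 0.1, N)
x_imp = x_imp[(x_imp > 0) & (x_imp < 1)]
p_imp = np.exp(-0.5 * ((x_imp - 0.5) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))
w_imp = f(x_imp) / p_imp

print(w_uni.mean(), w_uni.std() / np.sqrt(len(w_uni)))  # noisy estimate of I
print(w_imp.mean(), w_imp.std() / np.sqrt(len(w_imp)))  # much smaller error
```

Both estimators target the same integral, but the weights f/p of the matched proposal fluctuate far less, which is exactly the quantity σ(f, p) that importance sampling minimizes.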

Normalizing flows and coupling cells
Normalizing flows [58][59][60][61] provide a way to generate complex probability distribution functions from simpler ones using parametric changes of variables that can be learned to approximate a target distribution.As such, normalizing flows are diffeomorphisms: invertible, (nearly-everywhere) differentiable mappings with a differentiable inverse.
Indeed, if u ∼ p(u), then T(u) = x ∼ q(x), where

q(x) = p(T⁻¹(x)) / |J_T(T⁻¹(x))|,   (2.7)

and J_T is the Jacobian determinant of T. In the world of machine learning, T is called a normalizing flow and is typically part of a parametric family of diffeomorphisms T(·, θ) such that the gradients ∇_θ J_T are tractable.
Coupling cell mappings perfectly satisfy this requirement [62][63][64]: they are neural-network-parametrized bijections whose Jacobian factor can be obtained analytically without backpropagation or expensive determinant calculations. As such, they provide a good candidate for importance sampling, as long as they can be trained to learn an un-normalized target function, which is exactly what neural importance sampling proposes.
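To make the construction concrete, here is a minimal, framework-free sketch of an affine coupling cell; the toy scale and shift functions stand in for the conditioning neural networks, and the whole example illustrates the generic coupling-cell idea rather than the exact ZüNIS classes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "networks" for log-scale and shift; in a real coupling cell these
# would be neural networks conditioned on the masked coordinates.
def s(xa): return 0.5 * np.tanh(xa)
def t(xa): return 0.1 * xa

def coupling_forward(x):
    # Affine coupling cell in 2D: the first coordinate passes through,
    # the second is transformed conditioned on the first.
    xa, xb = x[:, 0], x[:, 1]
    yb = xb * np.exp(s(xa)) + t(xa)
    log_jac = s(xa)                        # log |det J|, analytic
    return np.stack([xa, yb], axis=1), log_jac

def coupling_inverse(y):
    ya, yb = y[:, 0], y[:, 1]
    xb = (yb - t(ya)) * np.exp(-s(ya))
    return np.stack([ya, xb], axis=1), -s(ya)

x = rng.normal(size=(4, 2))
y, lj = coupling_forward(x)
x_back, lj_inv = coupling_inverse(y)
print(np.allclose(x, x_back), np.allclose(lj, -lj_inv))  # True True
```

Because the transformed coordinate depends only on the untouched one, the Jacobian is triangular and its determinant is read off directly, with no backpropagation or matrix determinant required.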
The coupling cells used in the following implement the transformations proposed in [62][63][64], which are also used in i-flow and MadNIS. Although all these transformations are implemented, we focus our study on piecewise-quadratic layers, as these reach a high level of expressiveness without requiring an excessive number of parameters. This is important, as a very large number of transformation parameters becomes expensive to learn.
In contrast to the NN architecture used by MadNIS in [54] and the architecture chosen by i-flow, we use by default a considerably larger NN (a rectangular DNN with 8 hidden layers, each with 256 nodes), as we learn the probability distribution function directly instead of splitting it into multiple channels. This also guarantees enough capacity to allow stable training on a wide range of integrands. Although we also use LeakyReLU activation functions, we additionally add an input activation layer which scales the input and removes any offset to increase the stability of training. Additionally, we implement the optimal masking function proposed by i-flow [50].

Neural importance sampling
Neural importance sampling was introduced in the context of computer graphics [64] and proposes to use normalizing flows as a family of probability distributions over which to solve the minimization problem of importance sampling.
Of course, to actually do so, one needs a way to explicitly evaluate the loss L(θ), and the original neural importance sampling approach proposes to approximate it using importance sampling. For the variance-based loss,

L(θ) = ∫_Ω dx f(x)²/p(x, θ),

one needs to be careful that the gradient of an estimator of the loss need not be an estimator of the gradient of the loss. The gradient of the loss can be expressed as

∇_θ L(θ) = -∫_Ω dx (f(x)²/p(x, θ)²) ∇_θ p(x, θ),

for which an estimator is proposed as

∇_θ L(θ) ≈ -(1/N) Σ_i (f(X_i)²/p(X_i, θ)³) ∇_θ p(X_i, θ),  X_i ∼ p(·, θ).   (2.10)

The authors also observed that other loss functions are possible which share the same global minimum as the variance-based loss: for example, the Kullback-Leibler divergence D_KL between two functions is also minimized when they are equal. Such alternative loss functions are not guaranteed to be variance-optimal for importance sampling, but they prove quite successful in practice. Indeed, for certain processes they turn out to be crucial for optimal performance, as can be seen in section 4.3. After training to minimize the loss estimator of eq. (2.10), the normalizing flow provides a tractable probability distribution function from which to sample points and estimate the integral.
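The gradient estimator can be checked numerically on a toy example. In the sketch below (entirely our own; not from the paper), the model is the one-parameter family p(x, θ) = 1 + θ(x − 1/2) on [0, 1] with f(x) = x, for which the gradient of the variance loss at θ = 0 is analytically −1/12:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1D setup: integrand f(x) = x on [0, 1], model density
# p(x, theta) = 1 + theta * (x - 1/2)  (normalized, positive for |theta| < 2).
def f(x): return x
def p(x, th): return 1.0 + th * (x - 0.5)
def dp_dtheta(x, th): return x - 0.5

theta = 0.0
N = 200_000

# Sampling from p at theta = 0 is just uniform sampling.
x = rng.uniform(0.0, 1.0, N)

# Gradient estimator of the variance loss L(theta) = ∫ f²/p dx:
# the gradient integral -∫ (f²/p²) ∂p/∂θ dx, estimated with points drawn
# from p itself, picks up one extra power of p in the denominator.
grad_est = -np.mean(f(x) ** 2 / p(x, theta) ** 3 * dp_dtheta(x, theta))

# Analytic check at theta = 0: dL/dtheta = -∫ x²(x - 1/2) dx = -1/12
print(grad_est)  # ≈ -0.0833
```

Note that naively differentiating the loss estimate (1/N) Σ f(X_i)²/p(X_i, θ) with the sample held fixed would give a different expression; eq. (2.10) is the estimator of the true gradient integral.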

Concepts and algorithms
In this section we describe the original contributions of this paper. The major conceptual innovation we provide in ZüNIS is a more flexible and data-efficient way of training normalizing flows in the context of importance sampling. This relies on a more rigorous formulation than in the previous literature of the connection between the theoretical expression of ideal loss functions in terms of integrals and their practical realisations as random estimators, combined with an adaptive strategy to switch between these realisations. We describe this improvement in section 3.1. We also give a high-level overview of the organisation of the ZüNIS library, which implements this new training procedure.

Efficient training for importance sampling
In this section, we describe how we train probability distributions within ZüNIS using gradient-based optimizers. While the solution proposed in the original formulation of NIS, defined in eq. (2.10), works and has been successfully used by i-flow, its main drawback is that it samples points from the same distribution that it tries to optimize. As a result, new points X_i must be sampled from p after every gradient step, which is very inefficient for slow integrands. Our solution to this problem is to remember that the loss function is an integral, which can be evaluated by importance sampling using any PDF, not only p. We therefore define an auxiliary probability distribution function q(x), independent of θ, from which we sample to estimate our loss function:

L(θ) ≈ (1/N) Σ_i f(X_i)²/(p(X_i, θ) q(X_i)),  X_i ∼ q.

This is the basis for the general method we use for training probability distributions within ZüNIS, described in algorithm 2. Because the sampling distribution is separated from the model being trained, the same point sample can be reused for multiple training steps, which is not possible when using eq. (2.10). This approach is similar to the "buffered training" used in MadNIS [54]. It is particularly important for high-precision particle physics predictions that involve high-order perturbative calculations or complex detector simulations, because function evaluations can be extremely costly. We show in section 4, in particular in fig. 4, that reusing data indeed has a very significant impact on data efficiency. After training, q is discarded and the integral is estimated from the optimized p.
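The following self-contained sketch (a toy one-parameter model in place of a normalizing flow; all names and numbers are ours) illustrates the key point: a single batch drawn from a fixed auxiliary q can drive many gradient steps on the model parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy integrand and a one-parameter model density on [0, 1]
# (illustrative stand-ins for the integrand and the normalizing flow):
def f(x): return x
def p(x, th): return 1.0 + th * (x - 0.5)        # normalized for |th| < 2
def dp_dtheta(x, th): return x - 0.5

# Sample ONCE from a fixed auxiliary distribution q (here: uniform, q = 1).
N = 50_000
x = rng.uniform(0.0, 1.0, N)
qx = np.ones(N)

def loss(th):
    # Estimator of L(theta) = ∫ f(x)²/p(x, theta) dx with X ~ q
    return np.mean(f(x) ** 2 / (p(x, th) * qx))

theta, lr = 0.0, 0.5
loss_start = loss(theta)
for _ in range(200):                              # many gradient steps
    grad = -np.mean(f(x) ** 2 / (p(x, theta) ** 2 * qx) * dp_dtheta(x, theta))
    theta = np.clip(theta - lr * grad, -1.9, 1.9)  # keep the density positive
print(theta, loss_start, loss(theta))  # theta grows toward 2, loss decreases
```

The loop re-uses the same N function values for all 200 updates; with the original formulation of eq. (2.10), every one of those updates would have required a fresh batch of integrand evaluations.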
The only constraint on q is that it yields a good enough estimate that gradient steps improve the model. Much like in other applications of neural networks, we have found that stochastic gradient descent can yield good results despite noisy loss estimates. We propose three schemes for q:
• A uniform distribution (survey_strategy="flat")
• A frozen copy of the model, which can be updated once in a while (survey_strategy="forward")
• An adaptive scheme starting with a uniform distribution and switching to sampling from a frozen model once the latter is expected to yield a more accurate loss estimate (survey_strategy="adaptive_variance")
An important point to notice is that the original NIS formulation in eq. (2.10) can be restated as a limiting case of our approach. Indeed, if we take q to be a frozen version of the model p(x, θ₀) which we update every time we sample points (setting N = 1 in algorithm 2), the gradient update in line 9 reduces to the estimator of eq. (2.10). The core difference from the "buffered training" approach presented by MadNIS is that we offer an adaptive scheme, which automatically switches between the distributions to sample from. Using a uniform distribution in the beginning ensures that, independent of initialisation, every relevant region of the integration space is covered. However, as training progresses, the frozen model becomes more desirable. In the adaptive training approach, we switch once both losses become comparable. This is a crucial improvement, as it allows more stable training even without prior knowledge of the integrand. Additionally, good coverage in the early stages of training can speed up the training process.
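A minimal sketch of the switching criterion behind adaptive_variance (our own illustration; the actual ZüNIS implementation differs in detail): estimate the statistical error of the loss under both candidate sampling distributions and pick the more accurate one:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Peaked toy integrand
    return np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)

def loss_error(w):
    # Standard error of the loss estimator mean(w)
    return w.std() / np.sqrt(len(w))

N = 20_000

# "flat" survey strategy: uniform proposal, q = 1
x_flat = rng.uniform(0.0, 1.0, N)
w_flat = f(x_flat) ** 2                 # loss integrand f²/(p·q) with p = q = 1

# "forward" survey strategy: frozen model roughly matched to f
# (a Gaussian stands in for the trained flow; boundary effects are negligible here)
x_fwd = rng.normal(0.5, 0.07, N)
q_fwd = np.exp(-0.5 * ((x_fwd - 0.5) / 0.07) ** 2) / (0.07 * np.sqrt(2 * np.pi))
w_fwd = f(x_fwd) ** 2 / q_fwd

# Adaptive rule: sample from whichever distribution currently yields
# the more accurate loss estimate.
strategy = "forward" if loss_error(w_fwd) < loss_error(w_flat) else "flat"
print(strategy)  # here the matched frozen model wins: "forward"
```

Early in training, when the model is no better than uniform, the same comparison keeps the flat strategy, which is what guarantees coverage independent of initialisation.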

The ZüNIS library
On the practical side, ZüNIS is a PyTorch-based library which implements many ideas formulated in previous work, but organizes them as a modular software library with an easy-to-use interface and well-defined building blocks. We believe this structure will help non-specialists use it without understanding all the nuts and bolts, while experts can easily add new features to respond to their needs. The ZüNIS library relies on three layers of abstraction, which steer the different aspects of using normalizing flows to learn probability distributions from un-normalized functions and compute their integrals:
• Flows, which implement a bijective mapping that transforms points and computes the corresponding Jacobian, are described in appendix F.1.
• Trainers, which provide the infrastructure to perform training steps and sample from flow models, are described in appendix F.2.
• Integrators, which use trainers to steer the training of a model and compute integrals, are described in appendix F.3.

Results
In this section, we evaluate ZüNIS on a variety of test functions to assess its performance and compare it to the commonly used VEGAS algorithm [6,65]. Although there have been recent advances in computing both matrix elements and the integral using VEGAS on GPUs [66], we focus on the predominant case of these steps being performed on CPU. We first produce a few low-dimensional examples for illustrative purposes, then move on to integrating parametric functions in various dimensions, and finally evaluate performance on particle scattering matrix elements.

Low-dimensional examples
Let us start by illustrating the effectiveness of ZüNIS in a low-dimensional setting where we can readily visualize results. We define three functions on the 2D unit hypercube which each illustrate a failure mode of VEGAS (see appendix C).
We ran the ZüNIS Integrator with default arguments over ten repetitions for each function and report the performance of the trained integral estimator compared to a flat estimator and to VEGAS in table 1. Overall, ZüNIS Integrators learn to sample from their target function extremely well: we outperform VEGAS by a factor of 100 for the camel and slashed circle functions and by a factor of 30 for the sinusoidal function; VEGAS itself provides no advantage over uniform sampling for the latter two.
We further illustrate the performance of our trained models by comparing the target functions and density histogram for points sampled from the normalizing flows in fig. 1, which shows great qualitative agreement.

Systematic benchmarks
Let us now take a more systematic approach to benchmarking ZüNIS. We compare ZüNIS Integrators against uniform integration and VEGAS using the following metrics: integrand variance (a measure of convergence speed, see section 2.1), unweighting efficiency (a measure of the efficiency of exact sampling with rejection, see appendix A), and wall-clock training and sampling time.
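For reference, the unweighting-efficiency metric can be sketched as follows (a toy uniform sampler on a hypothetical Gaussian integrand; a trained sampler would produce flatter weights and hence a higher efficiency):

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # Peaked toy integrand on [0, 1]
    return np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)

# Weights of points drawn uniformly (p = 1); for a trained sampler the
# weights f/p would be much flatter.
x = rng.uniform(0.0, 1.0, 100_000)
w = f(x)

# Unweighting efficiency of hit-or-miss rejection: the expected accepted
# fraction when accepting each point with probability w / max(w).
efficiency = w.mean() / w.max()
print(efficiency)  # ≈ 0.25 for this peaked integrand
```

The closer the sampling density tracks f, the closer the weights are to constant and the closer this ratio gets to 1.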
ZüNIS improves convergence rate compared to VEGAS For this experiment, we focus on the camel function defined in eq. (C.1) and scan 35 configurations spanning 2 to 32 dimensions and function variances between 10^-2 and 10^2, as shown in table 3. Except in the low-variance limit, ZüNIS can reduce the number of points required to attain a given precision on integral estimates without any parameter tuning, attaining speed-ups of up to ×1000 compared both to uniform sampling and to VEGAS-based importance sampling, as shown in figs. 2a-2b and table 4. Compared with the results reported in Table II of [50], we also see a significant improvement with respect to i-flow: i-flow reports only a moderate reduction of needed function calls of up to one third, whereas we find, in the same range of dimensions and integrand variances, (multiple) order-of-magnitude reductions in variance, which shows the strength of our approach. Unweighting efficiencies are also boosted significantly, although more mildly than variances, as shown in figs. 2c-2d, which we attribute to PDF underestimation in regions with low point density; the nature of the veto algorithm makes it very sensitive to a few badly behaving points in the whole dataset.
ZüNIS is slower than VEGAS ZüNIS does not, however, outclass VEGAS on all metrics: as shown in fig. 3, training is a few hundred times slower than for VEGAS and sampling is 10-50 times slower, even with ZüNIS running on GPUs. This is to be expected given the much higher computational complexity of normalizing flows compared to the VEGAS algorithm. As such, ZüNIS is not a general replacement for VEGAS, but provides a clear advantage for integrating time-intensive functions, where sampling is a negligible overhead, such as precision high-energy-physics simulations.
The new loss function introduced in ZüNIS improves data efficiency We have shown that ZüNIS is a very performant importance sampling and event generation tool providing significant improvements over existing tools, while requiring little fine-tuning from users. Another key result is that the new approach to training introduced in section 3.1 has a large positive impact on performance. Indeed, as can be seen in fig. 4, re-using samples for training over multiple epochs provides a 2- to 10-fold increase in convergence speed, making training much more data-efficient.
For this experiment, we use forward sampling, where the frozen model is used to sample a batch of points which are then used for training over multiple epochs before resampling from an updated frozen model. As a result, we recover the original formulation of NIS in eq. (2.10) when using a single epoch, as shown in eq. (3.2).

MadGraph cross section integrals
Cross sections are integrals of quantum transition matrix elements for a scattering process such as an LHC collision, and express the probability that specific particles are produced. Matrix elements themselves are un-normalized probability distributions for the configuration of the outgoing particles: it is therefore valuable both to integrate them, to know the overall frequency of a given scattering process, and to sample from them, to understand how particles will be spatially distributed as they fly off the collision point.
We study the performance of ZüNIS in comparison to VEGAS on three simple processes at leading order in perturbation theory: e − µ → e − µ via Z, d d → d d via Z, and uc → ucg (with 3-jet cuts based on ∆R); see table 2 and fig. 5. We use the first process as a very easy reference, while the two other, quark-initiated processes are used to illustrate specific points. Indeed, both feature narrow regions of their integration space with large enhancements, due respectively to Z-boson resonances and infrared divergences. We evaluate the matrix elements for these three processes using the Fortran standalone interface of MadGraph5_aMC@NLO [1]. The two hadronic processes are convolved with parton distribution functions from LHAPDF6 [67]. We parametrize phase space (the particle configuration space) using the RAMBO on diet algorithm [68], implemented for PyTorch in TorchPS [69]. Using RAMBO on diet has the advantage that no knowledge about the resonance structure of the integrand is required. However, this comes at the cost of reduced performance, as resonances are not reflected appropriately in the phase-space sampling. It has been shown that this can have a substantial effect on performance [70].
We report benchmark results in fig. 6, in which we trained over 500,000 points for each process using a near-default configuration, scanning only over the variance and Kullback-Leibler losses. As previously observed, little convergence acceleration is achieved for low-variance integrands like e − µ → e − µ, but unweighting still benefits from NIS. The two hadronic processes illustrate typical features of cross sections: training performance is variable, and different processes are optimized by different loss function choices.
The performance on d d → d d shows a nice improvement with ZüNIS, while that on uc → ucg is more limited. This can be understood by comparison to importance sampling (see appendix D.3): it is in fact VEGAS which performs significantly better on uc → ucg than on d d → d d, because the parametrization of RAMBO is based on splitting invariant masses, making them aligned with the enhancements in the ucg phase space and allowing great VEGAS performance. This drives a key conclusion on the potential role of ZüNIS in the HEP simulation landscape: not to replace VEGAS, but to fill in the gaps where it fails due to inadequate parametrizations, as we illustrate here by using the non-multichanneled d d → d d as a proxy for more complex processes.

Conclusion
We have shown that ZüNIS can outperform VEGAS both in terms of integral convergence rate and unweighting efficiency in specific cases, at the cost of a significant increase in training and sampling time, which is an acceptable tradeoff for high-precision HEP computations with high evaluation costs. In this context, the introduction of efficient training is a key element in making the most of the power of neural importance sampling where function evaluation costs are a major concern. While further testing is required to ascertain how far NIS can fill the gaps left by VEGAS for integrating complex functions, there is already good evidence that ZüNIS can provide needed improvements in specific cases. We hope that the publication of a usable toolbox for NIS such as ZüNIS will spur a wider audience within the HEP community to apply the method, so that the exact boundaries of its applicability can be more clearly ascertained.
• benchmark_madgraph/ex_benchmark_dd.sh to generate d d → d d via Z integration data
• benchmark_madgraph/ex_benchmark_ucg.sh to generate uc → ucg integration data
These scripts assume that 5 CUDA GPUs are available and run 5 benchmarks in parallel. If fewer GPUs are available, it is recommended to modify the scripts to run the benchmarking scripts sequentially (by removing the ampersand) and to adapt the --cuda=N option.
to which we refer respectively as the camel, slashed circle and sinusoidal target functions. We set their parameters as follows. We chose these functions because they illustrate different failure modes of the VEGAS algorithm:
• Because of the factorized PDF of VEGAS, the camel function leads to 'phantom peaks' off the diagonal. This problem grows exponentially with the number of dimensions, but can be addressed by a change of variables that aligns the peaks with an axis of integration.
• The sinusoidal function makes it nearly impossible for VEGAS to provide any improvement: the marginal PDF over each variable of integration is nearly constant. Again, this type of issue can be addressed by a change of variables, provided one knows how to perform it.
• The slashed circle function is an example of a hard-for-VEGAS function that cannot be improved significantly by a change of variables. One can instead use multi-channeling, but this requires a lot of prior knowledge and has a computational cost, since each channel is its own integral.

For this process, both the speed-up and the unweighting efficiency ratio clearly favor the variance loss again, which outperforms the D_KL loss in both metrics when multiple epochs are used, as can be seen in figure 10. The richer structure of the integrand reduces the effect of overfitting. Therefore, the performance increases or stays approximately constant, except for the combination of the D_KL loss and the forward survey strategy. For the variance loss, it becomes apparent that the performance of the flat survey strategy increases much more slowly with the number of epochs than that of the alternative strategies. However, for most combinations of losses and survey strategies, VEGAS could be outperformed substantially in terms of integration speed.

D.1 Qualitative examples
An opposite picture is drawn by the process uc → ucg in figure 11, for which, apart from the flat survey strategy, the D_KL loss is in general favored both for integration speed and for the unweighting efficiency ratio. The adaptive survey strategy gives the best results here, although a high number of epochs causes overfitting with respect to the unweighting efficiency. The take-home message of this section is, on the one hand, that the flat survey strategy is in general not recommended. Apart from this, the most important means of improving the quality of importance sampling is to test whether, independent of the survey strategy, the loss function should be chosen differently.

E Exact minimization of the neural importance sampling estimator variance
Let us show that the optimal probability distribution for importance sampling is the function itself, up to normalization. Explicitly, as discussed in section 2.1, we want to find the probability distribution p defined over some domain Ω which minimizes the variance of the estimator f(X)/p(X) for X ∼ p. We showed that this amounts to solving the following optimization problem:

min_p ∫_Ω dx f(x)²/p(x)  subject to  ∫_Ω dx p(x) = 1,  p(x) > 0,

which we can encode using a Lagrange multiplier λ:

L[p] = ∫_Ω dx f(x)²/p(x) + λ (∫_Ω dx p(x) - 1).

We can now solve this problem by finding extrema with functional derivatives:

δL/δp(x) = -f(x)²/p(x)² + λ = 0  ⟹  p(x) = |f(x)| / ∫_Ω dy |f(y)|,

where the normalization condition fixes λ. This extremum is a minimum: the variance has no upper bound, as it can be made arbitrarily large by sending α to 0 in a suitable one-parameter family of trial distributions.
F High-level concepts of the ZüNIS API

F.1 Flows

As defined in section 2.2, a normalizing flow in ZüNIS is a bijective mapping between two d-dimensional spaces, which in practice are always the unit hypercube [0, 1]^d or R^d, with a tractable Jacobian so that it can be used to map probability distributions. To this end, the GeneralFlow interface defines normalizing flows as callable Python objects which act on batches of points drawn from a known PDF p. A batch of points x_i with their PDF values is encoded as a PyTorch Tensor object X in which each row X_i corresponds to a point stacked with its negative log density:

X_i = (x_i, -log p(x_i)).

Encoding point densities by their negative logarithm makes their transformation under normalizing flows additive. Indeed, if we have a mapping Q with Jacobian determinant j, then

-log q(Q(x_i)) = -log p(x_i) + log j(x_i).

Coupling cells are flows defined by an element-wise transform and a mask. All flow models used in ZüNIS in practice are implemented as a sequence of coupling cell transformations acting on a subset of the variables. The abstract class GeneralCouplingCell and its child InvertibleCouplingCell specify the general organization of coupling cells, which need to be instantiated with:
• a dimension d
• a mask, defined as a list of booleans specifying which coordinates are transformed
• a transform that implements the mapping of the non-masked variables
In release v1.0 of ZüNIS, two such classes are provided: PWLinearCoupling and PWQuadraticCoupling, which respectively implement the piecewise-linear and piecewise-quadratic coupling cells proposed in [64]. New coupling cells can easily be implemented, as explained in appendix F.4. Both existing classes rely on dense neural networks for the prediction of the shape of their one-dimensional piecewise-polynomial mappings, whose parameters are set at instantiation.
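The negative-log-density bookkeeping can be sketched without the library (a NumPy stand-in for the Tensor layout; the mapping x → x² is a toy example, not a ZüNIS flow):

```python
import numpy as np

rng = np.random.default_rng(5)

d, n = 2, 4
x = rng.uniform(0.0, 1.0, (n, d))
neg_logp = np.zeros((n, 1))                  # uniform prior on [0,1]^d: -log p = 0
X = np.concatenate([x, neg_logp], axis=1)    # batch layout: shape (n, d + 1)

# A toy element-wise mapping x -> x², with Jacobian determinant prod(2 x_i):
y = X[:, :d] ** 2
log_jac = np.sum(np.log(2.0 * X[:, :d]), axis=1, keepdims=True)

# The density transforms additively in -log space:
Y = np.concatenate([y, X[:, d:] + log_jac], axis=1)
print(Y.shape)  # (4, 3): mapped points with their updated -log density
```

Stacking the density column onto the point batch is what lets a chain of coupling cells accumulate the total Jacobian by simple addition, one cell at a time.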
Here is how one can use a piecewise-linear coupling cell for sampling points. We provide further details on the use and possible parameters of flows in the documentation of ZüNIS: https://zunis.readthedocs.io/en/stable/.
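Since the original code listing is not reproduced in this version, the following framework-free sketch illustrates the mechanics behind a piecewise-linear coupling transform on one coordinate (random bin heights stand in for the neural-network output; this illustrates the construction of [64], not the PWLinearCoupling API itself):

```python
import numpy as np

rng = np.random.default_rng(8)

K = 8
# Un-normalized bin weights; in a real coupling cell these come from a
# neural network conditioned on the masked coordinates.
raw = rng.uniform(0.5, 2.0, K)
heights = raw / raw.mean()           # piecewise-constant PDF on [0, 1], ∫ = 1

def pw_linear_forward(x):
    # Piecewise-linear CDF transform; the Jacobian factor is the PDF
    # height of the bin containing x, available analytically.
    k = np.minimum((x * K).astype(int), K - 1)
    cdf_left = np.concatenate([[0.0], np.cumsum(heights / K)])[k]
    return cdf_left + heights[k] * (x - k / K), heights[k]

x = rng.uniform(0.0, 1.0, 100_000)
y, jac = pw_linear_forward(x)
print(y.min(), y.max())  # the map sends [0, 1] monotonically onto [0, 1]
```

The transform is strictly increasing (all heights are positive), so it is trivially invertible bin by bin, which is what makes sampling and density evaluation equally cheap.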

F.2 Training with the Trainer class
The design of the ZüNIS library intentionally restricts Flow models to being bijective mappings, rather than objects that evaluate and sample from PDFs, so as not to restrict their applicability (see [71] for an example). The specific application in which one uses a normalizing flow, and in our case how precisely one samples from it, is intimately linked to how such a model is trained. As a result, ZüNIS bundles the low-level training tools for Flow models together with sampling tools inside the Trainer classes.
The general blueprint for such classes is defined in the GenericTrainerAPI abstract class, while the main implementation for users is provided as StatefulTrainer. At instantiation, all trainers expect a Flow model and a flow_prior, which samples points from a fixed PDF in latent space. These two elements together define a probability distribution over the target space from which one can sample.
There are two main ways one interacts with Trainers:
• One can sample points from the PDF defined by the model and the prior using the sample_forward method.
• One can train on a pre-sampled batch of points, their sampling PDF and the corresponding function values using train_on_batch(self, x, px, fx).

In practice, we expect that the main way users will use Trainers is for sampling from pretrained models. In the context of particle physics simulations, for example, unweighting is a common task, which aims at sampling exactly from a known function f. A common approach is the hit-or-miss algorithm [72], whose efficiency is improved by sampling from a PDF approaching f. This is how one would use a trainer trained on f:
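The original listing is not reproduced here. As a standalone sketch of the hit-or-miss idea (with a uniform sample_forward stand-in in place of an actual trained model, and an example target f; both are hypothetical):

```python
# Hedged sketch of hit-or-miss unweighting on top of an importance sampler;
# sample_forward below is a uniform stand-in, not a trained ZüNIS model.
import math
import random

random.seed(0)

def f(x):
    # example target function to unweight (bounded by 1 on [0, 1])
    return math.exp(-8.0 * (x - 0.5) ** 2)

def sample_forward(n):
    # stand-in for a trainer's sample_forward: points with their PDF values
    return [(random.random(), 1.0) for _ in range(n)]

def hit_or_miss(n, w_max):
    accepted = []
    for x, px in sample_forward(n):
        w = f(x) / px              # importance weight
        if random.random() < w / w_max:
            accepted.append(x)     # accepted points follow f up to normalization
    return accepted

pts = hit_or_miss(10000, w_max=1.0)
```

The closer the sampling PDF is to f, the closer the weights are to constant and the higher the acceptance rate, which is exactly why sampling from a trained model improves unweighting efficiency.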

F.3 Integrating with the Integrator class
Integrators are intended as the main way for standard users to interact with ZüNIS. They provide a high-level interface to the functionalities of the library and only optionally require users to know what the lower levels of abstraction entail and what their options correspond to. From a practical point of view, the main interface of ZüNIS for integration is implemented as Integrator, a factory function that instantiates the appropriate integrator class based on a choice of options.
All integrators follow the same pattern, defined in the SurveyRefineIntegratorAPI and BaseIntegrator abstract classes. Integration starts with a survey phase, in which the integrator optimizes the way it samples points, followed by a refine phase, in which it computes the integral using its learned sampler. Each phase proceeds through a number of steps, which can be set at instantiation or when integrating:

# Setting values at instantiation time
integrator = Integrator(d=d, f=f, n_iter_survey=3, n_iter_refine=5)
# Override at integration time
integral_data = integrator.integrate(n_survey=10, n_refine=10)

For both the survey and the refine phases, using multiple steps is useful to monitor the stability of the training and of the integration process: if one step is not within a few standard deviations of the next, either the sampling statistics are too low, or something is going wrong.
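The per-step estimates of the refine phase must then be combined into a single integral estimate. A minimal sketch of one common combination rule, inverse-variance weighting (not necessarily the library's exact internal choice):

```python
# Sketch of combining per-step Monte Carlo estimates by inverse-variance
# weighting (a common choice; the library's internal combination may differ).
def combine(estimates):
    """estimates: list of (mean, variance_of_mean) pairs, one per refine step."""
    weights = [1.0 / v for _, v in estimates]
    total = sum(weights)
    mean = sum(w * m for (m, _), w in zip(estimates, weights)) / total
    return mean, 1.0 / total   # combined estimate and its variance

value, var = combine([(1.02, 0.04), (0.98, 0.01)])
```

Under this rule, a step whose estimate is many standard deviations away from the others signals exactly the kind of instability described above, since it would otherwise bias the weighted mean.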

G Hardware setup details
The computations presented in this work were performed on a computing cluster using an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz with 376 GB of RAM. Computations which could be performed on the GPU were run on a GeForce RTX 2080 with 12 GB of memory, using CUDA 11.0.

Figure 1 :
Figure 1: Comparison between target functions and point sampling densities for (1a) the camel function, (1b) the sinusoidal function and (1c) the slashed circle function. Supplementary fig. 7 shows how points are mapped from latent to target space.

Figure 2 :
Figure 2: Benchmarking ZüNIS against uniform sampling and VEGAS with default settings. In (2a-2b), we show the sampling speed-up (ratio of integrand variances) as a function of the relative standard deviation of the integrand, while we show the unweighting speed-up (ratio of unweighting efficiencies) in (2c-2d).

Figure 3 :
Figure 3: Comparison of the training and sampling speed of ZüNIS and VEGAS. As can be expected, ZüNIS is much slower than VEGAS for both training and sampling, although larger batch sizes can better leverage the power of hardware accelerators.

Figure 4 :
Figure 4: Figure 4a: Effect of repeatedly training on the same sample of points over multiple epochs. For all settings, there is a large improvement when going from one to moderate epoch counts, with a peak around 5-10. Larger numbers of epochs lead to overfitting, which impacts performance negatively. Figure 4b: Comparison between optimal data reuse (5 epochs) and the original NIS algorithm (1 epoch).

Figure 5 :
Figure 5: Sample Feynman diagrams for e−µ → e−µ via Z, dd → dd via Z and uc → ucg.

Figure 6 :
Figure 6: Average performance of ZüNIS over 20 runs relative to VEGAS, measured by the relative speed-up in fig. 6a and by the relative unweighting efficiency in fig. 6b.

Figure 7 :
Figure 7: Mapping between the uniform point density in the latent space and the target distribution for the camel function, the sinusoidal function and the slashed circle function. Points are colored based on their position in latent space.

Figure 9 :
Figure 9: Median of the performance of ZüNIS over 20 runs relative to VEGAS for the process e−µ → e−µ via Z, depending on the loss function, the survey strategy and the number of epochs, measured by the relative speed-up in fig. 9a and by the relative unweighting efficiency in fig. 9b.

Figure 10 :
Figure 10: Median of the performance of ZüNIS over 20 runs relative to VEGAS for the process dd → dd via Z, depending on the loss function, the survey strategy and the number of epochs, measured by the relative speed-up in fig. 10a and by the relative unweighting efficiency in fig. 10b.

Figure 11 :
Figure 11: Median of the performance of ZüNIS over 20 runs relative to VEGAS for the process uc → ucg, depending on the loss function, the survey strategy and the number of epochs, measured by the relative speed-up in fig. 11a and by the relative unweighting efficiency in fig. 11b.

F.1 Normalizing flows with Flow classes
Flows map batches of points and their densities. The ZüNIS library implements normalizing flows by specifying a general interface defined as a Python abstract class: GeneralFlow. All flow models in ZüNIS are child classes of GeneralFlow, which itself inherits from the PyTorch nn.Module interface.

Table 2 :
Comparison of the three test processes: e−µ → e−µ via Z, dd → dd via Z and uc → ucg.

Table 3 :
Setup of the 35 different camel functions considered to benchmark ZüNIS. We scan over function relative standard deviations, which correspond to different σ parameters for each dimension (eq. (C.1)). We provide the corresponding width of a 1D Gaussian (σ_1d) with the same variance for reference.
Furthermore, this extremum is certainly a minimum because the loss function is positive and unbounded. Indeed, if we separate Ω into two disjoint measurable subdomains Ω_1 and Ω_2, and define p_α(x) such that points are drawn uniformly over Ω_1 with probability α and uniformly over Ω_2 with probability 1 − α, then the resulting loss function would be
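The explicit form of that loss did not survive extraction. For concreteness, here is a hedged reconstruction assuming the variance loss L(p) = ∫_Ω f(x)²/p(x) dx (the elided equation may differ if another loss was used):

```latex
% Sketch assuming the variance loss L(p) = \int_\Omega f(x)^2 / p(x)\, dx;
% hypothetical reconstruction of the elided equation.
L(p_\alpha) = \frac{|\Omega_1|}{\alpha} \int_{\Omega_1} f(x)^2\, \mathrm{d}x
            + \frac{|\Omega_2|}{1-\alpha} \int_{\Omega_2} f(x)^2\, \mathrm{d}x ,
```

which diverges as α → 0 or α → 1, so a loss of this form is indeed unbounded, as claimed.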