Parallel adaptive Monte Carlo integration with the event generator WHIZARD
Abstract
We describe a new parallel approach to the evaluation of phase space for Monte-Carlo event generation, implemented within the framework of the Whizard package. The program realizes a twofold self-adaptive multi-channel parameterization of phase space and makes use of the standard OpenMP and MPI protocols for parallelization. The modern MPI-3 feature of asynchronous communication is an essential ingredient of the computing model. Parallel numerical evaluation applies both to phase-space integration and to event generation, thus covering the computing-intensive parts of physics simulation for a realistic collider environment.
1 Introduction
Monte-Carlo event generators are an indispensable tool of elementary particle physics. Comparing collider data with a theoretical model is greatly facilitated if there is a sample of simulated events that represents the theoretical prediction, and can directly be compared to the event sample from the particle detector. The simulation requires two steps: the generation of particle-level events, resulting in a set of particle species and momentum four-vectors, and the simulation of detector response. To generate particle-level events, a generator computes partonic observables and partonic event samples which then are dressed by parton shower, hadronization, and hadronic decays. In this paper, we focus on the efficient computation of partonic observables and events.
Hard scattering processes involve Standard-Model (SM) elementary particles – quarks, gluons, leptons, \(\hbox {W}^{\pm }\), Z, and Higgs bosons, and photons. The large number and complexity of scattering events recorded at detectors such as ATLAS or CMS call for a matching computing power in simulation. Parallel evaluation that makes maximum use of available resources is an obvious solution.
The dominant elementary processes at the Large Hadron Collider (LHC) can be described as \(2\rightarrow 2\) or \(2\rightarrow 3\) particle production, where resonances in the final state subsequently decay, and additional jets can be accounted for by the parton shower. Cross sections and phase-space distributions are available as analytic expressions. Since distinct events are physically independent of each other, parallel evaluation is done trivially by generating independent event samples on separate processors. In such a situation, a parallel simulation run on a multi-core or multi-processor system can operate close to optimal efficiency.
However, the LHC does probe rarer partonic processes which are of the type \(2\rightarrow n\) where \(n\ge 4\). There are increasing demands on the precision in data analysis at the LHC and, furthermore, at the planned future high-energy and high-luminosity lepton and hadron colliders. This forces the simulation to go beyond the leading order in perturbation theory, beyond the separation of production and decay, and beyond simple approximations for radiation. For instance, in processes such as top-quark pair production or vector-boson scattering, the simulation must handle elementary \(n=6,8,10\) processes.
Closed analytical expressions for phase-space distributions with high multiplicity exist for uniform phase-space population algorithms [1, 2]. However, for arbitrary phase-space distributions with high multiplicity, such closed analytical expressions are not available. A standard ansatz is to factorize the phase space into Lorentz-invariant two-body phase spaces, which are individually parameterized and kinematically linked by a chain of Lorentz boosts.
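To make the two-body building block of this factorization concrete, the following Python toy code (ours, purely illustrative, not Whizard code) generates an isotropic two-body decay in the parent rest frame and provides the Lorentz boost that kinematically links successive decays in such a chain:

```python
import math
import random

def kallen(x, y, z):
    # Kallen triangle function lambda(x, y, z)
    return x*x + y*y + z*z - 2.0*(x*y + y*z + z*x)

def two_body_decay(M, m1, m2, rng=random):
    """Isotropic two-body decay M -> m1 m2 in the parent rest frame.
    Returns two four-momenta (E, px, py, pz); momentum conservation
    and on-shell conditions hold by construction."""
    p_star = math.sqrt(kallen(M*M, m1*m1, m2*m2)) / (2.0 * M)
    cos_th = rng.uniform(-1.0, 1.0)
    sin_th = math.sqrt(1.0 - cos_th*cos_th)
    phi = rng.uniform(0.0, 2.0*math.pi)
    px = p_star * sin_th * math.cos(phi)
    py = p_star * sin_th * math.sin(phi)
    pz = p_star * cos_th
    e1 = math.sqrt(m1*m1 + p_star*p_star)
    e2 = math.sqrt(m2*m2 + p_star*p_star)
    return (e1, px, py, pz), (e2, -px, -py, -pz)

def boost(p, b):
    """Boost four-momentum p = (E, px, py, pz) by velocity vector b."""
    bx, by, bz = b
    b2 = bx*bx + by*by + bz*bz
    if b2 <= 0.0:
        return p
    gamma = 1.0 / math.sqrt(1.0 - b2)
    e, px, py, pz = p
    bp = bx*px + by*py + bz*pz
    k = (gamma - 1.0) * bp / b2 + gamma * e
    return (gamma * (e + bp), px + k*bx, py + k*by, pz + k*bz)
```

Chaining `two_body_decay` over intermediate invariant masses and boosting each daughter pair by the velocity of its parent reproduces the factorized parameterization sketched above.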
Event generators for particle physics processes therefore rely on statistical methods both for event generation and for the preceding time-consuming numerical integration of the phase space. All major codes involve a Monte-Carlo rejection algorithm for generating unweighted event samples. This requires knowledge of the total cross section and of a reference limiting distribution over the whole phase space. Calculating those quantities relies again on Monte-Carlo methods, which typically involve an adaptive iteration algorithm. A large fraction of the total computing time cannot be trivially parallelized since it involves significant communication. Nevertheless, some of the main MC event generators have implemented parallelization features, e.g., Sherpa [3], MG5_aMC@NLO [4] as mentioned in [5], MATRIX [6] and MCFM [7].
In the current paper, we describe a new approach to efficient parallelization for adaptive Monte Carlo integration and event generation, implemented within the Whizard [8] Monte-Carlo integration and event-generation program. The approach combines independent evaluation on separate processing units with asynchronous communication via MPI 3.1, and with internally parallelized loops distributed on multiple cores via OpenMP. In Sect. 2, we give an overview of the workflow of the Whizard event generation framework with the computing-intensive tasks that it has to perform. Sect. 3 describes the actual MC integration and event generation algorithm. The parallelization methods and necessary modifications to the algorithm are detailed in Sect. 4. This section also shows our study on the achievable gain in efficiency for typical applications in high-energy physics. Finally, we conclude in Sect. 5.
2 The WHIZARD multipurpose event generator framework
We will demonstrate our parallelization algorithms within the Whizard framework [8]. Whizard is a multipurpose Monte-Carlo integration and event generator program. In this section, we describe the computing-intensive algorithms and tasks which are potential targets of improvement via parallel evaluation. In order to make the section self-contained, we also give an overview of the capabilities of Whizard.
The program covers the complete workflow of particle-physics calculations, from setting up a model Lagrangian to generating unweighted hadronic event samples. To this end, it combines internal algorithms with external packages. Physics models are available either as internal model descriptions or via interfaces to external packages, e.g. for FeynRules [9]. For any model, scattering amplitudes are automatically constructed and accessed for numerical evaluation via the included matrix-element generator O’Mega [10, 11, 12, 13, 14, 15]. The calculation of partonic cross sections, observables, and events is handled within the program itself, as detailed below. Generated events are showered by internal routines [16], showered and hadronized by standard interfaces to external programs, or by means of a dedicated interface to Pythia [17, 18]. For next-to-leading-order (NLO) calculations, Whizard takes virtual and color/charge/spin-correlated matrix elements from the one-loop providers OpenLoops [19, 20], GoSam [21, 22], or RECOLA [23], and handles subtraction within the Frixione–Kunszt–Signer scheme (FKS) [24, 25, 26, 27]. Selected Whizard results at NLO, some of them obtained using parallel evaluation as presented in this paper, can be found in Refs. [28, 29, 30, 31, 32]. In a typical run, the program performs the following steps:
 1.
Construct, compile, and link matrix element code, as far as necessary for evaluation.
 2.
Compute an approximation to the phase-space integral for given input data (collider energy, etc.), using the current state of the phase-space parameterization (integration grids, cf. below).
 3.
 (a)
Optimize the integration-grid data based on the information gathered in step 2, discard all previous integration results, and repeat from step 2 with the new parameterization, or
 (b)
Record the integration result and statistical uncertainty from 2 as final. If requested, proceed with step 4.
 4.
Take the results and grid data from 3b to generate and store a partonic event sample, optionally transformed into a physical (hadronic) event sample in some public event format.
We note that there is no technical distinction between warm-up and genuine integration phases, but the workflow does separate the optional final step of generating an event sample. Anticipating the detailed discussion of parallel evaluation, we note that steps 2 and 3a are both rather nontrivial if distributed among separate workers. The multi-channel approach of parameterizing phase space, combined with the Vamp algorithm, calls for correlating the selection of phase-space points for sampling, parallelizing single matrix-element evaluations, communicating Jacobians and other information between workers, and reducing remotely accumulated information during the adaptation step 3a. This problem and our approach to an efficient solution constitute the main subject of the current paper.
In concrete terms, the core part of Whizard is the phase-space parameterization, the computation of values for integrals and the distribution of variance, and the iterative optimization of the parameterization. The phase space is the manifold of kinematically allowed energy-momentum configurations of the external particles in an elementary process. User-defined cuts may additionally constrain phase space in rather arbitrary ways. Whizard specifically allows for arbitrary phase-space cuts that can be steered from the input file via its scripting language Sindarin, without any (re)compilation of code. The program determines a set of phase-space parameterizations (called channels), i.e., bijective mappings of a subset of the unit d-dimensional hypercube onto the phase-space manifold. For the processes of interest, d lies between 2 and some 25 dimensions. Note that the parameterization of nontrivial beam structure in the form of parton distribution functions, beam spectra, electron structure functions for initial-state radiation, the effective photon approximation, etc., provides additional dimensions to the numerical integration. The actual integrand, i.e., the square of a transition matrix element evaluated at the required order in perturbation theory, is defined as a function on this cut phase space. It typically contains sharp peaks (resonances) and develops poles in the integration variables just beyond the boundaries. We collectively denote those as “singularities” in a slight abuse of language. In effect, the numerical value of the integrand varies over many orders of magnitude.
For an efficient integration, it is essential that the program generates multiple phase-space channels for the same process. Each channel has the property that a particular subset of the singularities of the integrand maps to a slowly varying distribution (in the ideal case a constant) along a coordinate axis of the phase-space manifold with that specific mapping. The set of channels has to be large enough that it covers all numerically relevant singularities. This is not a well-defined requirement, and Whizard contains a heuristic algorithm that determines this set. The number of channels may range from one (e.g. \(\hbox {e}^{+}\hbox {e}^{-} \rightarrow \mu ^{+}\mu ^{-}\) at \(\sqrt{s} = 40\) GeV, no beam structure) to some \(10^6\) (e.g. vector-boson scattering at the LHC, or BSM processes at the LHC) for typical applications.
Finally, for completeness, we note that Whizard contains additional modules that implement other relevant physical effects, e.g., incoming-beam structure, polarization, factorizing processes into production and decay, and modules that prepare events for actual physics studies and analyses. To convert partonic events into hadronic events, the program provides its own algorithms together with, or as an alternative to, external programs such as Pythia. Data visualization and analysis can be performed by its own routines or by externally operating on event samples, available in various formats.
Before we discuss the parallelization of the phase-space integration, we explain in detail in the next section, Sect. 3, how the MC integration of Whizard works.
3 The MC integrator of WHIZARD: the VAMP algorithm
The implementation of the integration and event-generation modules of Whizard is based on the Vegas algorithm [33, 34]. Whizard combines the Vegas method of adaptive Monte-Carlo integration with the principle of multi-channel integration [35]. The basic algorithm and a sample implementation have been published as Vamp (Vegas AMPlified) in [36]. In this section, in order to prepare the discussion of our parallelized reimplementation of the Vamp algorithm, we discuss in detail the algorithm and its application to phase-space sampling within Whizard. Our parallelized implementation is then presented in Sect. 4.
3.1 Integration by Monte-Carlo sampling
This algorithm has become a standard choice for particle-physics computations, because (i) the error scaling law \(\propto 1/\sqrt{N}\) turns out to be superior to any other useful algorithm for large dimensionality d of the integral; (ii) by projecting on observables \(\mathcal{O}(\phi (x))\), any integrated or differential observable can be evaluated from the same event sample; and (iii) an event sample can be unweighted to accurately simulate the event sample resulting from an actual experiment. The unweighting efficiency \(\epsilon \) as in Eq. (2) again depends on the behavior of the effective integrand.
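As a baseline for the discussion below, the plain Monte-Carlo estimator and its \(1/\sqrt{N}\) error estimate can be sketched in a few lines of Python (illustrative names, not Whizard code):

```python
import math
import random

def mc_integrate(f, d, n, rng=random):
    """Plain Monte Carlo estimate of the integral of f over the unit
    d-dimensional hypercube, together with its statistical error."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        fx = f(x)
        total += fx
        total_sq += fx * fx
    mean = total / n
    # variance of the mean: the error scales like 1/sqrt(N)
    var = (total_sq / n - mean * mean) / (n - 1)
    return mean, math.sqrt(max(var, 0.0))
```

For example, integrating \(f(x)=x_1+x_2\) over the unit square returns an estimate close to 1 with an error of a few per mille for \(N=2\times 10^4\) points.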
The optimal values \(a=0\) and \(\epsilon =1\) are reached if \(f(\phi (x))\,\phi '(x)\,\rho _\phi (x)\equiv 1\). In one dimension, this is possible by adjusting the mapping \(\phi (x)\) accordingly. The Jacobian \(\phi '\) should cancel the variance of the integrand f, and thus will assume the shape of this function. In more than one dimension, such mappings are not available in closed form, in general.
In calculations in particle-physics perturbation theory, the integrand f is most efficiently derived recursively, e.g. from one-particle off-shell wave functions as in [10]. The poles in these recursive structures are the resonant Feynman propagators. In particular, if for a simple process only a single propagator contributes, there are standard mappings \(\phi \) such that the mapped integrand factorizes into one-dimensional functions, and the dominant singularities are canceled. For this reason, the phase-space channels of Whizard are constructed from the propagator structure of the most relevant contributions to the squared amplitude. If several such contributions exhibit a mutually incompatible propagator structure, mappings that cancel the singularities are available only for very specific cases such as massless QCD radiation, e.g. [38, 39]. In any case, we have to deal with some remaining variance that is not accounted for by standard mappings, such as polynomial factors in the numerator, higher-order contributions, or user-defined cuts which do not depend on the integrand f.
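A minimal sketch of such a standard mapping, assuming a single Breit-Wigner resonance in an invariant mass squared s: the usual arctangent substitution makes the product of Jacobian and propagator exactly constant (Python, illustrative names of our own):

```python
import math

def bw(s, M, G):
    """Squared Breit-Wigner propagator |1 / (s - M^2 + i M G)|^2."""
    return 1.0 / ((s - M*M)**2 + (M*G)**2)

def bw_map(x, M, G, s_min, s_max):
    """Map x in (0, 1) to s in (s_min, s_max) via the substitution
    s = M^2 + M*G*tan(t); the Jacobian ds/dx cancels the peak, so
    jac * bw(s) is constant in x."""
    t_min = math.atan((s_min - M*M) / (M*G))
    t_max = math.atan((s_max - M*M) / (M*G))
    t = t_min + x * (t_max - t_min)
    s = M*M + M*G * math.tan(t)
    jac = (t_max - t_min) * M*G / math.cos(t)**2
    return s, jac
```

Sampling x uniformly and weighting with `jac` then integrates the resonance with essentially zero variance; residual variance only arises from whatever multiplies the propagator in the full integrand.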
3.2 The VEGAS algorithm: importance sampling
The Vegas algorithm [34] addresses the frequent situation that the effective integrand \(f_\phi \) is not far from factorizable form, but the capabilities of finding an optimal mapping \(\phi \) in closed form have been exhausted. In that case, it is possible to construct a factorizable step mapping that improves accuracy and efficiency beyond the chosen \(\phi (x)\).
Several implementations of Vegas exist, e.g. Lepage’s FORTRAN77 implementation [34] or the C implementation in the GNU Scientific Library [40]. Here, we refer to the Vamp integration package as it is contained in Whizard. It provides an independent implementation which realizes the same basic algorithm and combines it with multi-channel integration, as explained below in Sect. 3.4.
This is an optimization problem with a number of \(\smash {\sum _{k=1}^d (n_k-1)}\) free parameters,^{1} together with a specific strategy for optimization. If successful, the numerical variance of the ratio \(f_\phi (x)/g(x)\) is reduced after each adaptation of g. In fact, the shape of g(x) will eventually resemble a histogrammed version of \(f_\phi (x)\), with a saw-like profile along each integration dimension. Bins will narrow along slopes of high variation in \(f_\phi \), such that the ratio \(f_\phi /g\) becomes bounded. The existence of such a bound is essential for unweighting events, since the unweighting efficiency \(\epsilon \) scales with the absolute maximum of \(f_\phi (x)/g(x)\) within the integration domain. Clearly, the value of this maximum can only be determined with some uncertainty since it relies on the finite sample \(\{x_i\}\). The saw-like shape puts further limits on the achievable efficiency \(\epsilon \). Roughly speaking, each direction with significant variation in \(f_\phi \) reduces \(\epsilon \) by a factor of two.
The set of updated parameters \(\varDelta x_{kj_k}\) defines the integration grid for the next iteration. In the particle-physics applications covered by Whizard we have \(d\lesssim 30\), and the number of bins is typically chosen as \(n_k\lesssim 30\), all \(n_k\) equal, so a single grid consists of between a few and \(10^3\) parameters subject to adaptation. In practice, the optimization strategy turns out to be rather successful. Adapting the grid a few times does actually improve the accuracy a and the efficiency \(\epsilon \) significantly. Only the grids from later passes are used for calculating observables and for event generation. Clearly, the achievable results are limited by the degree of factorizability of the integrand.
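A one-dimensional sketch may help to visualize the rebinning step: bin edges are moved such that each new bin carries an equal share of the accumulated importance weights. This is a bare-bones version of the Vegas prescription (the actual algorithm additionally damps and smooths the per-bin weights):

```python
def refine_grid(edges, weights):
    """One Vegas-style refinement step in one dimension.  `edges` are
    the n+1 current bin edges, `weights` the accumulated (positive)
    importance per bin; the new edges split the total weight equally,
    so bins narrow where the weight is concentrated."""
    n = len(weights)
    total = sum(weights)
    cum = [0.0]
    for w in weights:
        cum.append(cum[-1] + w)
    new_edges = [edges[0]]
    j = 0
    for k in range(1, n):
        target = k * total / n
        while cum[j + 1] < target:
            j += 1
        # linear interpolation inside old bin j
        frac = (target - cum[j]) / (cum[j + 1] - cum[j])
        new_edges.append(edges[j] + frac * (edges[j + 1] - edges[j]))
    new_edges.append(edges[-1])
    return new_edges
```

Starting from equal bins with weights concentrated in the first bin, the refined grid places half of its bins inside that region, which is exactly the narrowing behavior described above.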
3.3 The VEGAS algorithm: (pseudo)stratified sampling
The importance-sampling method guarantees, for a fixed grid, that the estimator \(E_N\) for an integrand approaches the exact integral for \(N\rightarrow \infty \). Likewise, a simulated unweighted event sample statistically approaches an actual observed event sample, if the integrand represents the actual matrix element.
However, the statistical distribution of the numbers \(x_i\) is a rather poor choice for an accurate estimate of the integral. In fact, in one dimension a simple equidistant bin-midpoint choice for \(x_i\) typically provides much better convergence than the \(1/\sqrt{N}\) of the random distribution. A reason for nevertheless choosing the Monte-Carlo over the midpoint algorithm is the fact that for \(n_{\text {mid}}\) bins in d dimensions, the total number of cells is \(n_{\text {mid}}^d\), which easily exceeds realistic values for N: for instance, \(n_{\text {mid}}=20\) and \(d=10\) would imply \(n_{\text {mid}}^d=10^{13}\), but evaluating the integrand at much more than \(10^7\) points may already become infeasible.
The stratified sampling approach aims at combining the advantages of both methods. Binning along all coordinate axes produces \(n^d\) cells. Within each cell, the integrand is evaluated at precisely s distinct points, \(s\ge 2\). We may choose n such that the total number of calls, \(N=s \cdot n^d\), stays within limits feasible for a realistic sampling. For instance, for \(s=2\), \(d=10\), and limiting the number of calls to \(N\approx 10^7\), we obtain \(n=4\cdots 5\). Within each cell, the points are randomly chosen, according to a uniform distribution. Again, the Vegas algorithm iteratively adapts the binning in several passes, and thus improves the final accuracy.
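The stratified estimate itself is easily sketched (illustrative Python; the cell loop uses `itertools.product`, and all names are ours):

```python
import itertools
import random

def stratified_integrate(f, d, n, s, rng=random):
    """Stratified MC over the unit d-cube: n bins per axis give n**d
    cells; f is evaluated at s random points inside each cell, i.e.
    N = s * n**d calls in total."""
    total = 0.0
    n_cells = 0
    for cell in itertools.product(range(n), repeat=d):
        for _ in range(s):
            # uniform point inside this cell
            x = [(c + rng.random()) / n for c in cell]
            total += f(x)
        n_cells += 1
    return total / (n_cells * s)
```

Compared to plain importance sampling with the same N, the within-cell fluctuations are the only source of variance, which is why stratification converges noticeably faster for smooth integrands.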
For the problems addressed by Whizard, pure stratified sampling is not necessarily an optimal approach. The structure of typical integrands cannot be approximated well by the probability distribution g(x) if the number of bins per dimension n is small. To allow for larger n despite the finite total number of calls, the pseudo-stratified approach applies stratification not in x space, which is binned into \(n_x^d\) cells with \(n_x\lesssim 20\), but in r space, which was not binned originally. The \(n_r\) bins in r space are not adapted, so this distribution stays uniform. In essence, the algorithm scans over all \(n_r^d\) cells in r space, selects two points randomly within each r cell, and then maps those points to points in x space, where they end up in any of the \(n_x^d\) cells. The overall probability distribution in x is still g(x) as given by Eq. (12), but the distribution has reduced randomness in it and thus yields a more accurate integral estimate.
Regardless of the integration algorithm, simulation of unweighted events can only proceed via strict importance sampling. Quantum mechanics dictates that events have to be distributed statistically independent of each other over the complete phase space. Therefore, Whizard separates its workflow into integration passes which adapt integration grids and evaluate the integral, and a subsequent simulation run which produces an event sample. The integration passes may use either method, while event generation uses importance sampling and, optionally, unweighting the generated events. In practice, using grids which have been optimized by stratified sampling is no disadvantage for subsequent importance sampling since both sampling methods lead to similarly shaped grids.
3.4 Multi-channel integration
The adaptive Monte-Carlo integration algorithms described above do not yield satisfactory results if the effective integrand \(f_\phi \) fails to factorize for the phase-space channel \(\phi \). In nontrivial particle-physics processes, many different Feynman graphs, possibly with narrow resonances, including mutual interference, contribute to the integrand.
Regarding particle-physics applications, a straightforward translation of (archetypical representatives of) Feynman graphs into integration channels can result in large values for the number of channels K, of order \(10^5\) or more. In fact, if the number of channels increases proportional to the number of Feynman graphs, it scales factorially with the number of elementary particles in the process. This is to be confronted with the complexity of the transition-matrix calculation, where recursive evaluation results in a power law. Applied naively, multi-channel phase-space sampling can consume the dominant fraction of computing time. Furthermore, if the multi-channel approach is combined with adaptive binning (see below), the number of channels is multiplied by the number of grid parameters, so the total number of parameters grows even more quickly. For these reasons, Whizard contains a heuristic algorithm that selects a smaller set of presumably dominant channels for the multi-channel integration. Since all parameterizations are asymptotically equivalent to each other regarding importance sampling, any such choice does not affect the limit \(E_N[f]\rightarrow I_\varOmega [f]\). It does affect the variance and can thus speed up – or slow down – the convergence of the integral estimates for \(N\rightarrow \infty \) and for iterative weight adaptation.
3.5 Doubly adaptive multi-channel integration: VAMP
The actual integration algorithm is organized as follows. Initially, all channel weights and bin widths are set equal to each other. There is a sequence of iterations where each step consists of first generating a sample of N events, then adapting the free parameters. This adaptation may update either the channel weights via Eq. (24) or the grids via Eq. (14), or both, depending on user settings. The event sample is divided among the selected channels based on event numbers \(N_c\). For each channel, the integration hypercube in r is scanned by cells in terms of stratified sampling, or sampled uniformly (importance sampling). For each point \(r_c\), we compute the mapped point \(x_c\), the distribution value \(g_c(x_c)\), and the phase-space density \(\rho _c(x_c)\) at this point. Given the fixed mapping \(\phi _c\), we compute the phase-space point p and the Jacobian factor \(\phi '_c\). This allows us to evaluate the integrand f(p). Using p, we scan over all other channels \(c'\ne c\) and invert the mappings to obtain \(\phi '_{c'}\), \(x_{c'}\), \(g_{c'}(x_{c'})\), and \(\rho _{c'}(x_{c'})\). Combining everything, we arrive at the effective weight \(w=f_c^g(x_c)\) for this event. Accumulating events and evaluating mean, variance, and other quantities then proceeds as usual. Finally, we may combine one or more final iterations to obtain the best estimate for the integral, together with the corresponding error estimate.
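The multi-channel estimator underlying these steps can be condensed into a short sketch: a channel c is selected with probability \(\alpha _c\), a point is drawn from its density \(g_c\), and the event weight uses the combined density \(g(x)=\sum _c \alpha _c\, g_c(x)\). The following one-dimensional toy version with two hypothetical channels (flat and linearly rising) illustrates the principle; it is not the Whizard implementation:

```python
import math
import random

def multichannel_estimate(f, channels, alpha, n, rng=random):
    """Multi-channel importance sampling on (0, 1): `channels` is a
    list of (sample, density) pairs, each density normalized to one;
    the weight uses g(x) = sum_c alpha_c * g_c(x)."""
    total = 0.0
    for _ in range(n):
        # select a channel c with probability alpha[c]
        u, c = rng.random(), 0
        while u > alpha[c] and c < len(alpha) - 1:
            u -= alpha[c]
            c += 1
        sample_c, _ = channels[c]
        x = sample_c(rng)
        g = sum(a * dens(x) for a, (_, dens) in zip(alpha, channels))
        total += f(x) / g
    return total / n

# two hypothetical channels on (0, 1): flat, and linearly rising ~ 2x
channels = [
    (lambda rng: rng.random(),            lambda x: 1.0),
    (lambda rng: math.sqrt(rng.random()), lambda x: 2.0 * x),
]
```

Whichever channel generated the point, the weight divides by the full combined density, which is exactly why the inverse mappings for all other channels must be computed for every event.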
If an (optionally unweighted) event sample is requested, Whizard will take the last grid from the iterations and sample further events, using the same multichannel formulas, with fixed parameters, but reverting to importance sampling over the complete phase space. The channel selection is then randomized over the channel weights \(\alpha _c\), allowing for an arbitrary number of simulated physical events.
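The unweighting step mentioned here is plain hit-or-miss rejection: each weighted event survives with probability \(w/w_{\max }\), so the kept events follow the target distribution and the efficiency is \(\epsilon = \langle w\rangle /w_{\max }\). A minimal sketch (ours, not Whizard code):

```python
import random

def unweight(weighted_events, w_max, rng=random):
    """Hit-or-miss unweighting: keep each (event, weight) pair with
    probability weight / w_max; the survivors follow the target
    distribution, with efficiency <w> / w_max."""
    kept = []
    for event, w in weighted_events:
        if rng.random() * w_max < w:
            kept.append(event)
    return kept
```

This makes explicit why the maximum weight matters: underestimating \(w_{\max }\) biases the tail of the distribution, while a generous \(w_{\max }\) costs efficiency.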
4 Parallelization of the WHIZARD workflow
In this section we discuss the parallelization of the Whizard integration and event-generation algorithms. We start with a short definition of observables and timings that allow us to quantify the gain of a parallelization algorithm in Sect. 4.1. Then, in Sect. 4.2, we discuss the computing tasks for a typical integration and event-generation run with the Whizard program, while in Sect. 4.3 we list possible computing frameworks for our parallelization tasks and what we chose to implement in Whizard. Random numbers have to be set up with great care for parallel computations, as we point out in Sect. 4.4. The Whizard algorithm for parallelized integration and event generation is presented in full detail in Sect. 4.5. Finally, in Sect. 4.6, we introduce an alternative method to generate the phase-space parameterization that is more efficient for higher final-state particle multiplicities and is better suited for parallelization.
4.1 Basics
The challenges of parallelization are thus twofold: (i) increase the fraction \(T_m/T_s\) by parallelizing all computing-intensive parts of the program; for instance, if \(T_s\) amounts to \(0.1\%\) of \(T_m\), the plateau is reached for \(n=1000\) workers. (ii) Ensure that at this saturation point, \(T_c(n)\) is still negligible. This can be achieved by (a) choosing a communication algorithm where \(T_c\) increases only with a low power of n, or (b) reducing the prefactor in \(T_c(n)\), which summarizes the absolute amount of communication and blocking per node.
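This cost model can be made explicit. Under the assumption (ours, for illustration) that communication grows like a power law \(T_c(n)=T_{c,0}\,n^p\), the speedup is simply \(T(1)/T(n)\):

```python
def speedup(n, t_s, t_m, t_c0=0.0, p=1.0):
    """Speedup T(1)/T(n) of the simple cost model
    T(n) = t_s + t_m / n + t_c0 * n**p, with serial time t_s,
    perfectly parallelizable time t_m, and a power-law communication
    cost (a modeling assumption, not a WHIZARD measurement)."""
    t_one = t_s + t_m
    t_n = t_s + t_m / n + t_c0 * n**p
    return t_one / t_n
```

For \(t_s=1\), \(t_m=1000\) (i.e. \(T_s = 0.1\%\) of \(T_m\)) and no communication cost, the speedup at \(n=1000\) workers is already half of its asymptotic plateau of 1001, matching the statement above; any nonzero `t_c0` eventually turns the curve over.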
4.2 Computing tasks in WHIZARD
The computing tasks performed by Whizard vary, and crucially depend on the type and complexity of the involved physics processes. They also depend on the nature of the problem, such as whether it involves parton distributions or beam spectra, the generation of event files, or scans over certain parameters, or whether it is a LO or NLO process. In order to better visualize the flow of computing tasks in Whizard, we show the screen output for a simple \(2\rightarrow 2\) process in Fig. 1.
To begin with, we therefore identify the major parts of the program and break them down into sections which, in principle, can contribute to either \(T_s\) (serial), \(T_m\) (parallel), or \(T_c\) (communication).
Sindarin All user input is expressed in terms of Sindarin expressions, usually collected in an input file. Interpreting the script involves preprocessing, which partly could be done in parallel. However, the Sindarin language structure allows for mixing declarations with calculation, so parallel preprocessing can introduce nontrivial communication. Since scripts are typically short anyway, we have not yet considered parallel evaluation in this area. This also applies to auxiliary calculations that are performed within Sindarin expressions.
Models Processing model definitions is done in advance by programs external to the Whizard core. We do not consider this as part of the Whizard workflow. Regarding the reading and parsing of the resulting model files by Whizard, the same considerations apply as for the Sindarin input. Nevertheless, for complicated models such as the MSSM, the internal handling of model data involves lookup tables. In principle, there is room for parallel evaluation. This has not been exploited so far, since it did not constitute a bottleneck.
Process construction Process construction with Whizard, i.e., setting up data structures that enable matrix-element evaluation, is delegated to programs external to the Whizard core. For tree-level matrix elements, the in-house O’Mega generator constructs Fortran code which is compiled and linked to the main program. For loop matrix elements, Whizard relies on programs such as GoSam, RECOLA, or OpenLoops. Parallelization capabilities would have to be provided by those external programs, and are currently absent. Therefore, process-construction time contributes to \(T_s\) only.
Phase-space construction Up to Whizard version 2.6.1, phase-space construction is performed internally (i.e. by the Whizard core), by a module which recursively constructs data structures derived from simplified Feynman graphs. The algorithm is recursive and does not lend itself to obvious parallelization; the resulting \(T_s\) contribution is one of the limiting factors.
A new algorithm, which is described below in Sect. 4.6, reuses the data structures from process construction via O’Mega. The current implementation is again serial (\(T_s\)), but significantly more efficient. Furthermore, since it does not involve recursion it can be parallelized if the need arises.

– Initialization. This part involves serial execution. If subsequent calculations are done in parallel, it also involves communication, once per process.
– Random-number generation. The Vamp integrator relies on random-number sequences. If we want parallel evaluation on separate workers, the random-number generator should produce independent, reproducible sequences without the necessity for communication or blocking.
– Vamp sampling. Separate sampling points involve independent calculation; thus this is a preferred target for turning serial into parallel evaluation. The (pseudo-)stratified algorithm involves some management, so communication may not be entirely avoidable.
– Phase-space kinematics. Multi-channel phase-space evaluation involves two steps: (i) computing the mapping from the unit hypercube to a momentum configuration, for a single selected phase-space channel, and (ii) computing the inverse mapping, for all other phase-space channels. The latter part is a candidate for parallel evaluation. The communication part involves distributing the momentum configuration for a single event. The same algorithm structure applies to the analogous discrete mappings introduced by the Vamp algorithm.
– Structure functions. The external PDF library for hadron collisions (Lhapdf) does not support intrinsic parallel evaluation. The same holds for the in-house Circe1/Circe2 beamstrahlung library.
– Matrix-element evaluation. This involves sums over quantum numbers: helicity, color, and particle flavor. These sums may be distributed among workers. The gain from parallel evaluation has to be weighed against the resulting communication. In particular, common-subexpression elimination and the caching of partial results do optimize serial evaluation, but can actually inhibit parallel evaluation or introduce significant extra communication.
– Grid adaptation. For a grid-adaptation step, results from all sampling points within a given iteration have to be collected, and the adapted grids have to be sent to the workers. Depending on how grids are distributed, this involves significant communication. The calculations for adapting grids consume serial time, which in principle could also be distributed.
– Integration results. Collecting those is essentially a by-product of adaptation, and thus does not involve extra overhead.
– Sampling for event generation is done in the form of strict importance sampling. This is actually simpler than sampling for integration.
– Events are further transformed or analyzed. This involves simple kinematic manipulations, or complex calculations such as parton shower and hadronization. The modules that are used for such tasks, such as Pythia or Whizard’s internal module, do not support intrinsic parallelization. Generating histograms and plots involves communication and some serial evaluation.
– Events are written to file. This involves communication and serial evaluation, either event by event, or by combining event files generated by distinct workers.
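As an aside on the random-number requirement listed above: one way to obtain reproducible, communication-free per-worker streams is a counter-based construction, where the i-th deviate of worker `rank` is derived from a hash of the pair (rank, i). The sketch below is a toy model of this idea only; it is not the generator employed by Whizard:

```python
import hashlib
import struct

def cbrng(rank, i):
    """Counter-based toy RNG: worker `rank` obtains its i-th uniform
    deviate by hashing (rank, i).  Streams are reproducible and need
    no inter-worker communication or blocking; illustrative only,
    not the generator used by Whizard."""
    digest = hashlib.sha256(struct.pack('<qq', rank, i)).digest()
    return int.from_bytes(digest[:8], 'little') / 2.0**64
```

Because each deviate depends only on (rank, i), any worker can regenerate any part of its stream independently, which is exactly the property demanded of the parallel random-number generator.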
Parameter scans. Evaluating the same process(es) for different sets of input data can be done by scripting a loop outside of Whizard. In that case, communication time merely consists of distributing the input once and collecting the output, e.g., filled plots or histograms. However, there are also contributions to \(T_s\), such as the compile time for process code. Alternatively, scans can be performed using Sindarin loop constructs. Such loops may be run in parallel. This avoids some of the \(T_s\) overhead, but requires communicating Whizard-internal data structures. Phase-space construction may contribute to either \(T_s\) or \(T_m\), depending on which input differs between workers. Process construction and evaluation essentially turns into \(T_m\). This potential has not been exploited yet, but may be in a future extension. The benefit would apply mainly to simple processes where the current parallel evaluation methods are not efficient due to a small \(T_m\) fraction.
4.3 Paradigms and tools for parallel evaluation
 1.
MPI (message-passing interface, cf. e.g. [53]). This protocol introduces a set of independent abstract workers which virtually do not share any resources. By default, the program runs on all workers simultaneously, with different data sets. (Newer versions of the protocol enable dynamic management of workers.) Data must be communicated explicitly via virtual buffers. With MPI 3, this communication can be set up asynchronously and non-blocking. The MPI protocol is well suited for computing clusters with many independent but interconnected CPUs with local memory. On such hardware, communication time cannot be neglected. For Fortran, MPI is available in the form of a library, combined with a special run manager.
 2.
OpenMP (open multi-processing, cf. e.g. [54]). This protocol assumes a common address space for data, which can be marked as local to workers if desired. There is no explicit communication. Instead, data-exchange constraints must be explicitly implemented in the form of synchronization barriers during execution. OpenMP thus maps onto the configuration of a shared-memory multi-core machine. We observe that with such a hardware setup, communication time need not exceed ordinary memory lookup. On the other hand, parallel execution in a shared-memory (and shared-cache) environment can run into race conditions. Fortran compilers support OpenMP natively, realized via standardized preprocessor directives.
 3.
Coarrays (cf. e.g. [55]). This is a recent native Fortran feature, introduced in the Fortran 2008 standard. The coarray feature combines semantics from both MPI and OpenMP, in the sense that workers are independent both in execution and in data, but upon request data can be tagged as mutually addressable. Such addressing implicitly involves communication.
 4.
Multithreading. This is typically an operatingsystem feature which can be accessed by application programs. Distinct threads are independent, and communication has to be managed by the operating system and kernel.
4.4 Random numbers and parallelization
Whizard uses pseudo-random numbers to generate events. Most random number generators have in common that they compute a reproducible stream of uniformly distributed random numbers \(\{x_i\} \in (0, 1)\) from a given starting point (seed), and that they have a relatively large period. In addition, the generated random numbers should not exhibit any common structures or global correlations. To verify these prerequisites, different test suites exist, based on statistical principles and other methods. One is the TestU01 library, implemented in ANSI C, which contains several tests for empirical randomness [56]. A very extensive collection of tests is the Die Hard suite [57], also known as Die Hard 1, which contains e.g. the squeeze test, the overlapping sums test, the parking lot test, the craps test, and the runs test [58]. There is also a more modern version of this test suite, Die Harder or Die Hard 2 [59], which contains e.g. the Knuthran [60] and the Ranlux [61, 62] tests. Furthermore, the computation of the pseudo-random numbers should add as little computation time as possible.
In order to utilize the TAO generator for a parallelized application, we either have to communicate each random number before or during sampling, both of which are expensive in time, or we have to prepare, or at least guarantee, independent streams of random numbers from different instances of TAO by initializing each sequence with a different seed. The latter is hardly feasible, or even impossible, to ensure for all combinations of seeds and numbers of workers. This, together with the (time-)restricted integer arithmetic, renders the TAO random number generator impractical for our parallelization task.
The overall sequence of random numbers is divided into streams of length \(2^{127}\); each of these streams is then further subdivided into substreams of length \(2^{76}\). Each stream or subsequent substream can be accessed by repeated application of the transition function \(x_{n} = T(x_{n-1})\). We rewrite the transition function as a matrix multiplication on a vector, making the linear dependence explicit, \(x_{n} = T \times x_{n-1}\). Using the power of modular arithmetic, the repeated application of the transition function can be precomputed and stored, making access to the (sub)streams as cheap as sampling a single number. In the context of the parallel evaluation of the random number generator, we can either obtain independent streams of random numbers for each worker or, conserving the numerical properties of the integration process, assign each channel a stream and each stratification cell of the integration grid a substream, identically in serial and parallel runs. Then we can easily distribute the workers among channels and cells without further concern about the random numbers.
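The skip-ahead mechanism just described — rewriting the recurrence as a matrix acting on the state and precomputing powers of the transition matrix by repeated squaring — can be illustrated with a toy two-term linear recurrence. The constants below are toy values, not the actual RNGstream parameters.

```python
# Skip-ahead for a linear recurrence via modular matrix exponentiation
# (toy constants, not the actual RNGstream parameters).

M = 2**31 - 1                          # toy modulus

def mat_mul(a, b, m):
    """Matrix product modulo m."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) % m
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_pow(t, e, m):
    """T^e mod m by repeated squaring: O(log e) multiplications, so even
    jumps of 2^127 steps can be precomputed once and reused."""
    n = len(t)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    while e:
        if e & 1:
            result = mat_mul(result, t, m)
        t = mat_mul(t, t, m)
        e >>= 1
    return result

# toy recurrence x_n = 3*x_{n-1} + 5*x_{n-2} (mod M), state = (x_{n-1}, x_{n-2})
T = [[3, 5], [1, 0]]

def step(state):                       # one application of the transition
    return [(3 * state[0] + 5 * state[1]) % M, state[0]]

def jump(state, steps):                # skip ahead without generating in between
    tj = mat_pow(T, steps, M)
    return [(tj[0][0] * state[0] + tj[0][1] * state[1]) % M,
            (tj[1][0] * state[0] + tj[1][1] * state[1]) % M]
```

Jumping directly to the start of a stream or substream in this way is what lets each channel and stratification cell receive its own reproducible portion of the sequence, independent of the number of workers.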
The original implementation of RNGstream was in C++, using floating-point arithmetic. We have reimplemented it in Fortran 2008 within Whizard.
4.5 Parallel evaluation in WHIZARD
To devise a strategy for parallel evaluation, we have analyzed the workflow and the scaling laws for different parts of the code, as described above. Complex multi-particle processes are the prime target of efficient evaluation. In general, such processes involve a large number of integration dimensions, a significant number of quantum-number configurations to be summed over, a large number of phase-space points per iteration of the integration procedure, and a large number of phase-space channels. By contrast, for a single phase-space channel the number of phase-space points remains moderate.
After the integration passes are completed, event generation in the simulation pass is another candidate for parallel execution. Again, a large number of phase-space points have to be sampled within the same computational model as during integration. Of the generated sample of partonic events, in the unweighted mode, only a small fraction is further processed. The subsequent steps of parton shower, hadronization, decays, and file output come with their own issues of computing (in)efficiency.
We address the potential for parallel evaluation by two independent protocols, OpenMP and MPI. Both frameworks may be switched on or off independently of each other.
4.5.1 Sampling with OpenMP
At a low level, we have implemented OpenMP as a protocol for parallel evaluation. The OpenMP paradigm is intended to distribute workers among the physical cores of a single computing node, where actual memory is shared between cores. While in principle the number of workers can be set freely by the user of the code, one expects improvements only as long as the number of workers is less than or equal to the number of physical cores. The number of OpenMP workers therefore is typically between 1 and 8 for standard hardware, and can be somewhat larger for specialized hardware. We apply OpenMP parallelization to the following loops:
 1.
The loop over helicities in the tree-level matrix-element code that is generated by O'Mega. For a typical \(2\rightarrow 6\) fermion process, the number of helicity combinations is \(2^8=256\) and thus fits the expected number of OpenMP workers. We do not parallelize the sums over flavor or color quantum numbers. In the current code model of O'Mega, those sums are subject to common-subexpression elimination, which inhibits trivial parallelization.
 2.
The loop over channels in the inverse mapping between phase-space parameters and momenta. Due to the large number of channels, the benefit is obvious, while the communication is minimal, and in any case is not a problem in a shared-memory setup.
 3.
Analogously, the loop over channels in the discrete inverse mapping of the phase-space parameters within the Vamp algorithm.
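The first item — distributing the helicity sum — is a plain reduction over independent terms, which can be sketched schematically in Python. Here threads stand in for OpenMP workers, and `amplitude` is a placeholder for the generated matrix-element code, which is assumed to return the amplitude for one helicity index.

```python
from concurrent.futures import ThreadPoolExecutor

# Schematic reduction of the helicity sum across workers (Python threads
# stand in for OpenMP threads; 'amplitude' is a placeholder for the
# generated matrix-element code returning the amplitude for one helicity).

def helicity_sum(amplitude, n_hel, n_workers=4):
    # round-robin chunks: worker w handles helicities w, w + n_workers, ...
    chunks = [range(w, n_hel, n_workers) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(
            lambda chunk: sum(abs(amplitude(h)) ** 2 for h in chunk), chunks)
    # final reduction of the per-worker partial sums
    return sum(partials)
```

The flavor and color sums would have the same reduction structure; as noted above, it is the common-subexpression elimination across their terms, not the sums themselves, that inhibits this trivial parallelization.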
4.5.2 Sampling with MPI
The MPI protocol is designed for computing clusters. We give a short introduction into the terminology and the development of its different standards in the next subsection. The MPI model assumes that memory is local to each node, so data have to be explicitly transferred between nodes if sharing is required. The number of nodes can become very large. In practice, for a given problem, the degree of parallelization and the amount of communication limit the number of nodes for which the approach is still practical. For Whizard, we apply the MPI protocol at a more coarse-grained level than OpenMP, namely the loop over sampling points which is controlled by the Vamp algorithm.
As discussed above, for standard multi-particle problems the number of phase-space channels is in general rather large, typically \(10^3\)–\(10^4\). In that case, we assign one separate channel or one subset of channels to each worker. In some calculations, the matrix element is computing-intensive but the number of phase-space channels is small (e.g. NNLO virtual matrix elements), so this model does not apply. In that case, we parallelize over single grids. We assign to each worker a separate slice of the \(n_{r}^{d}\) cells of the stratification space. In principle, for the simplest case of \(n_{r} = 2\), we can exploit up to \(2^{d}\) computing nodes for a single grid. On the other hand, parallelization over the \(r\)-space is only meaningful when \(n_{r} \ge 2\), especially when we take into account that \(n_{r}\) changes between different iterations, as the number of calls \(N_C\) depends on the multi-channel weights \(\alpha _i\). Hence, we implement a kind of auto-balancing, in the form that we choose between the two modes of parallelization before and during sampling, in order to handle the different scenarios accordingly. By default, we parallelize over phase-space channels, but prefer single-grid parallelization in the case that the number of cells in \(r\)-space is \(n_{r} > 1\). Because the single-grid parallelization is finer grained than the phase-space channel parallelization, this in principle allows more workers to be exploited. Furthermore, we note that the Monte Carlo integration itself does not exhibit any boundary conditions demanding communication during sampling (except when we impose such a condition by hand). In particular, there is no need to communicate random numbers. We discuss the details of the implementation later on.
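The auto-balancing choice and the static assignment of channels or cells to workers described above can be sketched as follows. The helper functions are hypothetical, not Whizard's actual interface.

```python
# Sketch of the mode choice and a static work partition (hypothetical
# helpers, not Whizard's actual interface).

def partition(n_items, n_workers):
    """Assign items 0 .. n_items-1 to workers as evenly as possible;
    returns one contiguous range of item indices per worker."""
    base, rest = divmod(n_items, n_workers)
    out, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < rest else 0)   # first 'rest' workers get one extra
        out.append(range(start, start + size))
        start += size
    return out

def choose_mode(n_channels, n_r, d, n_workers):
    """Prefer single-grid (cell) parallelization once the stratification
    space has n_r > 1 cells per dimension; otherwise distribute the
    phase-space channels among the workers."""
    if n_r > 1:
        return "cells", partition(n_r ** d, n_workers)
    return "channels", partition(n_channels, n_workers)
```

In this picture, each worker samples its assigned channels or cells independently; since every channel and cell owns its own random-number (sub)stream, no random numbers need to be communicated.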
4.5.3 The message-passing interface (MPI) standard
We give a short introduction into the terminology necessary to describe our implementation below, and into the message-passing interface (MPI) standard itself. The MPI standard specifies a large number of procedures, types, memory and process management facilities, and handlers, for different purposes. This wide range of functionality obscures a clear view on the problem of parallelization and unnecessarily complicates the problem itself. We therefore limit the use of this functionality to an absolute minimum. For example, we make use neither of the MPI shared-memory model nor, for the time being, of a dedicated process management for a server-client model. In the following, we introduce the most common terms. In the implementation details below, we again refer to the MPI processes as workers, in order not to confuse them with Whizard's physics processes.
The standard specifies MPI programs to consist of autonomous processes, each in principle running its own code, in MIMD^{3} style, cf. [64, p. 20]. In order to abstract from the underlying hardware and to allow separate communication between different program parts or libraries, the standard groups processes into communicators, detached from the underlying hardware, and introduces communication contexts. Context-based communication ensures that messages are always sent and received within their context, and do not interfere with messages in other contexts. Inside communicators, processes are collected into ordered sets (process groups) with assigned ranks (labels) \(0, \ldots , n-1\). The predefined MPI_COMM_WORLD communicator contains all processes known at the initialization of an MPI program, in a linearly ordered fashion. In most cases the linear order does not reflect the architecture of the underlying hardware and network infrastructure; therefore, the standard provides the possibility to map the processes onto the hardware and network infrastructure, in order to optimize the usage of resources and increase the speedup.
A conceivable way to optimize the parallelization via MPI is to make the MPI framework aware of the communication flows in the application. Within the group of processes in a communicator, not all processes will communicate with every other process. The network of inter-process communication is described by MPI topologies. The default is MPI_UNDEFINED, where no specific topology has been specified, while MPI_CART is a Cartesian (nearest-neighbor) topology. Special topologies can be defined with MPI_GRAPH. In this paper we focus only on the MPI parallelization of the Monte Carlo integrator Vamp. A specific profiling of the run times of our MPI parallelization could reveal topological structures in the communication which might offer potential for further speedup. This, however, is beyond the scope of this paper.
 non-blocking
A non-blocking procedure returns directly after initiating a communication process. The status of the communication must be checked by the user.
 blocking
A blocking procedure returns only after the communication process has completed.
 point-to-point
A point-to-point procedure communicates between a single sender and a single receiver.
 collective
A collective procedure communicates with the complete process group. Collective procedures must appear in the same order on all processes.
In order to ease the startup of a parallel application, the standard specifies the startup command mpiexec. However, we recommend the de-facto standard startup command mpirun, which is in general part of an MPI library. In this way, the user does not have to deal with the quirks of the overall process management and the interoperability with the operating system, as this is covered by mpirun. Furthermore, most MPI libraries support interfaces to cluster software, e.g. SLURM, Torque, or HTCondor.
In summary, we do not use the process and (shared-)memory management, topologies, and advanced error handling of MPI; these we postpone to future work.
4.5.4 Implementation details of the MPI parallelization
In this subsection, we give a short overview of the technical details of the implementation and explain how the algorithm works.
In order to minimize the communication time \(T_{\mathrm {c}}\), we only communicate runtime data which cannot be provided a priori by Whizard's pre-integration setup through the interfaces of the integrators. Furthermore, we expect the workers to run identical code, up to different communication-based code branches. The overall worker setup is externally coordinated by the process manager mpirun provided by the MPI library.
In order to enable file input/output (I/O), in particular to allow the setup of a process without user intervention, we implement the well-known master-slave model. The master worker, specified by rank 0, is allowed to set up the matrix-element library and to provide the phase-space configuration (or to write the grid files of Vamp), as these involve heavy I/O operations. The other workers function solely as slave workers, supporting only integration and event generation. Therefore, the slave workers have to wait during the setup phase of the master worker. We implement this dependence via a blocking call to MPI_BCAST for all slaves while the master goes through the setup steps. As soon as the master worker has finished the setup, it broadcasts a simple logical, which completes the blocked communication call of the slaves and allows the execution of the program to proceed. The slaves are then allowed to load the matrix-element library and read the phase-space configuration file in parallel. The slave setup adds a major contribution to the serial time, mainly out of our control, as the limitation on the parallel setup of the slave workers is imposed by the underlying file system and/or operating system when all workers try to read the files simultaneously. We expect the serial time to increase at least by the configuration time of Whizard, excluding the building of the matrix-element library and the phase-space configuration. Therefore, we expect this configuration time to increase at least linearly with the number of workers.
Where possible, we let objects communicate directly via Fortran 2008 type-bound procedures; e.g. the main Vegas grid object vegas_grid_t has vegas_grid_broadcast, as shown in Listing 3. The latter broadcasts all relevant grid information which is not provided by the API of the integrator. We have to send the number of bins to all processes before the actual grid binning happens, as the size of the grid array is larger than the actual number of bins requested by Vegas.^{5}
Further important explicit implementations are the two pairs of type-bound procedures vegas_send_distribution/vegas_receive_distribution and vegas_result_send/vegas_result_receive, which are needed for the communication steps involved in Vamp, in order to keep the Vegas integrator objects encapsulated (i.e. to preserve their private attribute).
Beyond the inclusion of non-blocking collective communication, we choose as a minimum prerequisite the major version 3 of MPI, for its better interoperability with Fortran and its conformity to the Fortran 2008 + TS 29113 (and later) standard [64, Sec. 17.1.6]. This, e.g., allows for MPI derived-type comparison as well as asynchronous support (for I/O or non-blocking communication).
A final note on the motivation for the usage of non-blocking procedures. Classic (i.e. serial) Monte Carlo integration exhibits no need for communication during sampling, in contrast to classic applications of parallelization, e.g. solving partial differential equations. For the time being, we still use non-blocking procedures in Vegas with future optimization in mind, but in a more or less blocking fashion, as most non-blocking procedures are directly followed by an MPI_WAIT. However, the multi-channel ansatz adds sufficient complexity, as each channel is itself an independent Monte Carlo integration. A typical use case is collecting the results of already sampled channels while still sampling the remaining channels, as this involves the largest data transfers in the parallelization setup. Here, we could benefit most from non-blocking communication. Implementing these procedures as non-blocking necessitates a further refactoring of the present multi-channel integration of Whizard, because in that case the master worker must not perform any calculation but should only coordinate communication. A further obstacle to demonstrating the impact of turning our remaining blocking communication into non-blocking communication is the fact that, at the current moment, no profilers compliant with the MPI 3.1 status support Fortran 2008. Therefore, we have to postpone demonstrating completely non-blocking communication in our setup.
4.5.5 Speedup and results
In order to assess the efficiency of our parallelization, we compare the two modes, the traditional serial Vamp implementation and our new parallelized implementation. We restrict ourselves to measuring the efficiency of the parallel integration, which, in contrast to parallel event generation, requires non-trivial communication. For the latter, we limited our efforts to extending the event generation to use the RNGstream algorithm, in order to secure independent random numbers among the workers, and to automatically splitting the event output into one file per worker, as we do not make use of parallel I/O. In this version, the event generation does not require any communication; we therefore postpone a detailed discussion of the efficiency of parallel event generation, as such a discussion would depend too much on environmental factors of the cluster used, such as the speed of the file system.
We benchmark the processes on the high-performance cluster of the University of Siegen (Hochleistungsrechner Universität Siegen, HorUS), which provides 34 Dell PowerEdge C6100 chassis, each containing 4 computing nodes with 2 CPUs. The nodes are connected by gigabit ethernet; the CPUs are Intel Xeon X5650 with 6 cores each at \(2.7\,\hbox {GHz}\) and \(128\,\hbox {MiB}\) cache. We employ two different Whizard builds, a first one only with OpenMPI 2.1.2, and a second one with additional OpenMP support, testing the hybrid parallelization. The HorUS cluster utilizes SLURM 17.02.2 as batch and allocation system, allowing for easy source distribution. We run Whizard using the MPI-provided run manager mpirun, measure the runtime with the GNU time command tool, and average over three independent runs for the final result. We measure the overall computation time of a Whizard run, including the complete process setup with matrix-element generation and phase-space configuration. It is expected that the setup step gives rise to the major part of the serial computation of Whizard, together with the I/O operations of the multi-channel integrator, which saves the grids after each integration iteration. As is quasi-standard, we benchmark over \(N_{\text {CPU}}\) in powers of 2. Given the architecture of the HorUS cluster with its dual hexa-core CPUs, benchmarking in powers of 6 might be more appropriate for the MPI-only measurements. We apply a node-specific setup for the measurement of the hybrid parallelization. Each CPU can handle up to six threads without any substantial throttling. We run over \(N_{\text {Threads}} = \{1, 2, 3, 6\}\) with either a fixed overall number of involved cores, \(N_{\text {Worker}} = \{60, 30, 20, 10\}\), with results shown in Fig. 4, or with a fixed number of workers, \(N_{\text {Worker}} = 20\), with results shown in Fig. 5.
A final comment on the usage of parallelized phase-space integration for higher-order processes. Clearly, multi-leg LO processes at high multiplicities are still one of the challenges in Monte-Carlo simulations, but the state of the art nowadays comprises automated packages that allow for NLO simulations, with an automatic setup to match the fixed-order results to parton showers (NLO+PS). There are also tools which can do specialized processes at NNLO (e.g. [6, 7]). The major bottlenecks for these are process-dependent: for some NNLO and specifically NLO EW multi-scale processes, virtual matrix elements exist only numerically and need several seconds per phase-space point, while in many other cases the real corrections (or double-real corrections in the case of NNLO) are the most computing-intensive parts of the calculation, due to the higher phase-space dimensionality and the subtracted singular structures (e.g. the high number of subtraction terms for multi-leg processes). In Ref. [46] the MPI parallelization presented here was used for the first time in a physics study, for like-sign WW scattering at the LHC at LO and LO+PS. Whizard's automation of NLO QCD corrections has not yet been completely finalized, particularly the optimization of NLO processes, so we decided not to discuss benchmark NLO processes in this paper. But in our validation, the MPI-parallelized integration presented here already plays an important role, as it reduces the times for adaptive integrations by more than an order of magnitude when using of the order of a hundred cores. Clearly, the complexity of multi-jet processes and of virtual multi-leg matrix elements is a major motivation for the development of parallelized phase-space integration. Again, event simulation can be trivially parallelized, but validations, scale variations, and other integration-intensive NLO projects benefit enormously from the parallelized integration algorithm presented here.
We will report on more detailed benchmarks for NLO processes in an upcoming publication [66].
4.6 Alternative algorithm for phase-space generation
Profiling of the code reveals that, for the moment, the main bottleneck that inhibits speedups beyond \(n=100\) is the initial construction of the phase-space configurations, i.e. the phase-space channels and their parameterizations (the determination of the best mappings for each channel or class of channels), which Whizard constructs from forests of Feynman tree graphs. Following the language of Ref. [67], this construction algorithm is called wood. The wood algorithm takes into account the model structure, namely the three-point vertices, to find resonant propagators, the actual mass values (to find collinear and soft singularities and to map mass edges), and the process energy. It turns out that while the default algorithm used in Whizard yields a good sample of phase-space channels to enable adaptive optimization, it has originally not been programmed in an efficient way. Though it is a recursive algorithm, it does not work on directed acyclic graphs (DAGs), as O'Mega does, to avoid all possible redundancies.
Therefore, a new algorithm, wood2, has been designed to overcome this problem. Instead of constructing the forest of parameterizations from the model, it makes use of the fact that the matrix elements constructed optimally by O'Mega in the form of a DAG already contain all the necessary information, with the exception of the numerical values for the masses and the collider energy. Thus, instead of building up the forest again, the algorithm takes a suitable description of the set of trees from O'Mega and applies the elimination and simplification algorithm, in order to retain only the most relevant trees as phase-space channels. As it turns out, even in a purely serial mode the new implementation performs much better and thus eliminates the most important source of saturation in the speedup. Another benefit of the new algorithm is that it is much less memory-hungry than the original one, whose memory consumption could have become a bottleneck for complicated processes in very complicated models (e.g. extensions of the MSSM).
5 Conclusions and outlook
Monte-Carlo simulations of elementary processes are an essential prerequisite for successful physics analyses at present and future colliders. High-multiplicity processes and high precision in signal and background detection put increasing demands on the required computing resources. One particular bottleneck is the multi-dimensional phase-space integration and the task of automatically determining a most efficient sampling for (unweighted) event generation. In this paper, we have described an efficient algorithm that employs automatic iterative adaptation in conjunction with parallel evaluation of the numerical integration and event generation.
The parallel evaluation is based on the paradigm of the message-passing interface protocol (MPI), in conjunction with OpenMP multi-threading. For the concrete realization, the algorithm has been implemented within the framework of the multi-purpose event generator Whizard. The parallelization support for MPI or OpenMP can be selected during the configure step of Whizard. The new code constitutes a replacement module for the Vamp adaptive multi-channel integrator which makes active use of modern features of the current MPI 3.1 standard. Our initial tests for a variety of benchmark physics processes demonstrate a speedup by a factor \(>10\) with respect to serial evaluation. The best results have been achieved by MPI parallelization. The new implementation has been incorporated in the release version Whizard 2.6.4.
We were able to show that, in general, hybrid parallelization with OpenMP and MPI leads to a speedup which is comparable to MPI parallelization alone. However, combining both approaches is beneficial for tackling memory-intensive processes, such as 8- or 10-particle processes. Depending on the particular computing-cluster topology, the latter approach can allow for a more efficient use of the memory locally available at a computing node. In the hybrid approach, Whizard is parallelized on individual multi-core nodes via OpenMP multi-threading, while distinct computing nodes communicate with each other via MPI. The setup of the system allows for sufficient flexibility to make optimal use of both approaches for a specific problem.
The initial tests point to further possibilities for improvement, which we foresee for future development and refinements of the implementation. A server/client structure should give the freedom to reallocate and assign workers dynamically during a computing task, and thus make more efficient use of the available resources. Further speedup can be expected from replacing the various remaining blocking communications by non-blocking communication, while preserving the integrity of the calculation. Finally, we note that the algorithm shows its potential for calculations that spend a lot of time in matrix-element evaluation. For instance, in some tests of NLO QCD processes we found that the time required for integration could be reduced from the order of a week down to a few hours. We defer a detailed benchmarking of such NLO processes to a future publication.
Footnotes
 1.
The number of free parameters is given by the number of bins per axis \(n_k\), restricted by the condition \(\sum _{j_k} \varDelta x_{kj_k} = 1\).
 2.
There would of course be the possibility to have a special version of Whizard only available for the newest version(s) of compilers to test those features. This is part of a future project.
 3.
Multiple instructions, multiple data. Machines supporting MIMD have a number of processes running asynchronously and independently.
 4.
The grid type holds information on the binning \(x_i\), the number of dimensions, the integration boundaries, and the Jacobian.
 5.
The size of the grid array is set to a predefined or user-defined value. Only if the implementation switches to stratified sampling is the number of bins adjusted to the number of boxes/cells; hence, it does not necessarily match the size of the grid array.
Acknowledgements
The authors want to thank Bijan Chokoufé Nejad, Stefan Hoeche and Thorsten Ohl for helpful and interesting discussions. For their contributions to the new phasespace construction algorithm (known as wood2) we give special credits to Manuel Utsch and Thorsten Ohl.
References
1. R. Kleiss, W.J. Stirling, S.D. Ellis, A new Monte Carlo treatment of multiparticle phase space at high energies. Comput. Phys. Commun. 40, 359 (1986). https://doi.org/10.1016/0010-4655(86)90119-0
2. S. Plätzer, RAMBO on diet (2013). arXiv:1308.2922 [hep-ph]
3. T. Gleisberg et al., Event generation with SHERPA 1.1. JHEP 02, 007 (2009). https://doi.org/10.1088/1126-6708/2009/02/007. arXiv:0811.4622 [hep-ph]
4. J. Alwall et al., The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP 07, 079 (2014). https://doi.org/10.1007/JHEP07(2014)079. arXiv:1405.0301 [hep-ph]
5. Physics Event Generator Computing Workshop, CERN, 26–28 Nov 2018. https://indico.cern.ch/event/751693/
6. M. Grazzini, S. Kallweit, M. Wiesemann, Fully differential NNLO computations with MATRIX. Eur. Phys. J. C 78(7), 537 (2018). https://doi.org/10.1140/epjc/s10052-018-5771-7. arXiv:1711.06631 [hep-ph]
7. J.M. Campbell, R.K. Ellis, W.T. Giele, A multi-threaded version of MCFM. Eur. Phys. J. C 75(6), 246 (2015). https://doi.org/10.1140/epjc/s10052-015-3461-2. arXiv:1503.06182 [physics.comp-ph]
8. W. Kilian, T. Ohl, J. Reuter, WHIZARD: simulating multi-particle processes at LHC and ILC. Eur. Phys. J. C 71, 1742 (2011). https://doi.org/10.1140/epjc/s10052-011-1742-y. arXiv:0708.4233 [hep-ph]
9. N.D. Christensen et al., Introducing an interface between WHIZARD and FeynRules. Eur. Phys. J. C 72, 1990 (2012). https://doi.org/10.1140/epjc/s10052-012-1990-5. arXiv:1010.3251 [hep-ph]
10. M. Moretti, T. Ohl, J. Reuter, O'Mega: an optimizing matrix element generator, pp. 1981–2009 (2001). arXiv:hep-ph/0102195
11. T. Ohl, J. Reuter, Clockwork SUSY: supersymmetric Ward and Slavnov–Taylor identities at work in Green's functions and scattering amplitudes. Eur. Phys. J. C 30, 525–536 (2003). https://doi.org/10.1140/epjc/s2003-01301-7. arXiv:hep-th/0212224
12. T. Ohl, J. Reuter, Testing the noncommutative standard model at a future photon collider. Phys. Rev. D 70, 076007 (2004). https://doi.org/10.1103/PhysRevD.70.076007. arXiv:hep-ph/0406098
13. K. Hagiwara et al., Supersymmetry simulations with off-shell effects for CERN LHC and ILC. Phys. Rev. D 73, 055005 (2006). https://doi.org/10.1103/PhysRevD.73.055005. arXiv:hep-ph/0512260
14. W. Kilian et al., QCD in the color-flow representation. JHEP 10, 022 (2012). https://doi.org/10.1007/JHEP10(2012)022. arXiv:1206.3700 [hep-ph]
15. B.C. Nejad, T. Ohl, J. Reuter, Simple, parallel virtual machines for extreme computations. Comput. Phys. Commun. 196, 58–69 (2015). https://doi.org/10.1016/j.cpc.2015.05.015
16. W. Kilian et al., An analytic initial-state parton shower. JHEP 04, 013 (2012). https://doi.org/10.1007/JHEP04(2012)013. arXiv:1112.1039 [hep-ph]
17. T. Sjöstrand, S. Mrenna, P.Z. Skands, PYTHIA 6.4 physics and manual. JHEP 05, 026 (2006). https://doi.org/10.1088/1126-6708/2006/05/026. arXiv:hep-ph/0603175
18. T. Sjöstrand et al., An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, 159–177 (2015). https://doi.org/10.1016/j.cpc.2015.01.024. arXiv:1410.3012 [hep-ph]
19. F. Cascioli, P. Maierhofer, S. Pozzorini, Scattering amplitudes with open loops. Phys. Rev. Lett. 108, 111601 (2012). https://doi.org/10.1103/PhysRevLett.108.111601. arXiv:1111.5206 [hep-ph]
20. F. Buccioni, S. Pozzorini, M. Zoller, On-the-fly reduction of open loops. Eur. Phys. J. C 78(1), 70 (2018). https://doi.org/10.1140/epjc/s10052-018-5562-1. arXiv:1710.11452 [hep-ph]
21. G. Cullen et al., Automated one-loop calculations with GoSam. Eur. Phys. J. C 72, 1889 (2012). https://doi.org/10.1140/epjc/s10052-012-1889-1. arXiv:1111.2034 [hep-ph]
22. G. Cullen et al., GoSam-2.0: a tool for automated one-loop calculations within the Standard Model and beyond. Eur. Phys. J. C 74(8), 3001 (2014). https://doi.org/10.1140/epjc/s10052-014-3001-5. arXiv:1404.7096 [hep-ph]
23. S. Actis et al., RECOLA: recursive computation of one-loop amplitudes. Comput. Phys. Commun. 214, 140–173 (2017). https://doi.org/10.1016/j.cpc.2017.01.004. arXiv:1605.01090 [hep-ph]
24. S. Frixione, Z. Kunszt, A. Signer, Three jet cross-sections to next-to-leading order. Nucl. Phys. B 467, 399–442 (1996). https://doi.org/10.1016/0550-3213(96)00110-1. arXiv:hep-ph/9512328
25. R. Frederix et al., Automation of next-to-leading order computations in QCD: the FKS subtraction. JHEP 10, 003 (2009). https://doi.org/10.1088/1126-6708/2009/10/003
26. T. Ježo, P. Nason, On the treatment of resonances in next-to-leading order calculations matched to a parton shower. JHEP 12, 065 (2015). https://doi.org/10.1007/JHEP12(2015)065. arXiv:1509.09071 [hep-ph]
27. J. Reuter et al., Automation of NLO processes and decays and POWHEG matching in WHIZARD. J. Phys. Conf. Ser. 762, 012059 (2016). https://doi.org/10.1088/1742-6596/762/1/012059
28. W. Kilian, J. Reuter, T. Robens, NLO event generation for chargino production at the ILC. Eur. Phys. J. C 48, 389–400 (2006). https://doi.org/10.1140/epjc/s10052-006-0048-y. arXiv:hep-ph/0607127
29. T. Binoth et al., Next-to-leading order QCD corrections to \(pp \rightarrow b\bar{b}b\bar{b} + X\) at the LHC: the quark induced case. Phys. Lett. B 685, 293–296 (2010). https://doi.org/10.1016/j.physletb.2010.02.010. arXiv:0910.4379 [hep-ph]
30. N. Greiner et al., NLO QCD corrections to the production of two bottom–antibottom pairs at the LHC. Phys. Rev. Lett. 107, 102002 (2011). https://doi.org/10.1103/PhysRevLett.107.102002. arXiv:1105.3624 [hep-ph]
31. B.C. Nejad et al., NLO QCD predictions for off-shell \(t\bar{t}\) and \(t\bar{t}H\) production and decay at a linear collider. JHEP 12, 075 (2016). https://doi.org/10.1007/JHEP12(2016)075. arXiv:1609.03390 [hep-ph]
32. F. Bach et al., Fully-differential top-pair production at a lepton collider: from threshold to continuum. JHEP 03, 184 (2018). https://doi.org/10.1007/JHEP03(2018)184. arXiv:1712.02220 [hep-ph]
33. G.P. Lepage, A new algorithm for adaptive multidimensional integration. J. Comput. Phys. 27(2), 192–203 (1978). https://doi.org/10.1016/0021-9991(78)90004-9
34. G.P. Lepage, VEGAS—an adaptive multidimensional integration program. Tech. rep. CLNS-447, Cornell Univ. Lab. Nucl. Stud., Ithaca (1980). http://cds.cern.ch/record/123074
35. R. Kleiss, R. Pittau, Weight optimization in multichannel Monte Carlo. Comput. Phys. Commun. 83(2–3), 141–146 (1994). https://doi.org/10.1016/0010-4655(94)90043-4
36. T. Ohl, Vegas revisited: adaptive Monte Carlo integration beyond factorization. Comput. Phys. Commun. 120(1), 13–19 (1999). https://doi.org/10.1016/s0010-4655(99)00209-x
37. F. James, Monte Carlo theory and practice. Rep. Prog. Phys. 43(9), 1145 (1980)
38. P.D. Draggiotis, A. van Hameren, R. Kleiss, SARGE: an algorithm for generating QCD antennas. Phys. Lett. B 483, 124–130 (2000). https://doi.org/10.1016/S0370-2693(00)00532-3. arXiv:hep-ph/0004047
39. A. van Hameren, R. Kleiss, Generating QCD antennas. Eur. Phys. J. C 17, 611–621 (2000). https://doi.org/10.1007/s100520000508. arXiv:hep-ph/0008068
40. B. Gough, GNU Scientific Library Reference Manual (Network Theory Ltd., Boston, 2009)
41. M. Beyer et al., Determination of new electroweak parameters at the ILC—sensitivity to new physics. Eur. Phys. J. C 48, 353–388 (2006). https://doi.org/10.1140/epjc/s10052-006-0038-0. arXiv:hep-ph/0604048
42. A. Alboteanu, W. Kilian, J. Reuter, Resonances and unitarity in weak boson scattering at the LHC. JHEP 11, 010 (2008). https://doi.org/10.1088/1126-6708/2008/11/010. arXiv:0806.4145 [hep-ph]
43. W. Kilian et al., High-energy vector boson scattering after the Higgs discovery. Phys. Rev. D 91, 096007 (2015). https://doi.org/10.1103/PhysRevD.91.096007. arXiv:1408.6207 [hep-ph]
44. W. Kilian et al., Resonances at the LHC beyond the Higgs boson: the scalar/tensor case. Phys. Rev. D 93(3), 036004 (2016). https://doi.org/10.1103/PhysRevD.93.036004. arXiv:1511.00022 [hep-ph]
45. C. Fleper et al., Scattering of W and Z bosons at high-energy lepton colliders. Eur. Phys. J. C 77(2), 120 (2017). https://doi.org/10.1140/epjc/s10052-017-4656-5. arXiv:1607.03030 [hep-ph]
46. A. Ballestrero et al., Precise predictions for same-sign W-boson scattering at the LHC. Eur. Phys. J. C 78(8), 671 (2018). https://doi.org/10.1140/epjc/s10052-018-6136-y. arXiv:1803.07943 [hep-ph]
47. S. Brass et al., Transversal modes and Higgs bosons in electroweak vector-boson scattering at the LHC. Eur. Phys. J. C 78(11), 931 (2018). https://doi.org/10.1140/epjc/s10052-018-6398-4. arXiv:1807.02512 [hep-ph]
48. J. Reuter, D. Wiesler, Distorted mass edges at LHC from supersymmetric leptoquarks. Phys. Rev. D 84, 015012 (2011). https://doi.org/10.1103/PhysRevD.84.015012. arXiv:1010.4215 [hep-ph]
49. N. Pietsch et al., Extracting gluino endpoints with event topology patterns. JHEP 07, 148 (2012). https://doi.org/10.1007/JHEP07(2012)148. arXiv:1206.2146 [hep-ph]
50. J. Reuter, D. Wiesler, A fat gluino in disguise. Eur. Phys. J. C 73(3), 2355 (2013). https://doi.org/10.1140/epjc/s10052-013-2355-4. arXiv:1212.5559 [hep-ph]
51. G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring) (ACM, Atlantic City, 1967), pp. 483–485. https://doi.org/10.1145/1465482.1465560
52. J.L. Gustafson, Reevaluating Amdahl's law. Commun. ACM 31(5), 532–533 (1988). https://doi.org/10.1145/42411.42415
53. W. Gropp et al., Using MPI & Using MPI-2 (MIT Press, Cambridge, 2014)
54. R. Chandra et al., Parallel Programming in OpenMP (Morgan Kaufmann Publishers Inc., San Francisco, 2001)
55. R.W. Numrich, Parallel Programming with Co-Array Fortran (CRC Press/Taylor & Francis, Boca Raton, 2018)
56. P. L'Ecuyer, R. Simard, TestU01: a C library for empirical testing of random number generators. ACM Trans. Math. Softw. 33(4), 22:1–22:40 (2007). https://doi.org/10.1145/1268776.1268777
57. G. Marsaglia, The Marsaglia Random Number CD-ROM including the Diehard Battery of Tests of Randomness (1995). https://web.archive.org/web/20160125103112/http://stat.fsu.edu/pub/diehard/
58. A. Wald, J. Wolfowitz, On a test whether two samples are from the same population. Ann. Math. Stat. 11(2), 147–162 (1940). https://doi.org/10.1214/aoms/1177731909
59. R.G. Brown, D. Eddelbüttel, D. Bauer, Dieharder: a random number test suite (2018). http://webhome.phy.duke.edu/~rgb/General/dieharder.php
60. D.E. Knuth, The Art of Computer Programming, Volume 2 (3rd edn.): Seminumerical Algorithms (Addison-Wesley Longman Publishing Co., Inc., Boston, 1997)
61. M. Lüscher, A portable high quality random number generator for lattice field theory simulations. Comput. Phys. Commun. 79, 100–110 (1994). https://doi.org/10.1016/0010-4655(94)90232-1. arXiv:hep-lat/9309020
62. L.N. Shchur, P. Butera, The RANLUX generator: resonances in a random walk test. Int. J. Mod. Phys. C 9, 607–624 (1998). https://doi.org/10.1142/S0129183198000509. arXiv:hep-lat/9805017
63. P. L'Ecuyer et al., An object-oriented random-number package with many long streams and substreams. Oper. Res. 50(6), 1073–1075 (2002). https://doi.org/10.1287/opre.50.6.1073.358
64. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.1 (High Performance Computing Center (HLRS), Stuttgart, 2015)
65. R. Kreckel, Parallelization of adaptive MC integrators. Comput. Phys. Commun. 106(3), 258–266 (1997). https://doi.org/10.1016/s0010-4655(97)00099-4
66. B.C. Nejad et al., work in progress (2019)
67. E. Boos, T. Ohl, Minimal gauge invariant classes of tree diagrams in gauge theories. Phys. Rev. Lett. 83, 480–483 (1999). https://doi.org/10.1103/PhysRevLett.83.480. arXiv:hep-ph/9903357
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Funded by SCOAP\(^{3}\).