Automated event generation for loop-induced processes

We present the first fully automated implementation of cross-section computation and event generation for loop-induced processes. This work is integrated into the MadGraph5_aMC@NLO framework. We describe the optimisations implemented at the level of the matrix element evaluation, phase-space integration and event generation, allowing for the simulation of loop-induced processes of large multiplicity. Along with selected differential observables, we illustrate our results with a table showing inclusive cross-sections for all loop-induced hadronic scattering processes with up to three final-state particles in the SM, as well as for some relevant 2 → 4 processes. Many of these are computed here for the first time.


1 Introduction
The first run of the Large Hadron Collider (LHC) further confirmed the Standard Model (SM) in spectacular fashion with the discovery of what was its last missing piece: the Higgs boson [1][2][3][4][5]. This discovery testifies not only to the technical success of the accelerator and detectors, but also to the level of accuracy reached by the theoretical predictions intervening at many levels of a discovery or exclusion at a collider. One can identify three main types of effort at the origin of the precision reached by modern high energy physics simulations. First, dedicated single-purpose analytical computations keep pushing the boundaries of perturbative physics and are essential for realistic inclusive predictions (e.g. Higgs hadroproduction at N3LO [6]). Secondly, parton shower Monte-Carlo programs such as Herwig++ [7], Pythia8 [8] and Sherpa [9] have improved their formal control over soft-physics resummation as well as over the techniques for merging with hard matrix-element predictions. Finally, the last decade has seen the rise of a number of automated one-loop matrix element computation tools, such as MadLoop [10], FeynArts [11,12], OpenLoops [13] and GoSam [14]. Thanks to this progress, one can obtain accurate normalisations and distributions for essentially any process of interest at NLO accuracy, using flexible partonic hard event generators such as POWHEG [15][16][17], Sherpa and the framework in which the present work is carried out, MadGraph5_aMC@NLO [18] (referred to as MG5aMC henceforth).
The aforementioned tools are crucial for both experimentalists and model builders to simulate backgrounds and explore new physics signals. However, despite these efforts, event generation for loop-induced processes, i.e. processes without any tree-level contribution that start at one loop, is still not available in a systematic and fully automated way. This is to be contrasted with the fact that loop-induced amplitudes for many relevant processes are already available in the main public one-loop matrix element providers [10][11][12][13][14]. The aim of this work is to remedy this situation by providing an efficient and generic technique for the simulation of loop-induced processes, which play a significant role both in the SM and beyond. Loop-induced processes take their name from the fact that their leading-order (LO) contribution already comes from loop amplitudes, so that loop integrals are unavoidable even for the crudest computation of the cross-section. Exceptions to this are certain processes where the loop featuring heavy particles can be integrated out to form an effective point-like vertex, turning the loop topology into a tree one. However, this approximation is typically valid only in a limited kinematic range, so that a general implementation of the direct computation of loop-induced processes is desirable.
In the SM, a prime example is the gluon fusion channel for Higgs boson production, which dominates the inclusive production, mainly because of the large gluon luminosity in high energy hadron colliders. Besides this obvious case, loop-induced channels can also amount to a significant part of some NNLO corrections; it was for example recently shown in ref. [19] that the contribution of the loop-induced gg → ZZ channel represents 60% of the full NNLO correction to hadronic Z-boson pair production. Loop-induced processes are also often relevant in the context of beyond the Standard Model (BSM) models, where the associated loop suppression factor can help to evade current experimental bounds.
The attitude towards the computation of loop-induced processes has so far been that of developing single-purpose specific codes (see references in sect. 3). Given current loop technology, it is desirable to adopt a solution that is both generic in the process studied and flexible in its use. We present here the implementation of this solution within the public framework MG5aMC. Loop matrix elements are computed with MadLoop, using a combination of the Ossola-Papadopoulos-Pittau (OPP) [20,21] reduction method as implemented in CutTools [22] and various Tensor Integral Reduction (TIR) tools, such as PJFry [23], Golem95 [24,25] and the in-house implementation IREGI. The integration over phase-space and the event generation are performed by MadEvent [26], the multi-purpose phase-space integrator used for LO calculations in MG5aMC. The paper is organised as follows. Sect. 2 describes the various improvements brought to MadLoop and MadEvent to cope with loop-induced processes, for which the absence of underlying tree topologies renders many optimisations inapplicable. Sect. 3 presents a comprehensive list of cross-sections for SM loop-induced processes obtained with our implementation. In sect. 4, we focus on loop-induced Higgs production to illustrate how the various simulation features of MG5aMC apply in this case. A systematic comparison of our results with those obtained using the Effective Field Theory (EFT) approach shows where and to what extent the latter is a good approximation. Sect. 5 is devoted to the validation of our implementation in a BSM context against the independent computation of ref. [27] for loop-induced Z-Higgs associated production. We also show results for loop-induced charged Higgs pair production. We summarise our work in sect. 6.

Loop computation, TIR implementation and colour decomposition
The main challenge in the integration of loop-induced processes is the evaluation speed of the corresponding loop matrix element. In NLO computations, the bulk of the contribution comes from real-emission and Born topologies, so that only limited statistics are necessary for the evaluation of the virtual contribution (see sect. 2.4.3 of ref. [18]); it is then very often not the limiting factor in terms of computational load. The situation is radically different for the direct integration of loop-induced processes, where the total computation time is driven by the evaluation time of the loop matrix element.
To understand the characteristics of the various techniques available in MadLoop for the computation of loop-induced squared matrix elements, it is appropriate to start from its generic expression, where $A^{\rm LI}$ designates the loop-induced amplitude, $\lambda_{l_i}$ are colour structures, $N_{h,l_i}$ are loop integrand numerators and $\bar{D}_{i,l_j}$ are $d$-dimensional propagator denominators of the form $(\bar{\ell}+k_{i,l_j})^2-m^2_{i,l_j}$. The symbols $\sum_h$, $\sum_{\rm colour}$ and $\sum_{l_i}$ denote the sums over all helicity configurations, colour configurations and loop subamplitudes factoring a single colour factor. 1 The integers $H$, $L$ and $n_{l_i}$ are the total number of helicity configurations, loop subamplitudes and loop propagator denominators in subamplitude $l_i$, respectively. Notice that no UV counterterms are necessary in this case given that loop-induced processes are finite. Conversely, R2 counterterms [28][29][30], originating from the 4-dimensional nature of the reduction methods applied in MadLoop, must be included. 2 [1 These subamplitudes are in one-to-one correspondence with the amplitudes of the constituting Feynman diagrams, except for those involving vertices featuring multiple colour factors (e.g. the four-gluon vertex). 2 Counterterms are selected and constructed with the same algorithm as for NLO virtual matrix element generation, i.e. directly starting from the loop Feynman diagrams considered, hence guaranteeing the consistency of the computation.] The $|A^{\rm LI}_{\text{non-}R_2}|^2$ term of eq. 2.1 is the new element specific to loop-induced matrix element computations, and we now turn to detailing its implementation in MadLoop. The key characteristic of the quantity $|A^{\rm LI}_{\text{non-}R_2}|^2$ is that it is not linearly dependent on the loop amplitudes. This has dramatic implications on the type of optimisations applicable to its computation. When replacing the loop integral by the formal reduction operator Red[] (symbolically denoting the application of any one of the available reduction tools interfaced
to MadLoop), one finds: Contrary to NLO computations, where loop amplitudes interfere with tree-level ones, it is clear that the Red[] operator cannot be pulled out of the sum over helicity configurations so as to apply at the squared matrix element level instead (see the transition from eq. 2.75 to eq. 2.76 of ref. [18]). The expanded sum $\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}$ contains $L^2$ terms, and this quadratic scaling with the number of diagrams is problematic when compared to the apparent 3 linear scaling of the total loop computation time. As a result, the computation time becomes dominated by the squaring operation for loop-induced processes with as few as a thousand diagrams. To circumvent this scaling, we consider here the same solution as the one adopted for tree-level computations, namely colour decomposition. Indeed, the origin of this problem can be traced back to the size of the colour basis, built out of a total of $L$ basis vectors $\lambda_l$ which are not linearly independent. The solution then consists in projecting the colour factors $\lambda_l$ onto a colour-flow basis [31] built out of $K$ colour-flow basis vectors $\kappa_i$ (chains of Kronecker delta structures with indices in the (anti-)fundamental representation of $SU(3)_c$). The growth of $K$ with the multiplicity is power-like, hence guaranteeing that the computational cost of the colour algebra involved in the amplitude-squaring operation remains negligible with respect to that of the loop amplitude computation. For example, the process gg → hggg has 3330 subamplitudes but only 24 different colour flows. Each loop colour factor is then projected onto the colour-flow basis as follows: Notice that in practice, the projection matrix is sparse, so that most of the projection coefficients $\alpha_{l,i}$ are zero and the sum involves only a few terms.
The corresponding colour matrix is built from the pairwise inner products of the colour-flow basis vectors, $C_{ij} = \sum_{\rm colour} \kappa_i^{\dagger}\kappa_j$. Once expressed in the colour-flow basis, the loop-induced squared amplitude reads as a quadratic form $\sum_{i,j} J^{\dagger}_{i,h}\, C_{ij}\, J_{j,h}$, where $J_{i,h} = \sum_l \alpha_{l,i} A_{l,h}$ corresponds to the partial colour amplitudes built from sums of the Lorentz part of the loop amplitudes weighted by the projection coefficients. This change of colour basis is not only computationally advantageous, but also more physical, since the partial colour amplitudes are gauge invariant. It also automatically solves the problem of assigning a colour flow to partonic events, which amounts to specifying the colour dipole pairs in the starting conditions of parton shower Monte-Carlo programs and thus has an important impact on the radiation pattern. This colour projection is performed automatically by MG5aMC, using the same colour algebra module employed for tree-level matrix element generation. However, contrary to the tree-level case, the projection coefficients $\alpha_{l,i}$ can be of different orders in the colour expansion in $N_c$ because of the loop colour trace. This effect is accounted for when assigning colours to the generated events, in which case only the leading term in the colour expansion is kept. Also, thanks to the flexibility of the colour module performing the algebra, arbitrary colour structures can be supported, as well as the definition of any other basis, should that prove necessary. The computation of the partial colour amplitudes has also been made available for the standard loop and Born interference term appearing in NLO computations (see appendix A.2).
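The gain from this projection can be illustrated with a minimal numpy sketch (toy sizes and random coefficients, not actual SU(3) colour factors): squaring the $K$ partial amplitudes through a small $K \times K$ colour matrix reproduces, by construction, the naive sum over all $L^2$ amplitude pairings at a fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: L loop subamplitudes projected onto K colour flows (K << L at high multiplicity)
L, K = 12, 3
A = rng.normal(size=L) + 1j * rng.normal(size=L)          # Lorentz parts of the loop amplitudes
alpha = rng.integers(-1, 2, size=(L, K)).astype(complex)  # sparse projection coefficients alpha_{l,i}
M = rng.integers(0, 3, size=(K, K))
C = (M + M.T).astype(float)                               # toy symmetric colour-flow matrix

# Colour-flow route: K partial amplitudes, then a K x K colour sum (cost ~ L*K + K^2)
J = alpha.T @ A                                           # J_i = sum_l alpha_{l,i} A_l
sq_flow = (J.conj() @ C @ J).real

# Naive route: square all L^2 amplitude pairings in the original basis (cost ~ L^2)
lam = alpha.conj() @ C @ alpha.T                          # <lambda_l1 | lambda_l2>
sq_naive = (A.conj() @ lam @ A).real
```

The two routes agree exactly; only their cost differs, which is the point of the colour-flow basis when $L$ grows into the thousands.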
The inability to perform loop reduction at the squared matrix element level leads to a crucial difference between the TIR and OPP reduction methods. To understand why this is so, we detail the expression taken by the Red[] operation in both cases. The integrand numerator is decomposed as a polynomial in the loop momentum, with the $C^{(r)}_{\mu_1\ldots\mu_r;h,l}$ referred to as the polynomial coefficients. In the case of OPP reduction, this decomposition only serves the purpose of improving the efficiency of the computation, since the polynomial coefficients can be recycled for all the different values of the loop momentum for which $N_{l,h}(\bar{\ell})$ must be evaluated. In the case of TIR, this decomposition is essential, since the objects reduced are the tensor integrals $T^{(r)}_l$. We can now write the more precise form taken by OPP and TIR reduction (eq. 2.8). It is now manifest that the output of the OPP reduction depends on both the loop and the helicity considered, while the TIR output depends only on the loop considered (i.e. the tensor integrals $T^{(r)}_l$ only carry a dependence on the index $l$, not $h$). For this reason, the number of independent OPP reductions performed per kinematic configuration is necessarily $L \times H$, that is, the number of loop amplitudes times the number of helicity configurations. On the other hand, the number of independent 4 TIR reductions per phase-space point is only proportional to $L$, since tensor integrals can be recycled across helicity configurations. For this reason, the evaluation of loop-induced matrix elements summed over helicity configurations is typically faster using TIR. Conversely, for the computation of a single helicity configuration, OPP-type reduction is preferable, since the implementations of TIR currently available in MadLoop, namely PJFry, Golem95 5 and the in-house implementation IREGI, are slower for an equal number of calls to the Red[] operator (see appendix B for quantitative results on benchmark processes).
Note that this statement is highly dependent on the TIR implementation considered, and comparisons presented in ref. [13] suggest that it might not hold true for the private TIR implementation in COLLIER [32,33]. However, due to the different scaling of the complexity of TIR and OPP reduction with the rank of the loop integral, it is expected that OPP is always faster for larger ranks (typically $r \gtrsim 6$).
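For reference, the polynomial decomposition of the loop numerator and the tensor integrals handled by TIR take the standard forms (our transcription, with $r_{\max}$ the maximal rank and indices as in the text):

```latex
N_{l,h}(\bar{\ell}) \;=\; \sum_{r=0}^{r_{\max}}
  C^{(r)}_{\mu_1\ldots\mu_r;h,l}\,\bar{\ell}^{\mu_1}\cdots\bar{\ell}^{\mu_r},
\qquad
T^{(r)}_{l} \;=\; \int \mathrm{d}^d\bar{\ell}\;
  \frac{\bar{\ell}^{\mu_1}\cdots\bar{\ell}^{\mu_r}}{\bar{D}_{l_1}\cdots\bar{D}_{l_{n_l}}}\, .
```

The helicity dependence sits entirely in the coefficients $C^{(r)}$, which is why the tensor integrals can be recycled across helicity configurations.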
Given the above, it is unclear whether event generation is more efficient using TIR and an explicit sum over helicity configurations or using OPP reduction and a Monte-Carlo (MC) sampling over them. The outcome mostly depends on the quality of the discrete importance sampling over helicities (i.e. the magnitude of its mild dependence on kinematics), the runtime speed of TIR implementations and the relative importance of the computational cost of the determination of the polynomial coefficients (which scales like $L \times H$). Our tests show that generating events with an MC over helicity configurations, using an independent importance sampling for each integration channel, yields better timings, and we use it as our default (see sect. 2.2).
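The counting above can be summarised in a toy cost model (illustrative only; the function name and the numbers below are not benchmarks from the paper):

```python
def reductions_per_point(n_loops, n_helicities, method):
    """Independent loop reductions per phase-space point, per the counting in the text
    (L loop subamplitudes, H helicity configurations)."""
    if method == "OPP_sum":   # OPP with explicit helicity sum: one reduction per (l, h) pair
        return n_loops * n_helicities
    if method == "TIR_sum":   # TIR with explicit helicity sum: tensor integrals recycled over h
        return n_loops
    if method == "OPP_mc":    # OPP with Monte-Carlo over helicities: one helicity per point
        return n_loops
    raise ValueError(f"unknown method: {method}")

# Example: 100 loop subamplitudes, 16 helicity configurations
print(reductions_per_point(100, 16, "OPP_sum"))  # 1600 reductions per point
print(reductions_per_point(100, 16, "TIR_sum"))  # 100, but each TIR call may be slower
```

The trade-off is then between fewer (but individually slower) TIR calls per point and more phase-space points when Monte-Carlo sampling over helicities.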
We now turn to the description of two additional minor improvements to MadLoop. First, many loop-induced processes occur only via closed fermion loops, where the different massless flavours often bring exactly the same contribution and can therefore be recycled. Identical diagrams (couplings, propagator spins, masses and widths are compared) are detected at generation level and then traded for a multiplicative factor affecting one chosen representative diagram. 6 For example, this leads to an improvement of a factor of two for the production of electroweak bosons via gluon fusion. Secondly, because of Furry's theorem, Feynman loop diagrams with an odd number of external photon legs can be exactly zero, and their presence slows down the integration because MadEvent must probe the corresponding channels many times before deciding that they can be safely discarded. To avoid this, MadLoop has been modified so as to detect and remove such Furry loops at the diagram generation step (with a notification to the user).
The loop-induced matrix element codes generated by MadLoop have been validated in several ways. First, we compared numerical results for specific kinematic configurations of many processes against MadLoop4 [10], whose implementation is completely independent of MG5aMC, and against MadLoop5 in a mode that does not use the polynomial decomposition of the integrand numerator (see appendix A.2). The list of processes cross-checked includes, among others, dd → hggg and all gluon fusion processes with up to three identical neutral massive bosons in the final state. To facilitate future comparisons, we report in appendix B the numerical result for a chosen phase-space point of the process gg → hhgg. Secondly, at the integrated level, we compared our prediction for the partial decay width Z → ggg to the result reported in ref. [34]. 7 Additional comparisons with the Higgs Effective Theory are presented in sect. 4. Finally, we tested Lorentz invariance, crossing symmetries and gauge invariance using Ward identities with the standard checks implemented via the command 'check' of the MG5aMC interface.

Phase-space integration and event generation
The phase-space integration and event generation are based on the MadEvent [26] algorithm, which we improved in the context of this work. At its core lies the diagram enhancement method, which separates the integration into a sum of integrals whose singularity structure is dictated by a single Feynman diagram topology. Each of these individual integrals, referred to as an integration channel, is then integrated using an appropriate phase-space parametrisation undoing its underlying singularity structure. For this reason, using the MadEvent phase-space mapping algorithms requires building tree-level topologies from the contributing loop diagrams. This is achieved by contracting the loop to a single vertex point. It is important to stress here that this mapping to tree topologies is only used to set up the phase-space parametrisation, and at no point to build any sort of numerical estimate of the corresponding loop-induced amplitude (the full exact loop-induced matrix element is used throughout our implementation).
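The single-diagram-enhancement idea can be illustrated on a one-dimensional toy integrand with two integrable singularities, each "channel" using a mapping that flattens one of them (a sketch only, not the actual MadEvent parametrisation). Because the mixture density here happens to be proportional to the integrand, the estimator has zero variance; in realistic cases it is only approximately flat.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy 'matrix element' with two integrable singularities (at x=0 and x=1)."""
    return 0.5 / np.sqrt(x) + 0.5 / np.sqrt(1.0 - x)

# Two channels, one per singular structure; each mapping x(u) flattens one peak,
# playing the role of the per-topology phase-space parametrisation.
N = 10_000
u = rng.random(N)
ch = rng.random(N) < 0.5                          # pick a channel with equal prior weight
x = np.where(ch, u**2, 1.0 - u**2)
q = 0.25 / np.sqrt(x) + 0.25 / np.sqrt(1.0 - x)   # total sampling density of the mixture
est = np.mean(f(x) / q)                           # multi-channel estimate of the integral (= 2)
```

Neither channel alone could sample both peaks efficiently; the channel weights play the role of the relative channel cross-sections determined in the survey step described below.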
A MadEvent run involves two steps. The first one is referred to as the survey and consists in computing the cross-section for each integration channel to a target accuracy of 5%. Using the information on relative cross-sections and efficiencies provided by the survey, MadEvent proceeds with a second step, referred to as the refine, where event generation takes place. The time necessary to run the refine step is approximately proportional to the requested number of unweighted events, whilst that of the survey is independent of this number.
We have implemented a series of improvements to MadEvent, the most important of which is a dynamical importance sampling for the Monte-Carlo over helicity configurations (i.e. the frequency with which a particular helicity configuration is probed is dynamically adjusted to its relative contribution). Notice however that, as is the case for standard adaptive Monte-Carlo [35], we do not account for any correlation between the kinematic variables of integration and the sampling distribution of helicity configurations.

Table 1. Cumulated CPU-hours necessary for the generation of 10k unweighted events for various loop-induced processes. This corresponds to a Monte-Carlo accuracy on the cross-section typically better than a percent. The numbers in parentheses specify the number of polarised (even when summing exactly over helicity configurations) phase-space points for which the matrix elements needed to be evaluated. The 'TIR' column reports only the fastest timing between the PJFry and IREGI implementations of TIR, and the timing is suffixed by a star (*) when the latter is faster. For each process, we highlight the fastest of all approaches by reporting it in bold font.

The gain obtained thanks to this new operational mode is made explicit in table 1. In that table, and for a small set of loop-induced processes, we present the cumulated CPU-hours 8 necessary for the survey step as well as for the refine to 10k unweighted events. We also report the total number of phase-space points for which the matrix elements needed to be evaluated during the integration. 9 We warn the reader that the quantitative results reported here must not be interpreted too literally, given that the speed of the node assigned by the cluster can vary from one run to the other. We compare timings for an integration using a Monte-Carlo sampling over helicity configurations with those using an explicit sum over them. In the latter case, we further compare three different reduction methods: OPP as implemented in CutTools, and TIR as implemented in PJFry or IREGI.

8 This is equivalent to how long the integration would have taken had it been run sequentially on a single CPU. 9 Notice that for the processes pp → hj and pp → hjj, the distribution of phase-space points probed across the different partonic subprocesses is not the same for the survey and refine steps, or between the two different integration techniques. This explains, for example, the difference in timing per phase-space point between the survey and refine steps.

We find that when summing exactly over helicity configurations for each phase-space
points, TIR is advantageous thanks to the recycling of tensorial coefficients across these configurations (see sect. 2.1). However, for processes of larger multiplicity (pp → hjj) or with many helicity configurations (gg → ZZ), this is not sufficient to overcome the gain obtained from sampling over helicities, which considerably reduces the number of polarised phase-space points that need to be probed to reach a given accuracy. Given that the typical time for the evaluation of a 2 → 4 loop-induced matrix element at a single phase-space point can already reach several seconds (see table 14 in appendix B), it is of paramount importance to be able to scale the MadEvent parallelisation independently of the number of integration channels. In this way, even though the cumulated sequential CPU-hours necessary for the computation remain constant, the user time spent for the integration of a process can be reduced proportionally to the number of cores available. Although Monte-Carlo integration methods are intrinsically trivially parallel, adaptive techniques, where the phase-space probing distribution is improved over time, break the mutual independence of two successive iterations. The original approach of MadEvent implements the simplest parallelisation in this context, which consists in running each integration channel as an independent separate job. However, this is not sufficient for loop-induced processes, which exhibit a small number of time-consuming channels. To push parallelisation further, we have therefore modified the steering of MadEvent so as to submit multiple jobs to the cluster for each iteration. In between iterations, the phase-space sampling grids and cross-section results from each job are combined to build the input of the next iteration, which is again split into a series of new independent jobs.
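The combination step between iterations can be sketched as a simple accumulator merge (our simplification: each job reports only the sum of weights, sum of squared weights and number of points; the real code also merges the adaptive sampling grids at this stage):

```python
import math

def combine(jobs):
    """Merge per-job accumulators (sum_w, sum_w2, n_points) into one
    cross-section estimate and its Monte-Carlo error."""
    sw = sum(j[0] for j in jobs)
    sw2 = sum(j[1] for j in jobs)
    n = sum(j[2] for j in jobs)
    mean = sw / n
    variance = max(sw2 / n - mean * mean, 0.0)
    return mean, math.sqrt(variance / n)
```

Because the merge only needs these sums, the points of one iteration can be split across arbitrarily many jobs without changing the combined result.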
Finally, we have also improved event generation. In MadEvent, the unweighting operation follows each iteration; if it did not generate enough unweighted events, a new iteration, set up to probe twice as many phase-space points, is submitted, and the events already generated are simply discarded. We improved the unweighting step by allowing events generated in any of the previous iterations to be combined. One caveat lies in the fact that if the MC sampling grids improve significantly from one iteration to the next, this procedure does not take advantage of the resulting increase in unweighting efficiency. We remedy this problem by estimating the remaining computing time for each of the two strategies (discarding or keeping events from previous iterations) and choosing the more efficient one.
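The choice between the two strategies can be sketched with a back-of-the-envelope time estimate (hypothetical helper names; the actual decision in MadEvent may account for further overheads):

```python
def remaining_time(n_needed, n_have, efficiency, t_per_point):
    """Estimated time to complete an event sample at the given unweighting efficiency."""
    return max(n_needed - n_have, 0) / efficiency * t_per_point

def choose_strategy(n_needed, n_old, eff_old, eff_new, t_per_point):
    """Keep events from previous iterations (at the old, lower unweighting
    efficiency) or discard them and restart with the improved grids."""
    t_keep = remaining_time(n_needed, n_old, eff_old, t_per_point)
    t_discard = remaining_time(n_needed, 0, eff_new, t_per_point)
    return "keep" if t_keep <= t_discard else "discard"
```

Intuitively, discarding pays off when the grids improved a lot and few events had been accumulated; keeping wins when the sample is nearly complete.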
Using a still-private implementation of an interface between MadLoop and SHERPA, we carried out a detailed comparison at the integrated level for the loop-induced process gg → ZZ, including both the triangle (featuring an s-channel Higgs) and box topologies. In particular, we compared at the per-mil accuracy the cross-sections for each individual squared topology as well as for their interference only. We found perfect agreement between the two codes.

Results in the Standard Model
In this section, we present results for inclusive cross-sections of loop-induced processes within the Standard Model, whose parameters are set to the values listed in table 2.

We considered all hadronic loop-induced processes in the SM with up to three particles in the final state, featuring one, two or three heavy bosons and/or photons, as presented in tables 4, 5 and 6 respectively. The seemingly missing processes are all zero, either because of Furry's theorem (as in gg → ha for instance, where all fermionic loops with clockwise and anti-clockwise flow cancel pair-wise) or because of the Landau-Yang theorem [37,38], which forbids a massive vector particle from decaying into two identical massless ones, as in gg → Z. In addition, we present in table 7 cross-sections obtained for a selected list of processes with four final-state particles, as well as loop-induced processes with non-hadronic initial states.
For each process, we generated a sample of 10k unweighted events, yielding a Monte-Carlo accuracy on the inclusive cross-section of at least 1% but typically better. The central factorisation and renormalisation scales are set dynamically to half the sum of all final-state transverse energies in the case of scattering processes, and statically to the decaying particle mass in the case of decay processes. Processes with at least one external-state Higgs accompanied by only jets and/or photons do not receive any tree-level contributions even in the presence of quarks. For these processes we considered the contributions from the gluon and all four massless quark flavours in the proton and jet definitions, denoted by p and j. For all other processes, we restricted ourselves to only gluons in the external states, since the loop corrections to the corresponding processes with external quarks are formally NNLO. Notice that, unless the loop amplitudes are finite, the virtual-virtual contribution of NNLO corrections can, in principle, not be computed with our implementation, because loop integrals are evaluated only up to $\mathcal{O}(\epsilon^0)$ in dimensional regularisation. There is however a proposal [39] for a formalism avoiding the computation of the $\mathcal{O}(\epsilon)$ terms of one-loop amplitudes in NNLO computations by capturing these terms with the one-loop insertion operator instead.
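In code, the dynamical scale described above amounts to the following (a minimal sketch; particles are represented as hypothetical (px, py, m) tuples):

```python
import math

def transverse_energy(px, py, m):
    """E_T = sqrt(m^2 + pT^2) of a final-state particle."""
    return math.sqrt(m * m + px * px + py * py)

def central_scale(final_state):
    """mu_R = mu_F = half the sum of final-state transverse energies,
    the dynamical scale choice described in the text."""
    return 0.5 * sum(transverse_energy(px, py, m) for (px, py, m) in final_state)

# Example: pp -> h with the Higgs at rest in the transverse plane gives mu = m_H / 2
print(central_scale([(0.0, 0.0, 125.0)]))  # 62.5
```

For decay processes the scale is instead fixed statically to the mass of the decaying particle.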
For all processes with coloured initial states, we used the MSTW 2008 PDF set [40] (with name 'MSTW2008lo68cl_nf4'). For each process, we indicate the maximum scale variation, denoted by $\Delta_\mu$, obtained by independently multiplying the scales $\mu_r$ and $\mu_f$ by the customary factors one-half, one and two. We also show the PDF uncertainty $\Delta_{\rm PDF}$ obtained from the corresponding MSTW error sets. Both of these quantities have been computed from a single run with the central PDF set and scales, using the exact reweighting approach implemented in the MG5aMC module SysCalc [41].
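The scale-variation envelope can be sketched as follows (a hedged illustration: `xsec` is a hypothetical callable returning the reweighted cross-section for given scale factors, as SysCalc provides from a single run):

```python
from itertools import product

def max_scale_variation(xsec):
    """Largest relative deviation from the central cross-section when mu_R and
    mu_F are varied independently by the customary factors 1/2, 1 and 2."""
    central = xsec(1.0, 1.0)
    return max(abs(xsec(kr, kf) - central) / central
               for kr, kf in product((0.5, 1.0, 2.0), repeat=2))
```

The PDF uncertainty is obtained analogously, by reweighting the same events with each member of the error set instead of rescaling the scales.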
We applied the cuts listed in table 3 to all processes when applicable.
Constraint                        Comment
$p_{T,j} > 20$ GeV                j ≡ gluons and massless quarks
$p_{T,a}$, $p_{T,l} > 10$ GeV     a ≡ photon, l ≡ any lepton

Table 3. General set of cuts applied (when applicable) to all processes of tables 4-8.
In addition to these cuts, some processes are subject to more specific changes in the simulation setup:
• Cut on heavy boson transverse momenta: $p_{T,V} > 1$ GeV. The fermionic four-point loop integral with massive identical final-state vectors V features an integrable singularity at $p_{T,V} \to 0$, which we regulated with a technical cut at 1 GeV, akin to what is done in the latest version of the code MCFM [42].
• Some of the diagrams contributing to this process can have up to four loop propagators on-shell 10 , and some of the resulting four-point scalar loop integrals are unstable for kinematic configurations with a total invariant mass slightly above twice the top mass. To avoid this issue, we considered here a non-zero top quark width within the complex mass scheme [43,44], but kept the weak boson widths set to zero, since these bosons are on-shell in the external states.
• Contrary to the two other leptonic scattering processes presented in this section, this one does not involve fermionic loops but genuine weak loops. We stress that the complex mass scheme would be the optimal approach for accounting for finite-width effects, since it yields gauge-invariant results. 11
• This process is extremely rare in the SM for three reasons: it only occurs via a box loop with two W-boson propagators (see Fig. 1), and it is suppressed by the square of the CKM matrix off-diagonal elements as well as by the bottom quark mass because of the GIM mechanism. Observing such final states would therefore be a clear indication of new physics. We chose the following values for the additional parameters entering this computation: $\Gamma_W = 0$ and the CKM matrix set to non-unity, expressed in terms of the Wolfenstein parametrisation [46], with λ = 0.2253, A = 0.808, ρ = 0.132 and η = 0.341.
• Decay processes (g.x): Processes g.1, g.4 and g.7 are computed completely inclusively. For processes g.2, g.3, g.5 and g.7, all pairs of identical final-state objects are restricted to have an invariant mass $m_{jj/aa} > 10$ GeV, and all decay products must have an angular separation $\Delta R_{j/a,j/a} > 0.4$.
In tables 4-8, we denote by a dagger (†) processes whose inclusive cross-section has never been published and is, to the best of our knowledge, reported here for the first time. We prefix by a star (*) processes which are not readily available in the main public simulation tools MCFM [42], VBFNLO [47] or HPAIR [48].

A close-up on Higgs production
The purpose of this section is to illustrate the capabilities of MG5aMC in the context of loop-induced processes and to present additional validation material. All results in this section are for the LHC at 13 TeV, using generation-level cuts and SM parameters identical to those of sect. 3. We focus here on Higgs production via gluon fusion and compare it to the Effective Field Theory (EFT) approach. This theory approximates the top quark loop by an effective operator built from the Higgs field $H$ and the gluon field strength tensor $G^a_{\mu\nu}$, multiplied by a Wilson coefficient $C$ (known up to N4LO) [68][69][70]. This approximation was recently used in the context of the computation of the Higgs cross-section at N3LO QCD accuracy, exhibiting a scale uncertainty of only 2% [6]. At this level of precision it is important to consider the corrections to the EFT approach, by including effects due to the finite top quark mass [71][72][73] as well as the interference between top and bottom quark loops [74,75].
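In one common normalisation (our transcription; conventions for the Wilson coefficient vary between references), the effective interaction reads:

```latex
\mathcal{L}_{\rm EFT} \;\supset\; -\frac{C}{4}\, H \, G^{a}_{\mu\nu} G^{a,\mu\nu},
\qquad
C \;=\; \frac{\alpha_s}{3\pi v} \,+\, \mathcal{O}(\alpha_s^2),
```

with $v$ the Higgs vacuum expectation value and the leading-order value of $C$ shown for illustration.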
The importance of these contributions is already significant at leading-order accuracy. In table 9, we compare the LO cross-section in the Effective Field Theory approach (EFT) 12 with the exact SM loop-induced process in two scenarios: first with a massive bottom quark (labelled 'LI') and then with a massless bottom quark (labelled 'NoB'), where the contribution of the bottom quark loop vanishes (both when interfering with the top quark loop and when squared against itself).

Table 9. Comparison of the cross-section for loop-induced Higgs production computed in different setups. The first column reports the prediction from the Effective Field Theory (EFT) limit using the 'heft' model [76]. The cross-sections of the second column (LI) are computed by directly integrating the loop-induced diagrams, hence keeping all mass effects. The setup of the last column (NoB) is identical to that of the second one, except that the contribution from bottom quark loops is removed. We do not include the vector boson fusion contribution in the cross-sections reported for the process pp → Hjj.

For the zero-jet multiplicity, we have perfect agreement between the 'NoB' case and the 'EFT' one, because the effective field theory is valid when $\sqrt{\hat{s}} = m_H < 2m_t$. 13 In the presence of radiation, the total energy of the event can lie outside
the validity range of the EFT theory and this explains the difference observed between these two schemes. Fig. 2 (left part) shows the Higgs p T distribution for various computational setups and it clearly demonstrates that the EFT breaks down for large Higgs transverse momenta.
Another effect emphasised in table 9 is the importance of the interference of the top and bottom quark loops. For the zero-jet multiplicity case, this interference is negative because of the absorptive part of the bottom quark loop, which is below threshold, unlike the purely real top quark loop. In the presence of additional jets, the interference can be either positive or negative, leading to a milder effect on the cross-section but affecting the shape of the distributions. In Fig. 2 (right part), we show the interference contribution as a function of the Higgs transverse momentum (for the process pp → Hj). At low p_T, we recover the behaviour of the 0-jet multiplicity case, featuring a negative interference effect which amounts to 10% of the total cross-section. Conversely, at high p_T, the interference of the bottom and top quark loop diagrams is constructive and its relative contribution tends to a constant equal to the ratio of the corresponding Yukawa couplings: y_b/(y_b + y_t) = 2.7%. The left panel of fig. 2 features a fourth prediction labelled 'RWGT', short for reweighting. This prediction corresponds to the differential cross-section predicted by the Effective Field Theory where the weight of each event has been rescaled by the ratio |M_LI^(h)|^2 / |M_EFT^(h)|^2, with M_LI^(h) being the loop-induced matrix element for the helicity configuration h, computed for the kinematic configuration of the event considered. With infinite statistics, the prediction based on reweighting must be identical to the one obtained from a direct integration of the loop-induced matrix element. We check that the ratio between these two predictions (shown in the bottom inset) is always compatible with one, given the statistical uncertainty whose one-sigma variation is shown as the yellow band. This represents a non-trivial validation of the MadEvent phase-space integration for loop-induced processes.
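The event-by-event reweighting described above can be sketched as follows. Here `me_li` and `me_eft` are hypothetical stand-ins for the per-helicity squared matrix elements (toy constants, not actual amplitudes); only the structure of the procedure is meant to be illustrative:

```python
def reweight_events(events, me_li, me_eft):
    """Rescale each EFT event weight by |M_LI(p, h)|^2 / |M_EFT(p, h)|^2,
    evaluated for the kinematics p and helicity configuration h stored
    with the event (a sketch of the 'RWGT' procedure)."""
    return [(p, h, w * me_li(p, h) / me_eft(p, h)) for p, h, w in events]

# Toy stand-ins: real matrix elements depend non-trivially on p and h.
me_eft = lambda p, h: 1.0
me_li = lambda p, h: 0.8 if h == 0 else 1.2

events = [((100.0,), 0, 2.0), ((250.0,), 1, 2.0)]   # (kinematics, helicity, weight)
rwgt = reweight_events(events, me_li, me_eft)       # weights become 1.6 and 2.4
```

Keeping the helicity dependence in the ratio is what makes the reweighted sample agree with the direct loop-induced integration beyond inclusive quantities.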
As is the case for tree-level computations, MG5aMC can perform matched and merged computations in the context of loop-induced processes. The MLM merging (either kT-MLM or shower-kT [77]) is available and fully automatic when linked to Pythia6 [78].
As an example of such a computation, we present plots for Higgs production merged with up to two additional jets in the kT-MLM scheme. In Fig. 3, we present the Nth differential jet rate (DJR_N), which corresponds to the scale at which the kT clustering algorithm switches from an N-jet to an (N-1)-jet description, that is, the scale associated with the Nth emission. This observable is most sensitive to the matching/merging since it is directly related to the variable disentangling matrix-element emissions from parton-shower ones. The parton shower is not allowed to radiate in the phase-space region above the merging scale (set here to Q_match = 50 GeV as in [79]) since this region is already populated by partonic events generated using matrix elements of higher multiplicity.14 Therefore, having a smooth transition at Q_match for the sum of samples (distribution in black) is an important check of the quality of the matching/merging procedure. For DJR_1, a small discontinuity can be observed, originating from the fact that the bottom and top quark interference effect is not accounted for by the parton shower.
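The differential jet rates can be illustrated with a toy reimplementation of the longitudinally invariant kT algorithm (this is a sketch, not the MG5aMC/Pythia code; pseudojets are given as (pT, y, phi) triplets and recombination is a crude pT-weighted average):

```python
import math

def kt_djr(partons, R=1.0):
    """Cluster (pt, y, phi) pseudojets with the inclusive kT algorithm and
    return [DJR_1, DJR_2, ...], where DJR_N is sqrt(d) at the merging that
    switches the event from an N-jet to an (N-1)-jet description."""
    jets = list(partons)
    scales = []                      # recorded as the multiplicity drops
    while jets:
        cand = [(p[0] ** 2, i, -1) for i, p in enumerate(jets)]   # d_iB = kT_i^2
        for i in range(len(jets)):
            for j in range(i + 1, len(jets)):
                dphi = abs(jets[i][2] - jets[j][2])
                dphi = min(dphi, 2 * math.pi - dphi)
                dr2 = (jets[i][1] - jets[j][1]) ** 2 + dphi ** 2
                cand.append((min(jets[i][0], jets[j][0]) ** 2 * dr2 / R ** 2, i, j))
        d, i, j = min(cand)
        scales.append(math.sqrt(d))
        if j < 0:
            jets.pop(i)              # merged with the beam
        else:                        # crude pt-weighted recombination
            (pti, yi, fi), (ptj, yj, fj) = jets[i], jets[j]
            pt = pti + ptj
            jets.pop(j); jets.pop(i)
            jets.append((pt, (pti * yi + ptj * yj) / pt, (pti * fi + ptj * fj) / pt))
    return list(reversed(scales))    # element 0 is DJR_1, etc.

# A hard parton plus a soft one: the soft emission sets DJR_2.
djr = kt_djr([(100.0, 0.0, 0.0), (10.0, 1.0, 0.5)])
```

In a merged sample, histogramming DJR_N for events of each multiplicity around Q_match is exactly the smoothness check discussed above.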
Finally, fig. 4 shows the normalised distribution of the Higgs transverse momentum obtained from the same matched and merged sample. This plot also presents the transverse momentum of the hardest photon when the Higgs decays into two photons (simulated in the narrow-width approximation). Notice that the discrepancy between the EFT and the loop-induced prediction is entirely due to differences in the production mechanism, since the (loop-induced) matrix element for any 2-body decay is constant over phase-space.
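Because the 2-body decay matrix element is constant over phase-space, simulating the decay in the narrow-width approximation amounts to drawing an isotropic direction in the mother's rest frame and boosting back to the lab frame. A minimal sketch for massless daughters (no spin correlations, illustrative only):

```python
import math, random

def decay_two_body(P, m):
    """Decay a momentum P = (E, px, py, pz) of invariant mass m into two
    massless daughters: isotropic in the rest frame, then boosted back to
    the lab frame (minimal narrow-width sketch, no spin correlations)."""
    cos_t = random.uniform(-1.0, 1.0)
    sin_t = math.sqrt(1.0 - cos_t ** 2)
    phi = random.uniform(0.0, 2.0 * math.pi)
    e = m / 2.0                                   # each daughter carries m/2
    q = (e, e * sin_t * math.cos(phi), e * sin_t * math.sin(phi), e * cos_t)

    def boost(k):                                 # boost along P's velocity
        E, px, py, pz = P
        bx, by, bz = px / E, py / E, pz / E
        b2 = bx * bx + by * by + bz * bz
        gamma = 1.0 / math.sqrt(1.0 - b2)
        bk = bx * k[1] + by * k[2] + bz * k[3]
        g2 = (gamma - 1.0) / b2 if b2 > 0.0 else 0.0
        return (gamma * (k[0] + bk),
                k[1] + (g2 * bk + gamma * k[0]) * bx,
                k[2] + (g2 * bk + gamma * k[0]) * by,
                k[3] + (g2 * bk + gamma * k[0]) * bz)

    return boost(q), boost((e, -q[1], -q[2], -q[3]))
```

By construction the daughters are back-to-back in the rest frame and their sum reproduces the mother's four-momentum, so only the production-side kinematics can distinguish the EFT from the loop-induced prediction.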

Loop-induced processes within BSM models: the 2HDM example
The simulation of loop-induced processes with MG5aMC is not limited to the Standard Model but extends to a large class of BSM models. The framework supports models following the NLO UFO format [80][81][82], which can be automatically created by FeynRules/NLOCT [83,84]. Since loop-induced processes are finite, UV counterterms need not be specified in the NLO UFO model, therefore allowing the computation of loop-induced processes even in models for which automatic UV renormalisation is currently not possible. The computation of the R_2 Feynman rules is however still necessary (see sect. 2.1).
14 There are two exceptions to this rule: first, for the highest multiplicity sample, the shower can radiate up to the scale of the softest matrix-element jet; secondly, a jet can be excluded from the matching if an emission is due to a diagram which does not have a counterpart in the parton-shower description. In the latter case, the event is seen by the matching procedure as having one jet fewer, since the corresponding jet emission is excluded. This explains why the dashed blue curve (corresponding to the single-jet multiplicity) does not vanish above the matching scale in the DJR_2 plot (left panel of fig. 3).

Table 12. Inclusive cross sections (in fb) for the production of a pair of 2HDM charged scalars at the 14 TeV LHC for the three benchmark parameter sets introduced in table 10. The first uncertainty (in percent) refers to scale variations by factors of 1/2, 1 and 2, while the second one refers to the PDF uncertainty. Both are computed by SysCalc at no additional computational cost.
In this section, we present results in the two-Higgs-doublet model (2HDM) [85] for the three benchmark parameter sets presented in table 10 and introduced in refs. [27,86]. Ref. [27] computes the cross-section for the associated production of a Z boson with all the 2HDM scalars using a reweighting approach. As a validation of our method, we have reproduced the computation presented in table 6 of ref. [27]. The only two differences with respect to the parameters used in sect. 3 are the c.o.m. energy, set to 14 TeV, and the renormalisation and factorisation scales, set to the partonic invariant mass. Our results are reported in table 11 and show perfect agreement.
In table 12, we present results for the production of a pair of charged Higgs bosons via gluon fusion (see Fig. 5) and compare it to the tree-level channel with initial-state quarks. Even though the loop-induced contribution is formally NNLO, it is numerically important. Moreover, the loop-induced contribution differs significantly in its kinematic distributions, as illustrated in Fig. 6, where we show the invariant mass of the system and the transverse momentum distribution for each partonic subprocess. For all benchmark points, the distributions are softer for the loop-induced production channel because of the differences in the relevant parton density functions.

Conclusion
We have presented a new public extension of MadGraph5_aMC@NLO capable of automatically simulating loop-induced processes within both the Standard Model and BSM theories. Our implementation does not rely on any predefined set of processes and instead generates on demand an optimised code tailored to the loop-induced process of interest. This is achieved by combining the capabilities of MadLoop for the generation of loop-induced matrix elements with those of MadEvent for phase-space integration. Particular emphasis is put on run time, and we have optimised both aforementioned codes for the purpose of loop-induced processes, whose matrix elements are considerably slower to evaluate than tree-level ones. We have first determined that, within our framework, the most efficient approach is to employ the OPP reduction method for a given helicity configuration, which we pick for each phase-space point according to a probability set by importance sampling. A second improvement is the computation of the loop matrix elements by projecting their colour structures onto the colour-flow basis. This also provides the associated leading-colour information necessary for event generation and required by parton-shower programs. Finally, we have restructured the MadEvent code so as to allow for an arbitrary parallelisation of the computation.
Thanks to these developments, it is possible to compute all 2 → 2 loop-induced processes on a laptop and all 2 → 3 ones (and most 2 → 4) on a small-sized cluster. As proof of this statement, we have computed a large list of cross-sections in the Standard Model, including all loop-induced processes up to 2 → 3 and a couple of 2 → 4 processes, as well as some decay and non-hadronic scattering processes. The implementation was carefully validated against results in the literature for both integrated distributions and local phase-space points.
Our approach is fully differential and compatible with standard matching and merging procedures such as the MLM approach supported by MG5aMC; this is emphasised by the phenomenological results and distributions presented for the case of Higgs production. Finally, the flexibility and generality of MG5aMC regarding the support of BSM models is maintained, and we illustrated this with results for various loop-induced processes within the two-Higgs-doublet model. This work fills a gap in the capabilities of modern simulation tools and opens the door to many applications whose exploration will no longer be hindered by the technical difficulties of simulating loop-induced processes.

A Manual
In this appendix, we present minimal information on how to generate events for a given loop-induced process. We then proceed with the description of the new options introduced in MG5aMC version 2.3.0 and their particular application in the context of loop-induced, tree-level and NLO computations.

A.1 Basic commands
The steps needed to generate loop-induced processes are very similar to those needed for a tree-level or an NLO computation within the MG5aMC framework. For the reader familiar with that environment, it should be enough to stress that, for processes without any Born diagram, the 'generate' command suffixed with the NLO syntax (i.e. adding '[QCD]' at the end of the process definition) will automatically consider the corresponding loop-induced process. From now on, we adopt a pedantic approach and do not assume any prior knowledge of the tool.
The MG5aMC framework is a meta-code generating optimised numerical programs for the simulation of user-defined scattering/decay processes [18]. The code offers an interactive interface, started with the script <MG_install_path>/bin/mg5_aMC, where the user enters commands specifying the processes and runs of interest. All commands and options are documented within the code, and detailed information can be obtained by typing 'help COMMAND'. All available commands are listed by typing 'help'. Finally, a built-in tutorial can be started with the command 'tutorial'. We now proceed to describe the succession of the four commands characteristic of the steering of a simulation.
• import model <MODEL_NAME>: This command imports a new model. In the context of loop-induced processes, only NLO UFO models are supported. Such models can be generated via FeynRules [83] thanks to the NLOCT package [84].

• generate <PROCESS> [QCD]: This command defines the process of interest; for a process without Born diagrams, the '[QCD]' suffix automatically selects the corresponding loop-induced process. example: generate g g > h h [QCD]

• output <PATH>: This starts the generation, in the specified path, of the numerical code for the process of interest. example: output MY_NEW_RUN

• launch <PATH>: This command starts the actual computation of the cross-section and the generation of events. The code first asks two preliminary questions. First, the list of external tools (Pythia6 [78], Delphes [87], MadSpin [88], ...) that one wishes to use in this simulation can be turned on. Then, the user is given the option to edit various configuration files, including the model parameters. This part is identical to the case of leading-order computations and we refer to [18] for more details on this subject. Once the simulation is completed, various outputs are written in the folder '<PATH>/Events' and a convenient html summary is generated in '<PATH>/crossx.html'.
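These steering commands can also be collected in a command file and passed to the interface in batch mode (e.g. './bin/mg5_aMC proc_card.txt'). A short sketch generating such a file; the model and process names are illustrative choices, not a prescription:

```python
# Write the characteristic steering commands to a command card that
# could then be fed to the mg5_aMC script in batch mode.
commands = "\n".join([
    "import model loop_sm",       # an NLO-capable SM model (illustrative)
    "generate g g > h h [QCD]",   # no Born diagrams -> loop-induced process
    "output MY_NEW_RUN",
    "launch MY_NEW_RUN",
]) + "\n"

with open("proc_card.txt", "w") as f:
    f.write(commands)
```

Batch steering of this kind is convenient for scans over model parameters, since the same card can be regenerated programmatically.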

A.2 Description of new MadGraph5_aMC@NLO options
In this new version of MG5aMC (version 2.3.0), we have introduced several additional options related to the simulation of loop-induced processes. The following options can be modified in the file 'input/mg5_configuration.txt' or directly via the interface using the 'set' command.
• cluster_size [default=100]: This parameter scales the requested number of jobs in proportion to the size of the cluster. By increasing the cluster size, MG5aMC splits iterations into more jobs, each probing fewer phase-space points. As of now, this parameter applies to the simulation of loop-induced processes only.
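The effect of this option can be illustrated with a toy splitting rule (the actual MG5aMC job-splitting logic is more involved; the function name and formula below are assumptions made for illustration):

```python
def split_iteration(n_points, n_jobs_default, cluster_size, default_size=100):
    """Toy model of the cluster_size option: the number of submitted jobs
    grows linearly with the declared cluster size, so each job probes
    proportionally fewer phase-space points."""
    n_jobs = max(1, n_jobs_default * cluster_size // default_size)
    points_per_job = -(-n_points // n_jobs)     # ceiling division
    return n_jobs, points_per_job

# Doubling the declared cluster size doubles the jobs and halves their load.
small = split_iteration(100000, 10, 100)   # (10, 10000)
large = split_iteration(100000, 10, 200)   # (20, 5000)
```

The total number of probed phase-space points per iteration is unchanged; only the granularity of the submission is.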
• cluster_local_path [default=None]: This parameter avoids either transferring PDF sets to the cluster nodes or reading them from a central disk. The path should point to a (node-specific) directory containing the required PDF sets (either those from LHAPDF or the built-in ones). A typical usage is to set this path to a local directory mirrored via cvmfs.
• max_npoint_for_channel [default=3]: This parameter controls which topologies enter the multi-channeling used for integrating loop-induced processes. For instance, when set to 4, all loop diagrams with 4 or fewer loop propagators seed their own integration channel. This means that the tree topologies obtained by shrinking loop diagrams up to boxes are considered in the multi-channeling and therefore integrated separately. In general, we do not observe any significant gain (and sometimes a detrimental effect) when setting this parameter to 4 or larger (except in the case of gg → ZZ). We stress that, for this parameter to take effect, it must be modified prior to the generation of the source code of the process considered.
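The selection rule behind this option can be sketched as follows; the diagram representation (a dict with an `n_loop_props` field) is a hypothetical data structure chosen for illustration:

```python
def multichannel_seeds(loop_diagrams, max_npoint_for_channel=3):
    """Keep the diagrams whose loop has at most max_npoint_for_channel
    propagators: each of them seeds its own integration channel, built on
    the tree topology obtained by shrinking the loop to a point
    (illustrative sketch of the option's selection rule)."""
    return [d for d in loop_diagrams
            if d["n_loop_props"] <= max_npoint_for_channel]

diagrams = [{"id": 1, "n_loop_props": 3},   # triangle -> always a channel
            {"id": 2, "n_loop_props": 4},   # box      -> only if option >= 4
            {"id": 3, "n_loop_props": 5}]   # pentagon -> only if option >= 5
default_channels = multichannel_seeds(diagrams)                            # 1 seed
box_channels = multichannel_seeds(diagrams, max_npoint_for_channel=4)      # 2 seeds
```

Raising the threshold multiplies the number of channels, which is why a larger value rarely pays off.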
• loop_colour_flows [default=dynamical]: The computation of partial colour subamplitudes (i.e. amplitudes for fixed colour flows) is turned off by default for NLO virtual matrix elements and turned on for loop-induced matrix elements. This is because in the former case it comes at the price of giving up loop reduction at the squared-amplitude level, hence slowing down MadLoop's execution speed, since the number of OPP reductions is no longer independent of the number of contributing helicity configurations. This option can however be turned on (before MadLoop writes out the source code for the process), since colour subamplitudes can be necessary for certain applications, such as NLO event generation within the context of a controlled colour expansion [89] and/or Monte-Carlo over colours. Partial colour amplitudes (called JAMP in the code) can then be accessed and combined as needed by modifying the user-defined subroutine '*_COMPUTE_COLOR_FLOWS_DERIVED_QUANTITIES' present in the source code file 'compute_color_flows.f'.

On top of the parameters above, which control the way the code is generated and then handled by the cluster, we also introduced new entries in 'run_card.dat', which can be accessed and modified at any time, including after code generation.
• nhel: This parameter was already defined in previous versions of the code but its effect changed in MG5aMC v2.3. 'nhel' can now only be set to '0' or '1', in which case all, respectively exactly one, helicity configuration(s) are/is considered for each phase-space point. The difference with previous versions is that the helicity configuration picked for each phase-space point is chosen according to a dynamical importance-sampling Monte-Carlo method. The default value of this parameter is '1' for loop-induced processes and '0' for tree-level computations (this parameter is not available for NLO computations). We stress that, when setting this parameter to '0' for loop-induced processes, it is more efficient to select the tensor integral reduction methods PJFry or IREGI as preferred by modifying the parameter 'MLReductionLib' of the 'MadLoop_card.dat' configuration file.
• dynamical_scale_choice: When a dynamical scale is chosen, this parameter selects its functional form (cf. table 13 for the list of predefined functional forms).
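As an illustration, one common form, the sum of the transverse masses of the final-state particles divided by two, can be computed as follows (an illustrative reimplementation, not the code's internal one; particles are given as (px, py, m) triplets):

```python
import math

def sum_mt_over_2(final_state):
    """Dynamical scale 'sum of transverse masses divided by two':
    mu = 1/2 * sum_i sqrt(m_i^2 + pT_i^2) over the final-state particles,
    each given as (px, py, m)."""
    return 0.5 * sum(math.sqrt(m * m + px * px + py * py)
                     for px, py, m in final_state)

# A Higgs-like particle at rest in the transverse plane plus a massless
# recoil with pT = 5 (illustrative units): mu = (125 + 5) / 2 = 65.
mu = sum_mt_over_2([(0.0, 0.0, 125.0), (3.0, 4.0, 0.0)])
```

Event-by-event scale choices of this type track the hardness of each configuration instead of fixing a single scale for the whole sample.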
• survey_splitting: For the phase-space integration of loop-induced processes, this parameter defines the number of nodes assigned to the integration of one given channel during the survey/gridpack generation. Notice that the parallelisation of the second stage of the simulation (which completes event generation and is referred to as the refine step) is controlled by the parameter cluster_size instead.
• job_strategy: This parameter controls the parallelisation strategy for high-multiplicity tree-level simulations (especially important for an efficient handling of the highest multiplicity sample in the context of an MLM-merging gridpack generation). The three possible integer values taken by this option are:

- '0': The original behaviour already adopted in previous versions, where each submitted job successively performs the integration of two channels.

- '1': This mode changes the behaviour for the highest multiplicity sample, where the associated jobs run a single integration channel. The number of jobs submitted in this case is therefore increased by a factor of two, hence reducing the running time for this sample by the same factor.

- '2': This mode uses, for the highest multiplicity sample, the same algorithm as the one introduced for loop-induced computations, and uses mode '1' for the next-to-highest multiplicity sample.
The default values of some of the 'run_card' parameters are now dynamically chosen depending on the process considered, so as to better reflect what is typically expected in that case. These parameters remain accessible and can be modified by the user; only their default value changes. The parameters whose default value can differ depending on the process considered are:

• energy of the beams: for electron-positron collisions, the default energy is set to 1 TeV (500 GeV per beam). The default collision energy for all other processes is 13 TeV.
• PDF type: if the initial-state particles of the process are quarks, gluons or photons, then the proton PDFs are used. Otherwise, this parameter is set to the fixed-energy scheme without PDFs.
• Monte-Carlo over helicity: Monte-Carlo over helicity is turned off for tree-level simulations and turned on for loop-induced ones. This option is not available for NLO computations.
• maxjetflavour/asrwgtflavour: set according to the number of quark flavours appearing in the initial states.
• cuts: for (LO) width computations, all cuts are turned off by default. Unlike previous versions of the code, the cuts defined in run_card.dat are now applied to the computation of the partial width as well.
• matching parameters: at LO, if all the processes considered differ only by their jet multiplicity, the MLM matching/merging scheme (ickkw=1) is turned on by default with a parton-level cut ('xqcut') of 30 GeV.
Finally, a new MadSpin option allows the simulation of arbitrary decays (including three-body and loop-induced decays), but without any spin correlation between production and decay and without Breit-Wigner smearing. The detailed description of this new functionality is deferred to future work; here we limit ourselves to mentioning that this feature can be used by adding the line 'set spinmode none' at the beginning of the MadSpin configuration file.

B Benchmark results for various loop-induced Higgs production processes
We start by presenting the numerical result for the evaluation of the matrix element of the loop-induced process gg → hhgg, chosen both for its complexity and for its importance as a background to double Higgs production in vector boson fusion. We report here the squared loop-induced matrix element, computed for a reference phase-space point with the SM parameters specified earlier, summed over all helicity and colour configurations.
|M_(LI)|^2 = 1.3033633142042775e-12

This particular computation is performed in quadruple precision, and stability tests show that all 17 double-precision digits are numerically stable. As for NLO computations, the standalone MadLoop output can be used for cross-checking an independent implementation of the calculation (see sect. A.1). We now turn to listing MadLoop performances in table 14. On top of the process gg → hhgg, we also consider gg → hh, gg → hhg and gg → hggg, so as to reflect the scaling of MadLoop timings with the multiplicity and the number of Feynman diagrams. The profiling presented for these processes, as well as for any other, can be automatically reproduced by running the command 'check profile <process_definition>' from the MG5aMC interactive interface. The timings indicated do not include any numerical stability test, nor do they account for the fraction of points for which it is necessary to use quadruple-precision arithmetic. The percentages in parentheses specify the fraction of the running time spent on the computation of the polynomial coefficients of the numerator integrand. The complementary time is entirely spent in the loop reduction algorithm. Table 14 illustrates various points discussed in the technical sect. 2.1, especially the fact that only the OPP loop reduction time scales with the number of helicities computed. We refrained from showing the timing using Golem95 reduction for the case where the matrix element is summed over helicity configurations, because the recycling of tensorial coefficients is not yet implemented for this tool. In general, even though TIR reduction allows for faster timings when summing over helicity configurations, these remain much larger than for the computation of a single helicity configuration, hence our default approach of using OPP reduction in conjunction with a Monte-Carlo over helicity configurations.
The number of topologies refers to the number of different sets of loop-propagator denominators present in the computation. In the computation of the virtual matrix element for NLO predictions, the integrand numerators of all loops sharing the same denominator topology can be added together before being reduced, hence greatly diminishing the number of necessary OPP reductions. This grouping of topologies is not applicable to loop-induced computations when using OPP reduction, but TIR can benefit from it since tensor integrals can be recycled across loops sharing the same denominator topology. It is interesting to note that the reduction time in TIR is almost the same for the processes gg → hhgg and gg → hggg, even though the latter has twice as many diagrams. This is mainly because the reduction time is dominated by the reduction of topologies with maximal tensorial rank (r_max = 6 here), of which there is the same number in these two processes.
Generation time, output code size and RAM usage show that MadLoop is lightweight and practical for the computation of loop-induced amplitudes with up to at least 5 external legs, and well suited for cross-checking other codes for more complicated processes.
The various reduction methods compared in table 14 have different ranges of applicability, which we summarise in table 15. MadLoop checks these constraints at runtime for each loop and for each available reduction tool before using the first one applicable. It is possible that a mixture of different reduction methods is used within a single loop matrix element computation. Notice that, because a certain class of hexagons is not supported, PJFry is by default limited to rank-5 loops. We have disabled that limitation for the present benchmark since the timing is unaffected. Also, IREGI is the only reduction tool capable of reducing loops whose tensorial rank exceeds the number of loop propagators by more than one unit.