Optimising simulations for diphoton production at hadron colliders using amplitude neural networks

Machine learning technology has the potential to dramatically optimise event generation and simulations. We continue to investigate the use of neural networks to approximate matrix elements for high-multiplicity scattering processes. We focus on the case of loop-induced diphoton production through gluon fusion and develop a realistic simulation method that can be applied to hadron collider observables. Neural networks are trained using the one-loop amplitudes implemented in the NJet C++ library and interfaced to the Sherpa Monte Carlo event generator where we perform a detailed study for $2\to3$ and $2\to4$ scattering problems. We also consider how the trained networks perform when varying the kinematic cuts effecting the phase space and the reliability of the neural network simulations.


Introduction
Phenomenological studies of high multiplicity final states at collider experiments present a substantial theoretical challenge and are increasingly important ingredients in experimental measurements. During the last 15 years, a dramatic improvement in computational algorithms for one-loop amplitudes has led to a number of highly automated codes capable of predictions at next-to-leading order (NLO) accuracy in the Standard Model (SM) [1][2][3][4][5].
These codes are based around numerical algorithms which bypass the growth in algebraic complexity that analytic approaches suffer from. The computational cost of these algorithms is however relatively high, resulting in huge commitments of CPU and personnel resources to obtain the necessary theoretical predictions for current experiments.
The interface of general one-loop amplitude codes into multi-purpose Monte Carlo (MC) event generators has resulted in a wide variety of simulation options which can offer the best possible theoretical accuracy. Methods that go beyond fixed-order perturbation theorysuch as parton shower matching, merging, and jet multiplicities -improve accuracy across important regions of phase space. However, these simulations add additional strain on the underlying amplitudes.
State-of-the-art tools make use of advanced phase-space mapping algorithms to improve the convergence of the multi-dimensional integration. General purpose MC event generators such as Sherpa [6,7], Pythia [8,9], Herwig [10][11][12], and MadGraph [4] often make use of the diagram structure of the underlying tree-level process to ensure an optimal distribution of points. Reusing tree-level distributions when generating virtual events is particularly effective at reducing the computational cost of using complicated one-loop amplitudes.
In this paper, we will consider a class of scattering processes that contribute to diphoton signals at hadron colliders. The process gg → γγ + n(g) is a good test case for machine learning (ML) technology, since it is loop induced and has relevant contribution from high multiplicity matrix elements. Using automated tools at NLO, full QCD corrections are known for pp → γγ + ≤ 3 jets [13][14][15]. There has been a flurry of recent activity around next-to-next-to-leading-order (NNLO) corrections to pp → γγ + j in which the complete leading colour corrections have been presented [16][17][18][19][20]. As such, 2 → 3 and 2 → 4 scattering for the gluon-initiated diphoton channel are now extremely relevant for future phenomenological studies.
ML is now extremely popular in high energy physics with a wealth of applications. The majority of problems concern classification and exploit the scalability of ML algorithms to large datasets. The fact that neural networks (NNs) offer a general function parameterisation that are also useful in regression problems is not new in particle physics either: for example, parton distribution functions (PDFs) produced by the NNPDF collaboration [21] have been used for many years by the LHC experiments.
Interpolation methods such as polynomial fits and interpolation grids have often been used to provide fast simulations and predictions of differential observables, particularly at NNLO where there is an intricate structure of infrared (IR) singularities [22][23][24][25][26]. These techniques are extremely powerful for problems with two or three hard scattering variables, but scale poorly at high multiplicity.
ML offers a possible solution to the poor scaling with larger number of variables. Various directions are currently being explored for different aspects of event generation and MC integration. Boosted decision trees (BDTs) and NNs have been applied to phase-space sampling and integration [27][28][29] and have shown to increase the speed at which functions can be integrated, as well as allow for the integration of functions for which traditional algorithms failed. More recently, sampling using normalising flows [30,31] has been attempted [32][33][34]. These have been shown to avoid the computational cost of calculating the gradient of the network itself, when determining the Jacobian, which previous NN approaches required.
Several works have focused on developing NN techniques for explicitly learning the cross section of a given process [57,58]; however, little research has been done on learning the matrix element itself for a given phase-space point and process. Ref. [59] took a parallelised BDT approach to learning the matrix element, and tested this on the loop-induced gg → ZZ process at leading order, demonstrating the potential for large speedups in matrix element calculations. An ensembled NN methodology was presented in Ref. [60] which divides the phase space into divergent and non-divergent regions, and further splits the former into sub-regions corresponding to different IR singular structures. This approach was tested on e + e − → ≤ 5 jets at both tree and one-loop level, to ensure robustness at high multiplicity, and was found to provide a good approximation of the cross section and various differential distributions.
Readers may like to refer to the living review of ML in particle physics for up-to-date information about the state-of-the-art [61].
Our paper is organised as follows. We begin by giving a brief review of the structure of the underlying perturbative amplitudes that we study. We then describe the computational setup in which we use the NJet amplitude library [62] to train an ensemble of NNs that are then interfaced to the Sherpa MC event generator [6,7]. We discuss a strategy for the generation of the training dataset, and a reweighting strategy to refine the distributions of events where the networks are used for matrix element generation. We then discuss results for gg → γγg and gg → γγgg at leading order (LO). In each case, we present a selection of differential observables and provide a detailed comparison with a conventional simulation. We end with a presentation of our conclusions and a discussion of possible future uses of these technologies.

Gluon-initiated diphoton amplitudes
We study amplitudes with two photons and many gluons which first appear at one-loop level in the SM. With conventional simulations relying on cheaper LO tree evaluations to optimise event generation for NLO one-loop contributions, these loop-induced processes present an interesting sector to test new approaches for phase-space integration. Compact analytic computations for gg → γγ and gg → γγg have been available for some time and offer extremely fast and stable evaluation. As a result, it is feasible to optimise event generation with the one-loop evaluation. For 2 → 4 scattering problems, only numerical codes are available and simulations can be extremely slow. It also not clear that analytic formula would be sufficiently compact to alleviate this situation even if they were available.
For this reason, we consider an alternative setup where the whole simulation uses a NN approximation of the amplitude. The loop-level amplitudes proceed through a fermion loop and have a colour decomposition in the trace basis as where g s is the strong coupling, g γγ = e 2 q Q 2 q is the combined coupling of the diphoton system to the fermion loop, S n−3 is the set of even non-cyclic permutations of {1, . . . , n−2}, t a are the fundamental SU(3) generators, q runs over active quark flavours with fractional quark charge Q q , and the colour trace function λ is defined as λ (a 1 , . . . , a n−2 ) = tr (t a 1 t a 2 . . . t a n−2 ) + (−1) n tr (t a 1 t a n−2 . . . t a 2 ) . (

2.2)
This yields (n − 3)!/2 primitive amplitudes A for n ≥ 5. For example, for n = 4 there is a single primitive amplitude. It is given by the diagrams 3) where a plain line indicates a sum over quark loop arrow directions. At one-loop, these amplitudes are also related to the fermion loop corrections to pure gluon scattering through permutations [63]. The ingredients for differential cross sections are the squared amplitudes summed over helicities, h, and colour, where the matrix C is a function of the number of colours, N c , obtained by squaring the colour basis elements and the index on the partial amplitudes, A, refers to the different permutations in the colour decomposition. The amplitudes in this article are taken from the NJet C++ library [62]. Here, there are different options: a general numerical setup using generalised unitarity and integrand reduction; and hard-coded analytic expressions for n = 4, 5. The n = 4 analytic expressions were taken from Ref. [64], while for n = 5 they were obtained directly from a finite field reconstruction [65] and are in agreement with known analytic formula [63,66]. By using a momentum twistor parameterisation of the external kinematics, cancellations in the rational coefficients of the special functions that lead to a manifestly finite representation are easily identified.
The numerical evaluation requires the sum of permutations of ordered primitive amplitudes. This is completely automated for arbitrary multiplicity, but evaluation times and numerical stability are increasingly difficult to control.
To study the growth of evaluation time with multiplicity, we evaluate the matrix element at 100 random phase-space points with each available technique and plot the mean times in Figure 1. We generate the phase-space points isotropically with the algorithm from Ref. [67]. While analytic methods are competitive at low multiplicity, we see they scale poorly and are unlikely to beat numerics at n ≥ 6. Numeric scaling is better, but these algorithms come with a high cost. Our NN approach provides a performant alternative, with significantly better scaling than either numerics or analytics.

Evaluation time (ms)
Numerical Analytical NN ensemble Figure 1: Matrix element typical CPU evaluation times for available methods -including NJet numerical evaluations, NJet analytical evaluations, and inference on a NN ensemble as described in Section 3 -against the number of legs. These calls are single-threaded as parallelisation is applied at the level of events in simulations. An analytic expression for 2 → 4 is not available. The NN is comparable to the analytic call at 2 → 2, 50 times faster at 2 → 3, then 10 5 times faster than the 2 → 4 numeric call.

Computational setup
In this paper, we build on previous work which sought to demonstrate the viability of using NN-based approaches to approximate matrix element values for hard scattering processes [60]. In that work, a NN ensemble approach was presented in which a different NN is trained on each soft and collinear region of phase space, and was shown to be effective in handling IR divergent structures at both the Born and one-loop level at high multiplicity in e + e − collisions. We extend this to more complex 2 → 3 and 2 → 4 gluon-initiated diphoton amplitudes, while also showing the ability for these ML models to interface with existing event generators such as Sherpa [6,7]. This is important to demonstrate since it is not immediately obvious that NN approximations trained in isolation will be robust to the added intricacies of event generators which are important for extracting physical results, such as PDF weighting and choices of integrators. We begin this section with an overview of the setup for training our ML approach based on those presented in Ref. [60] and finish with a discussion on interfacing with event generators.
3.1 Phase-space partitioning IR divergences arise from soft and collinear real emissions and integrals over massless partons appearing in virtual corrections. Our ML approach requires the isolation of real emission IR singularities such that a different NN can be trained on each soft and collinear region [60]. To do this, we partition the phase space into divergent and non-divergent regions and then subsequently sub-divide the divergent region according to the FKS subtraction [68,69]. The original implementation of this method was for e + e − → qq + jets collisions in which the IR singularities only appear in the final state. In the processes discussed in this paper, IR singularities appear in initial-initial, initial-final and final-final state pair combinations. We therefore extend the partitioning to these states for hadronic processes as specified in Ref. [69].
As in Ref. [60], we parameterise our phase space according to the Lorentz invariant y ij = s ij /s 12 , where s ij = (p i + p j ) 2 . The PDF convolution creates a non-fixed partonic centre-of-mass energy √ s 12 .
We now define the divergent and non-divergent regions of phase space as R div = {p | min(y ij ) ≤ y p , p = (p 1 , p 2 , . . . , p n ), i, j ∈ {1, . . . , n}} , (3.1) R non-div = {p | y p ≤ min(y ij ), p = (p 1 , p 2 , . . . , p n ), i, j ∈ {1, . . . , n}} , where y p is fixed for the process (see Appendix B), p is a phase-space point consisting of the incoming momenta {p 1 , p 2 }, and the outgoing momenta {p 3 , . . . , p n }, where n is the number of particles in the process. The FKS pairs are then defined as Finally, partition functions are used as in Ref. [60] where σ represents the cross section for the process. These partition functions are then used to weight the corresponding matrix elements of points lying in R div , and we train one network in this region for each S i,j (see Section 3.2 and Ref. [60]). The above definitions are appropriate for the processes we studied here as they account for the different singularity structure than that found in the case of e + e − (i.e. they explicitly include the initial-state singularities). To allow for easier generalisability to other processes, we include all pairs of initial-and final-state particles in our implementation, including the {γγ} pair which is redundant as it does not exhibit the relevant singularity structure. This redundancy does increase the computational time required; however, we find the performance of the NN ensemble is not adversely affected. The above implementation could therefore be simply generalised to the e + e − case, although again with some redundancy.

Neural network setup
Now that we have described how we partition the phase space for data generation, we will discuss the NN setup, and the data processing required to train and test our methodology, in more detail. While the focus of this paper is the use of the NN ensemble method originally presented in Ref. [60], for completeness we present a comparison of this method against a naive (single network) approach in Appendix C.

Data
The sampling of phase space is dependent on the integrator. Unless otherwise specified, the same integrator is used for training, validation and testing. We generate the datasets from two runs of the integrator: the first is divided into training and validation datasets according to a 80:20 split; the second uses a different random seed than the first and is used for the test dataset (note that this second stage is only performed when evaluating the performance of our ML approach). Phase-space points, and their corresponding matrix elements generated by NJet, are extracted during these stages after the cuts have been applied. For consistency with Ref. [60], we train on 100k points and test on 3M. Input data consists of the initial-state 4-momenta, x ∈ R 4n , and the target data is the corresponding (weighted) matrix element, y ∈ R.
Once the training data has been generated, we split the data into divergent and nondivergent regions according to Equation (3.1) and Equation (3.2), and create copies of the points in R div , each weighted by S i,j for (i, j) ∈ P FKS , which are used to train a different network in the NN ensemble for each soft and collinear divergent structure.
Before training the NN ensemble, we preprocess the data by standardising each variable input, i.e. we ensure that the training and validation input distributions for each variable have a mean of zero and a standard deviation of one. The output variable distribution is similarly standardised. During the testing phase, we insure the testing data is standardised to the training and validation dataset distribution parameters as is customary in ML literature. Once the model has been used for matrix element estimation, the output is destandardised to obtain the final value. Varying types of data processing are used for hyperparameter tuning (see Appendix A for more details).
All our simulations use √ s 12 = 1 TeV; the methodology is agnostic to this choice.
Although we test the performance of our models on different cuts, unless otherwise specified all models are trained, and analyses performed, using the following kinematic cuts adapted from those in Ref. [14] p T,j > 20 GeV where p T = p x 2 + p y 2 (beam along z-axis) is transverse momentum magnitude, R = (∆η) 2 + (∆φ) 2 is isolation cut cone radius, η is pseudorapidity, φ is azimuthal angle, γ denotes a photon, photons are ordered by p T , and jets, j, are identified through the anti-k T algorithm [70] implemented in FastJet [71] with R = 0.4. These cuts are typical for LHC analyses. Photons are selected by smooth cone isolation [72] such that all cones of radius Matrix elements are evaluated with renormalisation scale µ R = m Z with physical constant values α(Q 2 = 0), α s (m Z ), and m Z from the PDG [73]. Since the one-loop process is LO, the full amplitude is finite and has µ R dependence in the couplings only.

Architecture
For optimal results, a different NN architecture construction would be fine-tuned to each choice of setup parameters, e.g. integrator, cuts, and process. However, the required computational and time resources required to perform this optimisation make this highly impractical. Instead, we further test the generalisability of the hyperparameter choices made in Ref. [60] to this new set of processes. In addition, hyperparameter tuning was performed to assess how optimal this approach is relative to the ideal scenario where hyperparameters are process specific, and found the original setup to be among the most optimal (see Appendix A for more details).
In summary, we use the same fully-connected NN architecture for every network in the ensemble. These are parameterised using Keras [74] and a TensorFlow [75] backend, with the number of input nodes equal to n×4. 1 The hidden layers comprise of 20-40-20 nodes and there is a single output node. All hidden layers use hyperbolic-tangent activation functions and the output node has a linear activation function. A mean squared error loss function is used, and the network is optimised using Adam optimisation [76]. Finally, the number of training epochs is determined through Early Stopping (see Section 8.1.2 of Ref. [77]), tracking the validation loss with no minimum change requirements. As in Ref. [60], to minimise the limiting effects of using a validation set containing only 20% of the original training set, we train with a patience of 100 epochs.

Interfacing with event generators
Assessing performance after interfacing with existing event generator technology is important for demonstrating 'real-world deployment' of ML algorithms in particle physics simulations, as it exposes the model to a range of post-inference effects which may alter the final reliability of the model. For example, generators allow for the easy implementation of complex phase-space cuts, jet clustering algorithms, phase-space and PDF weights, as well as different integrators and integration optimisation routines.

The interface
Event generators are largely written in C++ for computational efficiency. Therefore, after the model has been trained, the weights of each NN are extracted and written to file. A C++ program reads these models files and performs the linear algebra operations required during the inference step using Eigen [78]. This means the Python libraries used for model training are circumvented, and the call time for model inference is reduced, while keeping everything in C++ simplifies the interfacing of the model with standard event generators.
Given a set of 4-momenta, a custom C++ interface provides the helicity-and coloursummed matrix element to Sherpa. This can be used to call NJet evaluations through a BLHA interface [79,80] or to call the model inference result. Rivet [81,82] is then used for analysis, using a script adapted from the reference analysis of Ref. [83].

Phase-space integration
Phase-space integrators seek to achieve increasingly optimal rates of integration convergence though the careful sampling of points. While the choice of integrator can affect the overall rate of convergence, it also determines the placement of phase-space points which directly feeds into the distribution of points in the training dataset.
Since these processes are loop-induced, for simplicity we use the RAMBO integrator [84] throughout for event generation. However, we test different approaches to generating the integration grid. The first we term the 'unit grid' which is constructed by running the grid optimisation step while returning a unit value in place of the matrix elements. This effectively removes the dependence on the optimisation procedure and, since RAMBO is used, ensures a uniformly and isotropically sampled phase space. The second uses VEGAS [85,86] optimisation when generating the integration grid, thereby putting a preference on sampling regions of particular importance to the cross section. We share the integration grid between training and testing phases, meaning this importance sampling is reflected in both. Given the expense of matrix element calculation for the 2 → 4 scattering process, we reserve the use of VEGAS optimisation only for the 2 → 3 case.

Weights
When training the models, we do not include explicitly any event generator effects. All additional weightings, i.e. phase-space weights and PDF weights, are introduced after the model has been used for inference as is done for other matrix element generators. The addition of these weightings has the potential to be problematic for model performance: when the model is trained it is unlikely to learn all regions of phase space equally well and there is a chance that those regions in which the model has poor performance could be amplified by these additional weighing factors.
In order to test for this we include PDF weights using the LHAPDF library [87] and the NNPDF3.1 set NNPDF31_nlo_as_0118 [88] as well as phase-space weights which depend on the integration grid optimisation method.

Reweighting
The approach used in this paper to train the NN ensemble provides good agreement between the network output and that of NJet. However, the ensemble approach will always be an approximation and is subject to perform poorly in certain regions of phase space, especially those in which it has not been trained on or in which training data is sparse. As a partial remedy to this, we propose the idea of reweighting the event weights with known matrix element values derived from NJet. When using weighted event generation this can either be performed after event generation or can be done 'on the fly' at the interface level. This former approach is possible since the original event weight can be recovered through where w NJet E,i and w NN E,i are the event weights using NJet and the NN ensemble respectively for a given phase-space point i, and M NJet i and M NN i are the associated matrix elements. As the ratio NJet/NN is not known a priori, we must construct criteria on which to reweight. Specifically, we explore the following: 1. A random sample of points (e.g. 10%) regardless of where they are in phase space; 2. A priori stating which regions of phase space in which to reweight and then doing so either randomly or over the entire region; 3. Using the NN uncertainties to inform reweighting, e.g. points with large uncertainties are reweighted.
There are several factors informing which approach is the most appropriate. The first of these is the added compute time required: all of these techniques necessitate the calculation of the matrix element by an analytic or numerical evaluator and therefore limit the desirable number of points to reweight. The second is the performance gain and confidence in the output in certain regions of phase space. If the analysis being performed is specific to an under-sampled region of phase space, such as distribution tails where the network may under-perform due to divergent structures in the matrix element, this could be an especially important region in which to reweight. However, if general process explorations are being performed, meaning all distributions and cross sections are of relative equal importance, then a less restrictive reweighting on regions of phase space may be optimal.
In this paper, we explore the application of reweighting to the processes described in Section 4. While reweighting is not always found to be necessary given the performance of our methodology, we demonstrate how it can be applied and discuss which reweighting criteria show the greatest performance gain.

Results
In this section, we present the results of our experiments for the 2 → 3 and 2 → 4 gluoninitiated diphoton amplitudes. As the former is significantly less computationally expensive, we use this for a deep analysis and exploration. The proposed pipeline for using our ML set up and interface with event generators is as follows: 1. Generate an integration grid; 2. Use this with a matrix element provider to generate training and validation datasets; 3. Train the model; 4. Use the model to estimate the values of the remaining phase-space points for event generation while using the same integration grid; 5. Reweight as necessary; 6. Obtain final results.
To assess performance, we also evaluated matrix elements with NJet in parallel with the models, with different random seeds.

gg → γγg
First we investigate the performance of our methodology on the loop-induced gg → γγg process. Following the procedure outlined above, we use a unit integration grid, choose a random seed to generate the training and validation datasets with, and use another to infer on the trained model. Figure 2 shows the performance of our trained NN ensemble at the matrix element level, here represented as the ratio of the model inferred values to the NJet evaluations. The errors form a narrow and approximately symmetric unit-centered distribution, thus demonstrating that the ensemble method has a reasonable per-point accuracy. The slightly elongated right tale of the distribution is due to large matrix element values in highly divergent regions of phase space, yet these points are in the minority.
Once the ensemble is trained, it is converted to be called by the event generator interface which allows for the calculation of the cross section and differential distributions.  Figure 2: NN/NJet errors for the 2 → 3 scattering process using a unit integration grid.
shows the results of the cross section derived using NJet and the NN ensemble. We see that these two approaches are in excellent agreement, with the ensemble result overlapping within one standard deviation of that calculated by NJet. The errors on the NJet values are the MC errors, and the errors on the ensemble are precision/optimality uncertainties. The latter are calculated by training multiple ensembles with different random seeds in the weight initialisation, and in the shuffling of the training and validation datasets. MC errors are quoted to one standard deviation and the precision/optimality uncertainties to one standard error on the mean. A more in depth description of this uncertainty analysis can be found in Section 2.3 of Ref. [60].
The error plot and cross-section calculation provide good evidence for the performance of the NN ensemble method both in its ability to learn the distribution of phase-space points on average, as well as its robustness to being integrated into a wider event generation framework with additional phase-space and PDF weights. To further test the methodology in a more relevant way to how it would be used in practice, differential distributions can be used to assess robustness as they more explicitly expose performance on the divergent and tail events.  Figure 3: Differential distributions normalised to the cross section for the 2 → 3 process comparing NJet (red) with the NN ensemble (blue). The NJet results are quoted with MC errors, and the NN results with precision/optimality uncertainties calculated as described in Ref. [60] but which are negligible in comparison. Pseudojets j i and photons γ i are ordered by energy, ∆φ is azimuthal separation, R-separation is defined in Section 3.2.1, and m γ 1 ,γ 2 and ∆η γ 1 ,γ 2 are the mass and pseudorapidity separation of the diphoton system.  Figure 4: Effect of reweighting points in the divergent region of phase space, R div , on the ratio between the reweighted cross section, σ (RW) , and the cross section calculated using NJet σ (NJet) for the 2 → 3 process. In this case, the divergent region comprises approximately 7-9% of the total phase space (see Appendix B for details). The red band shows the MC error on the NJet result. Figure 3 demonstrates the performance of the NN ensemble in comparison to NJet in six differential slices of phase space. These include p T , angular, and diphoton system distributions which have been chosen to give a range of realistic constructions exploring different regions of phase space. In general, the NN ensemble is found to be in good agreement, particularly around the peaks, with the majority of the NN bin values being with the NJet MC error. The normalised NN uncertainties on the differential bins is negligible in comparison to the MC error. Strong performance is pronounced in the pseudorapidity distribution which shows variation at the percent level. The p T and angular distributions show more fluctuations in the tail events, with the diphoton mass demonstrating the greatest deviations in these regions. However, despite these differences, fluctuations are clearly statistical rather than systematic meaning agreement will increase as the bins are aggregated. This is to be expected given the strong cross-section performance.
The results presented so far have been derived from a NN ensemble trained and tested on the same integration grid and on the same cut parameters. However, in phenomenological explorations it is common to study a range of cut parameters, especially when measuring the effects of new phenomena. Since the NN ensemble performs well at the per-point level (as shown in Figure 2), it should also be able to generalise to different cut parameter configurations. Specifically, the ensemble should still be applicable to harsher cuts than those used in training because the it expects the training and testing datasets to be drawn from the same statistical distributions. However, in the event that cuts are relaxed in comparison to those the model was trained on, reweighting could be employed for the relevant additional subset of points thereby guaranteeing the expected values in these 'unseen' regions of phase space. Table 1  NN ensemble with harsher cut values than the baseline. The agreement between the two approaches is comparable to the agreement found before the additional cuts were added, thereby suggesting good generalisability. Indeed, this is not surprising since the points with the largest errors between the NN and NJet were the most divergent points and therefore the ones more likely to be cut given the IR singularities present in these processes.
The generalisation to additional cut parameters both demonstrates the robustness of this training regime, as well as the practical gain in not having to retrain a network for each specified set of cuts. This allows us to generalise the training and testing procedure outlined at the beginning of this section to suggest that the NN ensemble be first trained on more relaxed cuts and then, as iterations of harsher cut parameters are explored during analysis, these can be applied without the ensemble significantly decreasing in performance. If cuts are to be relaxed then reweighting could be used to ensure good performance at the expense of compute time.
While the network performance has been shown to be strong overall, other reweighing methods can still be explored. Reweighting randomly across all phase space, even at the 20-40% level, was not found to significantly reduce the difference in the computed cross sections. Similarly, the NN ensemble uncertainties were not found to be correlated with the errors, and so were discarded as a good reweighting criteria. As mentioned above, the points in which targeted reweighting can be most beneficial are those which fall within the divergent regions of phase space. Figure 4 presents the results of reweighting points randomly in R div (as defined in Equation (3.1)), and shows an improvement in the cross section -reweighting a greater number of points enables the reweighted cross section, σ (RW) , to converge to the value calculated by NJet, σ (NJet) . Indeed, to achieve almost equal values in the cross sections, the total proportion of phase space requiring reweighting is at the percent level. Therefore, we find reweighting in the R div region of phase space and/or when relaxing cuts in relation to those used during training can improve model performance.
Finally, although the cross section and differential distributions provide a means to test the robustness of our approach against the additional weights introduced during event VEGAS grid gg → γγg ∆ = NN NJet Figure 6: NN/NJet errors for the 2 → 3 scattering process using a VEGAS optimised integration grid.
generation, we can more explicitly single out the effects of the PDF weights by calculating the NN ensemble error as a function of the momentum fractions, x 1 and x 2 , of the initial state partons. Figure 5 shows the root mean squared error (RMSE) of the NN as a function of these variables, along with the frequency of points as they appear in the training dataset. As expected, the ensemble performs better in locations with more points, and we only see the RMSE grow more significantly in the regions of low-statistics. Since the gluon PDF falls off as x approaches one, and peaks in the low x region, this provides another test of ensemble robustness during the external introduction of PDF weights.

Aside: VEGAS grid optimisation
The results presented so far have used a unit integration grid and RAMBO integrator in order to be process agnostic in the phase-space sampling. As mentioned in Section 3.3.2, however, it is common to use importance sampling and other optimisation techniques to speed up integration conversion. To test the robustness of our approach to these alternative integrators, we use VEGAS during the optimisation grid generation stage. Figure 6 shows the error plots for the 2 → 3 scatting process using this optimisation setup while keeping all other parameters setup parameters fixed. Here, we see that the shape exhibited in the error plots is similar to that of the unit grid shown in Figure 2, although slightly broader around the peak. This is likely due to the the larger number of points placed in the divergent regions by the VEGAS integrator. The cross section was also found to be in excellent agreement, with NJet giving 4.151 × 10 −6 ± 1.1 × 10 −8 pb, and the ensemble giving 4.22 × 10 −6 ± 8 × 10 −8 pb.

gg → γγgg
We now turn to investigate the gg → γγgg process. Analytic expressions for this process are not available and the numerical implementation is significantly more computationally expensive than for the equivalent 2 → 3 process (see Section 4.3). Integration grid optimisation is therefore highly inefficient, and so for the remainder of this section a unit grid will be used. To test generalisability, the NN setup is as in Section 4.1, with the only change being in the chosen value of y p = 0.001. At higher multiplicity, a greater proportion of points fall within the divergent region, R div , however, this can hinder model performance by unbalancing the training regime. It is therefore reasonable to aim to keep the proportion of points in this region approximately constant throughout our experiments which is achieved by lowering the value of y p (see Appendix B for more details). Figure 7 shows the performance of our trained NN ensemble at the matrix element level. As expected, the performance has decreased relative to the 2 → 3 process shown in Figure 2, yet the error distribution is still found to be approximately Gaussian, although with a shifted mean. Despite this, the cross section calculated using the NN ensemble -4.5 × 10 −6 ± 6 × 10 −7 pb -is found to be in excellent agreement with that derived from NJet -4.9 × 10 −6 ± 5 × 10 −7 pb. This suggests that although there are several points where the ensemble approach performs poorly, particularly in comparison to the 2 → 3 process, these are largely in the divergent region and found to not affect the cross-section calculation too greatly. Figure 8 shows the performance of the ensemble approach in six differential slices of phase space. As in the previous example, the ensemble is found to perform well relative to NJet: while noise in the tails of the distributions is still observed, these appear to be reduced in comparison to the 2 → 3 process. This further supports the assertion that the points where the ensemble performs poorly are suppressed.
Given the difference in cross-section values calculated using NJet and the ensemble approach, we perform reweighting in the divergent region as discussed in Section 3.4 and Section 4.1. As shown in Figure 9, reweighting in this region can bring the NN ensemble derived cross section closer to the value calculated using NJet. In the case of the 2 → 4 process, the MC error on the NJet result is significantly larger for the same number of points compared to the 2 → 3 process. Given these larger error, and that the ratio σ (RW) /σ (NJet) resides within these errors, it is predictably noisy, yet still converges showing that this approach to reweighting can be generalised across multiple processes.   Figure 9: Effect of reweighting points in the divergent region of phase space, R div , on the ratio between the reweighted cross section, σ (RW) , and the cross section calculated using NJet σ (NJet) for the 2 → 4 process. In this case, the divergent region comprises approximately 2-3% of the total phase space (see Appendix B for details). The red band shows the MC error on the NJet result.

Timing
We repeat the performance evaluation of Figure 1 with methods involving error estimation as these are likely to be employed in real-world usage. For conventional techniques, the dimension scaling test is a standard way to estimate error on the result and introduces a second matrix element call for each phase-space point evaluation. As discussed in Section 4.1, we propose running 20 NN ensembles for each point to obtain a mean with standard error. The results, shown in Figure 10, demonstrate the per-point speedup in using amplitude NNs in practice. For the 2 → 4 process, where amplitude calls dominate conventional simulation time, a 10 4 times speedup in amplitude calls is observed which renders the inference stage as negligible in the total time of our NN-based simulation pipeline. Indeed, in comparison to the numerical calculation of the matrix element, the training time of the NN ensemble can also be considered negligible, meaning the total speed up in the overall simulation time is of the order N infer /N train -the ratio of the number of inference point to the number of points in the training dataset.

Conclusions
In this article we provided further evidence that NNs provide a general framework for the optimisation of high multiplicity observables at hadron colliders. We extended preliminary studies [60] for electron-position scattering to hadron-hadron collisions and provided a general interface to the Sherpa MC event generator for NNs trained with the NJet amplitude library. We focused on the loop-induced processes gg → γγ + n(g) which cause problems for conventional phase-space generation methods and require the computation of expensive scattering amplitudes.
We trained an ensemble of NNs which divide the scattering amplitudes into IR divergent sectors according to the FKS mapping and find excellent agreement between distributions generated with the networks and those generated with the conventional approach. Errors from the NN were included through variations of training parameters and show good agreement with the direct comparisons. We also showed that by reweighting the generated events according to their divergence structure, the accuracy of the simulation could be improved at a rather low computational cost. This step also provided additional confidence in the inference of the network.
We saw, especially for the 2 → 4 process, a good improvement in the total simulation time. Since the calls to the scattering amplitudes dominated the total time, the speedup was given by the ratio of the number of points used in training to the total number of calls used in the full simulation (during event generation). This came out around a factor of 30. While this was good to see, it is not the limit of the optimisation. If the trained networks can be used for many subsequent simulations with different kinematic cuts, the overall improvement would be much greater. We showed that our networks reproduce distributions with different cuts in the transverse momentum without the requirement for retraining, which was very encouraging. Additional improvements to the inference stage (event generation with the network) could be provided through GPUs which could be important if a large number of variations in the cut analysis were required. The distribution of the trained networks was also simpler than the large quantity of data generated with Root Ntuples [89,90], another technique for optimising the information that can be extracted from expensive simulations.
There remain open questions of course. It would be very interesting to apply the technique to the more intricate problem of real radiation event generation since NLO and NNLO simulations are often dominated by these contributions. It may also be beneficial to make connections between the amplitude-level approach taken in this article and those focusing on phase space or complete simulation including parton showering and even detector simulation.
We hope that these studies will help to develop a general framework that can be used in future experimental analysis.

A Hyperparameter tuning
Hyperparameter tuning was performed on a dataset of 1M points (derived independently from the datasets used for validation and testing in Section 4) to explore optimal data processing and model parameter choices. Given the computational expense of generating data, this was only done for the 2 → 3 process.
We tested different model architecture constructions (changing the number of hidden layers and/or the number of nodes in each hidden layer), data preprocessing methods, and model loss functions. All other training parameters are as described in Section 3.2. For data preprocessing methods, we tested input variable standardisation, i.e. the training and validation data input variables are each standardised to have zero mean and unit variance, and normalisation, i.e. the training and validation data input variables are each normalised according to min/max normalisation where x ∈ X train , X train ⊂ R is the set of training data for a given input variable, and x * is the input variable normalised from x. This procedure means the dataset is normalised such that x * ∈ [0, 1] and therefore encourages a positive-definite output. When using the standardisation preprocessing step, we use hyperbolic-tangent activation functions in the hidden layers, but for normalisation we use rectified linear units (ReLU) [91]. This latter choice is to further encourage a positive-definite output and also aims to increase the rate of convergence. It should be noted that a clear limitation of the positive-definite conditioning of the normalisation procedure is a reliance on the following conditions: where X test ⊂ R is the set variable inputs derived from the testing data, and therefore X train ∪ X test represents the combination of the training, validation and testing sets. Since the performance gain from using the ML approach is that the training and validation sets combined are much smaller than the testing set, the above conditions are likely to break down as n(X train ) n(X test ). Two model loss functions were tested during hyperparameter tuning. The first was the mean squared error (MSE) where n is the number of training points, f : R d → R is the function describing the neural network, x i is the i th d-dimensional input data (here d = 4n), and y i the corresponding target variable. The second is the mean squared logarithmic error (MSLE)  (log(f (x i ) + 1) − log(y i + 1)) 2 . (A.5) Given the problem of approximating matrix element values for complex scattering processes, the target variable can take on a wide range of values spanning several orders of magnitude. These large values can sometimes be especially important to the cross section and so the MSE's penalisation of large outlier values can be beneficial; however, this might also make the training unstable. We included the MSLE during training to test if reducing the sensitivity to large scale variations in the target value is beneficial.
The results of the hyperparameter tuning can be found in Table 2. Here we see that using data standardisation with an MSE loss function generally produces better results, although there does not seem to be a clear dependence on the model architecture. Given these findings, we choose to train our models using data standardisation with hyperbolictangent activation functions, a MSE loss function, and an architecture of 20-40-20. This is consistent with that presented in Ref. [60]. gg → γγg gg → γγgg Figure 11: Proportion of the training dataset in the divergent region, R div , as a function of y p for the 2 → 3 and 2 → 4 process.
The choice of y p defines the partition between the divergent region of phase space, R div , and the non-divergent region, R non-div (see Equations (3.1) and (3.2)). While it may be assumed that having more points in each region is helpful since it provides more data for the networks trained in each region, this is not always the case. Including a mixture of points in the training dataset, with large imbalances in the distribution of different scales, can make the network optimisation procedure increasingly noisy. For this reason, we seek to choose a value of y p which provides a balance between having enough divergent points to learn those regions well, whilst not providing too many points not in this limit and which share similar scales to points in the non-divergent regions of phase space.
To be consistent with prior work [60], we initially chose y p = 0.02 although the number of points falling into R div depends on the multiplicity of the process. As presented in Section 4.1, this value was shown to perform well, 2 yet the same value would place a significantly greater proportion of points into the divergent region when another external leg is added (see Figure 11). Instead of choosing the same value of y p for all processes, we aim to select a value which keeps the proportion of points in the divergent region at the level of 2-8% of the whole phase space sampled. We choose a value of y p = 0.02 for the 2 → 3 process, and y p = 0.001 for the 2 → 4 process. 3

C Comparison with the naive setup
Throughout this paper, all results presented using a ML approach have used the NN ensemble methodology. This approach had been shown previously to outperform a naive single NN trained over the whole of phase space for e + e − collisions [60]. In particular, the motivation for this approach was enhanced performance in handling real emission, IR  singular regions of phase space, which similarly occur in the processes studied in this work, especially at high multiplicity. For completeness, we perform a similar comparison on the 2 → 3 gluon-initiated diphoton processes; we do not compare on the 2 → 4 process as it is computationally expensive to do so and it is a natural higher multiplicity extension of the 2 → 3 process. Figure 12 shows the matrix level error analysis of the 2 → 3 scattering process using both a unit and VEGAS optimisation grid. In both cases, the error distribution for the single NN approach has a significantly broader character than the ensemble method. This demonstrates that the findings of described in Ref. [60] are consistent with those presented in this study.