Using neural networks for efficient evaluation of high multiplicity scattering amplitudes

Precision theoretical predictions for high multiplicity scattering rely on the evaluation of increasingly complicated scattering amplitudes which come with an extremely high CPU cost. For state-of-the-art processes this can cause technical bottlenecks in the production of fully differential distributions. In this article we explore the possibility of using neural networks to approximate multi-variable scattering amplitudes and provide efficient inputs for Monte Carlo integration. We focus on QCD corrections to $e^+e^-\to$ jets up to one-loop and up to five jets. We demonstrate reliable interpolation when a series of networks are trained to amplitudes that have been divided into sectors defined by their infrared singularity structure. Complete simulations for one-loop distributions show speed improvements of at least an order of magnitude over a standard approach.


Introduction
Improvements in the precision high energy collider experiments are putting increasing pressure on theoretical predictions. The latest technology for the evaluation of scattering amplitudes, the handling of infrared singularities and Monte Carlo event generation has been able to achieve an impressive range of predictions for differential observables at NLO and NNLO in both QCD and EW coupling expansions. Despite the successes, simulations at the cutting edge of the precision frontier are often extremely computationally expensive.
In this article we explore one way in which popular computer science technology can be used to decrease the computational cost of precision simulations. In particular, we consider high multiplicity scattering processes, with extremely high mathematical complexity, where it is less clear how to make use of conventional interpolation methods such as polynomial fits and interpolation grids [1][2][3][4][5]. Neural networks, however, are by now a standard tool within the data analysis, data science and machine learning communities and offer a general, nonlinear parametrisation which have the potential to approximate any continuous function [6], and therefore could be useful in the context of high multiplicity scattering.
The basic principle is not new of course. Neural networks have the potential to provide extremely fast and lightweight approximations of complicated amplitudes. In Figure 1 we . This demonstrates the very obvious fact that the neural network is fast to call and has a very mild dependence on the number of variables. The challenge is to train the network well enough that it can be interpolated and extrapolated reliably over a complete range of differential observables. demonstrate this for the particular test cases which are the subject of this article, the treelevel and one-loop amplitudes inside the Njet amplitude generator [7] for e + e − →≤ 5 jets. While the potential speed up in the function call is quite striking, the real challenge is not clear from this analysis. The actual improvement in CPU cost must take into account the time taken to train the network to a level that it can be interpolated and extrapolated accurately and reliably.
Previous attempts have been made to use machine learning tools such as Boosted Decision Trees (BDTs) and neural networks for efficient phase-space sampling and Monte Carlo integration [8,9] with recent work [10,11] focusing on the use of coupling layers [12]. Similarly, work such as that of Otten et al. [13] makes use of neural networks for explicit cross-section prediction. Here, the authors focus on pp → 2 jet processes, and implement an Artificial Neural Network point selection (NNPS) scheme for selecting training data based on the points the network struggles to learn the most. In addition, there has been much work on the use of Generative Adversarial Networks (GANs) [14], and other generative models, for event generation [15][16][17][18][19][20][21], while there has been little work addressing the issue of explicit matrix element approximation [22].
In this paper we design a deep learning pipeline to approximate e + e − →≤ 5 jet matrix elements at both LO and NLO, thus exploring processes with significantly higher multiplicity than those previously considered. While [13] uses a more automated approach for phase-space sampling to aid in training a neural network, we employ physics-based knowl--2 -edge of the processes in designing our pipeline and analyse the effectiveness of this approach and what this might tell us about the phenomenological set up. We pay careful attention to the errors and uncertainties in our neural network approximation, and offer a comprehensive implementation of neural network regression analysis.
For usability, we supply code to accompany our methodology and results [23].

Computational setup
We use the on-shell based C++ code Njet [7], to evaluate colour and helicity summed born and virtual matrix elements for e + e − →≤ 5 jets, denoted M (n,0) and M (n,1) respectively.
Njet uses integrand level reduction [24] and generalised unitarity [25][26][27][28][29][30][31] to construct loop amplitudes from tree-level input computed efficiently with Berends-Giele recursion [32]. For a given phase-space point, Njet calculates the virtual and born matrix elements, along with the 1/ and 1/ 2 correction coefficients, from which we can calculate the k-factors: For ease of use, Njet is interfaced with via the Binoth Les Houches Accord (BLHA) [33], which is designed to provide a standardised interface between Monte Carlo tools and matrix element programs.
We explore the performance of various neural network parameterisations of the amplitude for total and differential cross-section computations at LO, as well as their corresponding k-factor equivalents at NLO. We find that as the multiplicity increases, infrared singularities on the edge of the phase-space increasingly cause problems for a single neural network, which struggles to find a good fit across the whole phase-space. To improve the approximation, we divide up the phase-space into sectors according to the subtraction method developed by Frixione, Kunszt and Signer (FKS) [34,35]. Although we do not actually perform subtraction, this phase-space decomposition isolates the infrared singularities and allows the training of networks focused on improving their performance on each partition individually.

Phase-space partitioning
We explore two pipeline configurations: i) we naively train a single network over all sampled points in phase-space; ii) we divide the phase-space into divergent and non-divergent regions in an attempt to partially isolate the infrared singularities and then further sub-divide the divergent region according to the FKS subtraction method, training one network on the nondivergent region, and a different network on each partition. For clarity we will generally refer to the single network, and ensemble of networks, as 'models' and the individual networks comprising these models as 'networks'.
We parameterise our phase-space according to the Lorentz invariant y ij = s ij /s com , where s ij = (p i + p j ) 2 and s com is the centre-of-mass energy of the incoming particles, and define all cuts with respect to this quantity. Let the global phase-space cuts be denoted y cut and the partition dividing divergent and non-divergent regions be at y p . Using these -3 -two scales, the divergent region, R div , and the non-divergent region, R non-div , are defined as follows: where p is a phase-space point consisting of the initial state 4-momenta, p a and p b , and the outgoing momenta, {p 1 , p 2 , ..., p n }, where n is the number of jets.
In the FKS subtraction formalism, the phase-space is divided such that the kinematic regions resulting from each partition contain only a specific subset of singularities. In order to achieve this, a set of ordered pairs, known as FKS pairs, are introduced. In our case of e + e − →≤ 5 jets we define these as: where n g is the number of gluons in the process. We then construct a partition function similar to that of [36,37] (for a brief introduction to different FKS pair definitions and partition choices see Appendix B): such that: where, in this example, σ (X) represents either the Born cross-section, σ (B) , the virtual correction, σ (V ) , or the k-factor, σ (K) .
To demonstrate this partitioning effect we analyse the process e + e − → qqg. Here, we can isolate each of the two FKS pairs {qg,qg} and weight all phase-space points in the divergent regions according to the behaviour of S i,j for each pair. The first pair corresponds to either the quark and gluon going collinear or the quark or gluon going soft. Since we cannot have soft quarks, this FKS partition only contains the singularities for the soft gluon and collinear quark and gluon. The behaviour of the FKS partition function, S q,g can be clearly seen in Figure 2, where we observe increasingly highly weighted points as s qg approaches 0.
An advantage of this method is that the interpolation between singular regions is smooth since they add together to produce the overall cross-section (see Equation 2.6). 1 By Figure 2: Behaviour of the S q,g FKS partition function weighting the matrix elements in this way, phase-space points closer to the q||g singularity contribute with increasing significance to the corresponding neural network's loss during training. A similar analysis can be performed for the second FKS pair in this process.
Since the FKS pairs are ordered, the upper bound on the number of pairs for our processes is: where n j is the number of jets and the −1 comes from the fact that {qq} is not an FKS pair by definition. It should be noted that the number of pairs can be reduced in reality due to the symmetric behaviour of all gluon-gluon, or quark-gluon pairs; however, for simplicity we partition into N max regions. For example, in the example above, N max = 2 but N = 1 since the behaviours of the two pairs in this process are identical.
After using the FKS partition function to divide the region R div , we are left with N max + 1 regions in total across which we train the same number of networks. We find that setting the scale to y p = 0.01 is generally applicable to all processes analysed.

Neural network setup
We compare the performance of two neural network setups, firstly a singular network is trained over the entire uniformly sampled phase-space, and secondly an ensemble of N max +1 networks are trained over the partitioned phase-space.

Data
The phase-space is uniformly sampled using the RAMBO algorithm [38], with each point initially having a weighting of unity. At LO, we train the single network model on data generated from sampling over the entire phase-space uniformly, whereas we train the ensemble model on samples drawn equally from the divergent and non-divergent regions. 2 At NLO, due to the computational expense of virtual matrix element calculation, the phasespace is uniformly sampled as a whole and then divided into R div and R non-div regions after sampling. RAMBO was chosen for its simplicity, the ease with which it can be altered to our specifications, and because it highlights interesting pitfalls and difficulties in highdimensional functional approximations (see more on this below). In total we generate 500k phase-space points for training at LO, but only 100k at NLO due to the complexity of the problem.
The infrared poles in the matrix element result in singularities. Neural networks for classification tasks have been repeatedly shown to perform better when datasets are balanced, thus helping to avoid bias in the classification. Balancing can be done through a variety of methods such as over and under sampling, as well as loss function weightings. In regression tasks, the equivalent to class imbalances are under sampled regions that behave significantly differently to the rest of the sampled space. When doing explicit numerical calculations of the matrix element, these imbalances are not such an issue and their effect when calculating observables can be estimated by the Monte Carlo error and by phase-space resampling, yet they become significant when training a network. Through balancing the training datasets in the divergent and non-divergent regions, and using the FKS partitioning method as outlined above, we hope to address the issue of underrepresented regions.
Increasingly sophisticated non-machine learning based methods for phase-space sampling using adaptive methods [39][40][41][42][43], including the use of recursive stratified sampling [44] and integrand factorisation [45] have been developed. Similarly, importance sampling methodologies specifically designed for QCD antenna generation exist to better capture these divergent regions given the physical knowledge of the pole structure [46,47]. RAMBO, however, is indifferent to these variational differences in phase-space, giving a more naive sampling, yet the ability to construct an interpolation function from a uniformly sampled phase-space means we save computational time during the sampling stage. Although performance of our approximation may be increased using these more sophisticated methods, demonstrating sufficiently good results while requiring only the use of simple sampling techniques like RAMBO further shows the power of our method and the additional time savings it can offer.
Once the phase-space points are generated, we use Njet [7] to calculate the corresponding squared matrix elements at LO, and the virtual correction terms at NLO, for e + e − → Z * /γ → qq + n g . We calculate all quantities in the four-dimensional helicity (FDH) scheme, assuming all external legs to be massless, with the number of light quark flavours set to n f = 5, and we use the same renormalisation scale as in [48] (see more details in Section 2.3).
When training the network, the dataset is split in an 80:20 ratio for training and validation. Furthermore, independently generated, unseen datasets are used in testing.
Through generating many more points for testing than training we demonstrate both the performance of our methodology as an interpolation function, as well as its ability to capture the increasingly sampled divergent structure.
To avoid the problem of vanishing/exploding gradients, we standardise our data to zero mean and unit variance at each input node and across the targets.

Architecture
Choosing an optimal network architecture is non-trivial due to the large number of parameters that can be tuned to an array of criteria. It is common to approach a singular problem using a neural network and thus optimise the architecture for that process; however, while we want to demonstrate the ability of networks to become sophisticated multi-parameter interpolation functions, we require these models to generalise to a variety of processes.
For better generalisability we do not fine-tune a network to any particular process, but rather attempt to employ the same architecture for each process. The neural networks are parameterised using Keras [49] with a Tensorflow [50] backend and comprise of fullyconnected layers with an input layer of (n j − 1) × 4 nodes and output of 1, with three hidden layers made up of of 20-40-20 nodes. The hidden layers all use hyperbolic tangent activation functions and the output node has a linear activation function.
The loss function is taken to be the mean squared error, where n is the number of training points, f : R d → R is the function describing the neural network, x i is the ith d-dimensional input data, and y i the corresponding target variable. The network is optimised using Adam optimisation [51], while the number of training epochs is determined through Early Stopping (see Section 8.1.2 [52]), tracking the validation loss with no minimum change requirements. We recognise that by using a validation set containing only 20% of the original training set, we may be severely limiting the number of points in the increasingly divergent regions, thus skewing our Early Stopping criteria to the less divergent regions. In an attempt to mitigate this, we train with a patience of 100 epochs to measure effects in the loss function significantly later in the training regime; however, at NLO we found that this makes minimal difference to the total loss and so can be reduced to speed up network training.
The inputs to the network are the 4-momenta of n j − 1 jets. Since we fix the centre-ofmass energy for training, we sought to reduce the number of input nodes for more efficient learning. We note that further reductions in the number of input parameters could be made, yet in testing this had no significant effect on performance.

Uncertainty Analysis
The subject of error and uncertainty analysis in machine learning processes is receiving increasing attention (see [53,54] and the references therein), especially in the particle physics community [55][56][57][58], yet too frequently a demonstration of rigorous error analysis in machine learning regression processes is lacking.
-7 -As stated in [53], the main sources of error arise from approximation, aleatoric and epistemic uncertainties. 3 Since we are using deep neural networks, and have tested both deeper and shallower architectural designs, we assume that errors associated with approximation uncertainties are negligible. Additionally, we do not consider aleatoric uncertainties here since our data has been generated through high-precision numerical methods. Moreover, Njet accuracy tests have been performed to measure the stochasticity in matrix element generation and found this fluctuation to be negligible. 4 Following [55] we apply similar methods highlighted for use in classification networks, to this regression task. Specifically, we focus on the measurement of precision/optimality errors which include those arising due to epistemic uncertainties.
We measure model parameter initialisation dependence by retraining models on fixed training datasets while randomly reinitialising the weights. Depending on the observable, the standard deviation in the bins can be measured. Additionally, when sampling the phase-space, the Monte Carlo error is calculated; however, this does not fully account for the uncertainty in phase-space completeness. For this we bootstap the training data thereby resampling the phase-space multiple times and retrain the networks on different datasets, while keeping the weight initialisations fixed. Since in this paper we are comparing neural network output against Njet results, to avoid the double counting of errors we only include Monte Carlo errors on the Njet results.
The performance of our methodologies are also dependent on the test set chosen. For this we quote the Monte Carlo error, although it should be noted that the same issue with determining sampling completeness occurs here. Due to the computational expense of repeated generation of test sets we do not perform this, although the uncertainty bands on the neural network approximations should be sufficient to provide evidence of our methodology since these additional dataset dependancies are negligible given the large number of test points used and the relative size of the computed Monte Carlo error compared to the network uncertainties (see Section 3).
The errors on the network approximation that we calculate are therefore the error due to model initialisation dependence and error due to the size of the training dataset, which are added in quadrature. As noted by Nachman [55], additional sources of uncertainty are inherent in the network approximation that are hard to calculate explicitly, such as dependence on the model architecture (e.g. the number of hidden layers, nodes in each layer and the types activation functions used). Due to the size of the other errors mentioned, and the lack of currently available tools for their calculation, we do not attempt to incorporate errors arising from these uncertainties into our analysis. We quote Monte Carlo error only 3 Approximation uncertainty arises due to the model being too simplistic to allow for complex functional fitting, e.g. too few nodes or hidden layers in a neural network meaning the model isn't able to fit sufficiently non-linear functions. Aleatoric uncertainty accounts for fluctuations in the data distribution e.g. from measurement errors, and cannot be decreased by collecting more data from the same experimental setup. Epistemic uncertainties, on the other hand, account for uncertainties in the model, including lack of sufficient coverage of the data. 4 Njet accuracy tests are performed by inferring on each phase-space point twice and checking the difference in the results. The threshold is set to the default value of 10 −5 and errors arise due to lack of floating point precision and rounding errors.
-8 -for the testing dataset. When presenting our results, we calculate the mean of the multiple models trained and quote both the standard error on the mean, to demonstrate how uncertainty bands might be obtained in practice through the training of an ensemble of models where all models are used, and the standard deviation from the mean, showing the spread of the different trained models and the potential variability in performance if only a single model were to be trained. In an attempt to avoid confusion, we do not plot these uncertainties simultaneously, rather we are specific in figure descriptions as to which one we are plotting. Throughout this paper, we choose to train 20 models for each example, however, this number was chosen in a slightly ad hoc manner, since it gave a reasonable distribution of models, and should not be interpreted as a requirement.
Theoretical uncertainties are also prevalent in all of these calculations due to variability in setting the renormalisation scale, µ. Such uncertainties propagate though the networks since a model will learn to fit data at a certain scale. In this paper we train on data generated at a fixed scale as used in [48]. We perform the normal ad hoc scale variation of µ/2 and 2µ purely to determine the dependence of our methodology on such a scale choice. In doing so we found that the network is able to approximate the matrix elements at each scale equally well to within Monte Carlo error, and we therefore assert that the network performance is not highly dependent on the value of µ in the range we analysed. Moreover, since the goal of this work is not to calculate the cross-section or k-factors of a new process, but to provide tools for estimating such values for already known process, we do not quote these as uncertainties in our methodology.

Results
We test our methodology on estimating both LO cross-sections and k-factors for processes up to e + e − → 5-jets. In addition, various observables are plotted to demonstrate the applicability of our methodology to real calculations. In general, we see that neural network approximations demonstrate wide applicability to the cases investigated, with the FKS partitioning method giving more accurate and stable results due better approximating the infrared singularities.
It should be noted that to retain consistency between the training and testing phases, we sample both datasets in the same way, i.e. the test set used for single network inference has been sampled uniformly over the entire phase-space, whereas the set used in the ensemble case has been split into divergent and non-divergent regions and uniformly sampled equally and independently in each (as described in Section 2.1). This means that, while the ensemble approach has received a greater number of divergent points during training at LO, by sampling the test set in the same way we hope to make the tests equally 'difficult' in proportion to the sampling method used. Throughout all tests, phase-space cuts at y cut are used to regulate the infrared divergences.

Approximations at LO
Although leading order calculations are not significantly computationally expensive, they pose interesting test cases for neural network approximations of high multiplicity processes with many scales and complex infrared singularity structures. Moreover, we find that much of wha can be learnt from the performance of the models here can be applied to the NLO case.
As detailed above, we compare a single network trained over the entire phase-space with an ensemble of networks each trained on N max +1 partitions of phase-space. In determining the appropriate value of the global phase-space cut parameters, y cut , we evaluate the performance of our models by calculating the ratio of the network output to the Njet calculation as well as the network's ability to approximate the cross-section. Figure 3 shows the distribution of the neural network errors by calculating the ratio of the model output and the Njet result at each phase-space point in the test set. Since the ensemble of networks gives much narrower and more Gaussian shaped distributions than the single network approach, we can clearly see that this method is preferable at the level of per-point accuracy. Additionally, the error distributions of the ensemble approach -10 -are also more closely centred on zero, in comparison to the single network approximation, thus suggesting that the ensemble will also produce a better overall average performance as well. Note that these plots to not contain any information about the relative uncertainties attached to these model outputs, which we will discuss below.
While the error plots demonstrate the per-point performance of the models, we also wish to compare their performance in calculating physics observables while also taking into account uncertainty in the data and the model setup. Figure 4 shows the approximated cross-sections of the naive and ensemble networks as compared to those computed from the Njet matrix elements. As expected, we see a harsher y cut value at 5-jets better regulates the divergent regions, thus improving both the single and ensemble network approaches; however, this harsher cut is not fully necessary in the ensemble case as the Njet result sits on the edge of the neural network uncertainty bands.
When approximating the cross-section, we find the uncertainty bands to have very little noise and follow the shape of the average result closely. As we use a mean squared loss function for training, it can be shown that the network will tend towards learning the average of the target distribution (see Appendix A). Since no model will perfectly learn this distribution, for each model there will be an offset, , between the final trained network average and the true distribution average. As the cross-section is proportional to the average over the phase-space, for any value of n, this offset will manifest itself as a distance away from the true cross-section such that: where σ P and σ N are the predicted and Njet calculated cross-section values respectively, and θ is a small noise parameter. This therefore explains the relatively fixed distance between the model unceratinty upper and lower bounds and the Njet result.
Another result of Equation (3.2) is that, unlike Monte Carlo error, inferring on more test points will not reduce the model uncertainties, since these uncertainties are intrinsically tied to the training set and the model initialisation. Any efforts to reduce these errors should therefore be focussed on addressing such uncertainties, as we do by developing our ensemble method, rather than focussing on the test dataset.
In general, the global cuts required for the ensemble approach to be within the Monte Carlo error of the true cross-section are ∼ y cut = 0.01. These cut values are reasonable for our definition of y ij and are equivalent to the cuts made in [9].
After cuts have been made, we see that the ensemble of networks has a significantly reduced standard error when compared with the naive single network approach, with a predicted mean closer to the final stable cross-section. This difference in uncertainty can be understood by comparing the relative standard deviations of the single network and the deviations in the different networks making up the ensembles, as we shall now show.
-11 - Let us first assume that the values of the cross-section calculated using the single network approach, σ s are normally distributed, 5 i.e. σ s ∼ N (µ s , ζ 2 s ), where µ s is the mean of the normal distribution and ζ s is the standard deviation. Secondly, we note that in the case of the ensemble method the outputs of the networks trained over different partitions are first summed (c.f. Equation (2.6)) giving: where N max is defined in Equation (2.7), 6 dσ p is the sum over all weighted matrix elements for a given FKS pair. Since we only partition the divergent region, R div , according to the FKS partition function, we add the differential cross-section over the non-divergent region, dσ non-div . Given that the uncertainties in the individual networks making up the ensemble are expected to manifest themselves in a similar way to the single network approach, we may also assume that these are drawn from a normal distribution such that: Since the uncertainties in the ensemble method are smaller than those found when using the single network approach: From Equation (3.10) we see that not only does the ensemble of models have a reduced uncertainty in comparison to the single model approach, but that each individual model making up the ensemble also has a reduced uncertainty, thus supporting the claim that by using the ensemble method, the models learning the divergent structure are more certain about what they are learning and less sensitive to both model initialisation and dataset size.
The overall accuracy of the ensemble approach, combined with the implications of Equation (3.10), demonstrates that we are learning the divergent structure of the amplitude sufficiently well. As discussed in Section 2.3, it should be noted that Figure 4 does not show the performance of a single model, but rather the average of 20 trained models with their equivalent standard error. Although one does not have to train this many models to still get a good approximation, in Section 3.2 we will see that training additional models is computationally cheap and thus not a large hinderance. For comparison, the performance of training just a single model is shown in Figure 5 as discussed below. Figure 5 shows the differential cross-section of the y ij distribution of the two softest jets as ordered by p T . Here, we still plot the mean of the 20 trained models, but now state the standard deviation from the mean to more clearly show the spread of model performance and the effect of only training a single model. These differential distributions were chosen as they highlight the performance of the models in hard-to-sample regions of phase-space, in particular some of the regions we would expect the FKS partition function to assist with learning. Indeed, we see a significant improvement in comparison to the performance of the single neural network approach when using our ensemble method both in overall per-bin accuracy and stability. In addition, the ensemble method also produces narrower uncertainty bands than the single network approach, thus demonstrating its higher confidence in these regions. While this confidence is seen to be slightly misplaced in the case of the residual peak of the 5-jet plot at y cut = 0.01, we see the harsher cut correcting for this and producing good agreement between the Njet and ensemble results. Similar reasoning as given in Equations (3.4 -3.10) can be applied to the per-bin uncertainty differences between the single and ensemble network approaches.
Overall, the ensemble of networks is shown to produce more accurate and reliable results in LO approximations than the single network approach. While it can be argued that there is greater computational expense in training multiple networks, given the very low cost of network training in comparison to the data generation time this is considered to be negligible, particularly at higher orders (see Section 3.2 for more details).

Virtual Approximations at NLO
When approximating the k-factor, the infrared singularities present in the previous examples have been normalised. This normalisation regulates the number of large divergences in phase-space, allowing the network to focus more on learning the loop-induced divergences. Additionally, although the FKS method is especially useful for isolating soft and collinear divergences at LO, given the presence of log(s ij ) terms in the virtual corrections we still expect to see improvements by using the ensemble method when approximating the k-factor.   Table 1: Time required for k-factor calculation at different multiplicities requiring 1M points while training on 10k and 100k points. These results assume all calculations take place on a single CPU core. Note that the Njet times remain the same as we assume that the training points form part of the inference (see later in this section for more details). Time is quoted to at most 2 decimal places and at 3 significant figures where possible.
As in the LO case, in Figure 6 we plot the error distributions for the single and ensemble cases by comparing the network outputs to the Njet calculations at the per-point level. In the 3 and 4-jet cases we see that both methods perform relatively similarly, with the single network appearing to be slightly better in the case of 4-jets. However, it should again be noted that these plots do not contain information about the network uncertainty and so should not be interpreted as the sole measure of performance.
In Figure 7 we see that both the naive single network approach and the ensemble method approximate the k-factor to within Monte Carlo error at 3-jets, and are within the percent level at 4-jets. Although either methodology would be suitable for use, the ensemble of networks requires little more computational time in comparison to the single network model, while producing narrower uncertainty bands. For robustness at higher multiplicity, the ensemble method remains the more optimal method.
A comparison between the computational speed of different methods of k-factor computation and calculation can be found in Table 1. Here we see a dramatic speed up when using the network approximation as opposed to current numerical methods, with the domi--16 - nant time saving coming from the reduction of the number of matrix elements having to be explicitly calculated using Njet (i.e. in the case of training on 100k points and inferring on 1M at high multiplicity the speed up is O(10)). Moreover, the assertion that the ensemble method is not significantly more expensive than the naive approach can be verified. It should be noted that by only training on 10k points we may have an unacceptable accuracy when compared to the 100k results. The results presented in the table are therefore designed to demonstrate the computational time required for network training in comparison to the Njet calculation, as opposed to providing guidelines on how many training points to use. As in Section 3.1, we plot the differential k-factors of the y distribution of the two softest jets as ordered by p T . In Figure 8 we see that both the single network, and ensemble approach, model the data well, with uncertainties at the 1-2 % level. As before, the ensemble method provides us with slightly narrower uncertainty bands in both the 3-jet and 4-jet cases. Additionally, although neither the single network, nor ensemble approach approximate the peak in the 4-jet distribution exactly, the peak location is more accurately approximated by the ensemble approach with only a single bin at the peak being signifi--17 - cantly ill-approximated. While we do not necessarily see a dramatic improvement in using the ensemble approach, given that the additional training time required is negligible in comparison to the data generation, we still see it as a viable and beneficial method to use for k-factor approximation. It should be noted that similar reasoning as given in Equations (3.4 -3.10) can again be applied to the k-factor and per-bin uncertainty differences between the single and ensemble network approaches at NLO.
Finally, in the case of 5-jets we demonstrate our methodology as it may be used in practice. In Figure 9 we show how one may predict on a set of points with no known Njet results for testing, while understanding the associated neural network errors. From these plots we clearly see that the ensemble method has associated errors only at the level -18 - Figure 9: Normalised NLO/LO k-factor and differential k-factor against y, where y is the minimum y ij as ordered by p T , at 5 jets using just a the neural network ensemble. Data is in the differential plot is normalised to the maximum network output value. Uncertainty bands denote 1 standard error in the k-factor plot and 1 standard deviation in the differential plot, both as a percentage of the mean calculated over 20 trained models.
of 2% in the cross-section, with larger uncertainties in the regions of the differential plot where one would expect Monte Carlo error to dominate.
As highlighted above, when you do not have a test set for comparison, it may be hard to validate the optimal number of training points required for a good approximation. While at NLO we present the results of networks trained on 100k points, and found this number to be relatively optimal with regard to accuracy, stability and training time, we do not claim that this will always be the case for other processes. Although generating more Njet matrix elements for testing is the best way to assess network accuracy, a possible substitute would be to test on the training data. While this is not generally regarded as good practice, given the problem at hand it may not be as bad as in other cases. For instance, unless there is a large degree of noise in the cross-section given the size of the training dataset, as an initial measure of model performance we can quantify the uncertainty in our training set and assess the proximity of our network uncertainties and this Monte Carlo error. Additionally, our network uncertainty calculation depends only on the network's behaviour relative to the training set and is independent of the test set. Therefore, although testing on the training set is still not ideal, given how we calculate our network uncertainties and by using our physics knowledge of the Monte Carlo error, we are able to use this as a first test of network performance without having to generate additional testing data.

Conclusions
In this article we have explored the possibility of optimising simulations for many scale processes needed for LHC analyses. Machine learning technology is finding an increasing -19 -number of applications in particle physics and offers the potential to dramatically reduce the CPU cost of expensive simulations.
The application to scattering amplitudes is a little different to classic examples of neural networks in that the dataset is exact. 7 We can also have complete control over the range of the dataset, although the CPU cost of obtaining the data can be very high. The challenge is to make a sufficiently good fit to the data that a reliable interpolation and extrapolation of differential cross-sections can be made. The CPU cost of the extrapolation/interpolation is negligible in this procedure so the further the network can be extrapolated the better the computational speed up.
In this study we have looked at multi-scale amplitudes which are not well suited to more traditional approximations with polynomial grids. At one-loop scattering for 2 → 4 or higher multiplicity becomes extremely expensive even with modern automated tools. We find that a reliable amplitude approximation can be difficult to achieve when using a single neural network due to the large changes in the amplitude related to its singularity structure. We compare this naive approach to a technique in which an ensemble of networks are use to approximate the amplitude by separating the singularities using an NLO FKS partitioning.
Understanding the reliability of this approach is one of the biggest challenges. By varying the initial data and parameter initialisations used in the network, we find a way to estimate the error on the networks. For all but the highest multiplicity, e + e − → 5 jets, we also provide comparisons to direct integration of the amplitude. At LO we observe that the FKS partitioning provides significantly more reliable and accurate estimates than the single network approach, while in the case of NLO k-factors, where the leading order singularity structure is divided out, the partitioning still helps in these regards with results accurate to within a few %. Moreover, we show in Equations (3.4 -3.10) that each network in our ensemble has a smaller associated uncertainty than that of the single network approach, thus suggesting that the ensemble model is learning the divergent structure with a higher confidence than the naive model. Indeed, this is the case at both LO and NLO. The networks not only provide good scattering amplitude approximations, but also lead to reliable predictions with at least a factor of 10 improvement to the complete simulation.
In this initial study we have made a number of simplifications whose effect could be important when using the technique for a realistic analysis. Firstly, we employed a simple flat phase-space generation using the RAMBO algorithm. This makes it hard to compare with the more efficient generators used in state-of-the-art Monte-Carlo simulations. The JADE jet algorithm may exacerbate the soft singularities and so the effect of alternative jet algorithms, as well as the effect of introducing initial state singularities in pp collisions, should be studied in the future. We also see in the higher multiplicity cases that the error from the neural network approximation does start to increase. It may be in these cases that the NLO FKS separation requires modification. In this study we used a simple version of the partition function based only on the kinematic invariants. In general we can alter the scaling power of the invariants in the various limits which will affect the 7 Technically we restrict to double precision, although higher precision arithmetic could be used in principle.
-20 -behaviour of the FKS regions away from the singularities. We may also find that effects of higher order, double unresolved singularities begin to play a role. Since NNLO sector decomposition strategies are available it would be interesting to explore this direction in the future. Another important step would be to apply the technique to integration of infrared subtracted, real radiation. This case is currently the most CPU expensive part of producing precise differential distributions for comparison with the experiments.