A factorisation-aware Matrix element emulator

In this article we present a neural network based model to emulate matrix elements. This model improves on existing methods by taking advantage of the known factorisation properties of matrix elements. In so doing we can control the behaviour of simulated matrix elements when extrapolating into more singular regions than the ones used for training the neural network. We apply our model to the case of leading-order jet production in $e^+e^-$ collisions with up to five jets. Our results show that this model can reproduce the matrix elements with errors below the one-percent level on the phase-space covered during fitting and testing, and a robust extrapolation to the parts of the phase-space where the matrix elements are more singular than seen at the fitting stage.


Introduction
Neural networks (NN) first appeared in the field of particle physics as a tool for event discrimination in the analysis of collider data. Since then neural networks and other machine learning techniques have proved useful in many other areas of the field. On the theory prediction side they have been used to improve the efficiency of Monte Carlo sampling [1][2][3][4][5], to accelerate the simulation of radiation within a jet [6][7][8], to streamline the processes of generation and unweighting of simulated event samples and [9][10][11][12][13][14][15][16][17][18]. Closer to the experimental measurements they have also been used to emulate detector simulation [19][20][21][22], they can be used to perform unfolding [23] or correcting for detector effects [24], and perform pileup subtraction [23].
The capacity of neural networks to approximate intricate functions have already been used to provide fast calculations of production cross-sections [25,26]. In this article we investigate whether NNs can approximate production cross-sections more differentially by replacing computationally expensive matrix element calculations. The challenge with this endeavour is that matrix elements are plagued with numerous divergences that arise from infrared divergences. In previous works [27,28] a combination of individual neural networks were used to approximate matrix elements. In order to deal with the complex structure of the matrix elements the authors of these studies divided the phase-space into sectors according to the infrared singularities and trained networks on these sectors, thereby limiting the complexity of the fit by isolating a single divergence per sector. All sectors were then combined to make the final prediction. While the authors of the initial study for electron-positron annihilation showed good agreement between the total cross-section and histogrammed distributions both at LO and NLO, we note that the accuracy of the interpolation at the level of individual points was a lot worse than when averaged in the histograms. In addition, the performance of the extrapolation outside of its training phase-space (i.e. more singular configurations than those considered to fit the model) is problematic.
In this article we present a different approach to the emulation of matrix elements that incorporates the factorisation properties of the matrix elements in the interpolation model and therefore is able to safely extrapolate the matrix element in regions more singular than that covered by the data used for the training of the emulator. We find that our interpolator also displays a much improved pointwise accuracy.
The article is organised as follows. Section 2 introduces our factorisation-aware deep neural network model, Section 3 showcases our results with a comparison with the model from Ref. [27] in Section 3.1, and an analysis of the performance of our model in Section 3.2. To show that the neural network is both interpolating and extrapolating well we show its behaviour on random trajectories in phase-space in Section 3.3. Our conclusions are presented in Section 4 and some discussion of specific details are collected in the appendices so as not to distract from the main discussion.
We provide code to reproduce the methodology detailed in this article [29].

Fitting framework
For this work we consider the e + e − → Z * /γ → qq + ng matrix elements for n up to and including 3, which corresponds to events with up to 5 jets. We denote the number of jets in the final state as n j .
We formulate the problem of emulating matrix elements as a supervised regression task with a set of phase-space points' kinematic information as input and the values of the matrix element for each of these phase-space points as the targets. Section 2.3 describes how these matrix elements were obtained.
A neural network can be seen as a function f ( x; θ) = y, where f : R d → R maps a d-dimensional vector x of inputs onto the vector of outputs y, and where θ are the parameters of the neural network which we aim to optimise such that the outputs y of the neural network match the target as well as possible. The simplest implementation of an emulator would be to take the input as the kinematic information of the phase-space point and the output to be the full matrix element. Our approach modifies both the input of the NN and its target, as described in the following sections.

Infrared divergences and dipole factorisation formula
It is well known that in soft and collinear limits the matrix element in n+1-body phase-space factorises into a singular factor and a reduced matrix element in n-body phase-space [30,31]. This factorisation was used by Catani and Seymour [32] to construct subtraction terms for the real radiation part of a NLO calculation. They introduced a factorisation formula with universal dipoles that smoothly interpolates between the soft and collinear limits to capture the singular structure in these regions of phase-space. The dipole factorisation formula can be written schematically as where V ij,k is a process independent, singular factor. It depends on the momenta and quantum numbers (colour and spin) of partons i, j, k, where i is the emitter parton, j is the emitted parton, and k is the spectator parton. For singly unresolved limits, this factorisation isolates all the divergent behaviour in V ij,k and the factor |M n | 2 is free of divergences, which makes it more amenable to emulation through a neural network. The dipole factorisation formula forms the basis of our fitting ansatz which we present in detail in Section 2.2.

Fitting coefficients of Catani-Seymour dipoles
Instead of using a neural network to fit the matrix element directly, we use the dipole factorisation formula to build an ansatz of the colour and helicity summed n + 1-body matrix element, where D ij,k = V ij,k /s ij are the spin-averaged Catani-Seymour dipoles divided by the corresponding Mandelstam invariant and C ijk are the coefficients we train the neural network to fit. C ijk can be interpreted as the reduced matrix element in n-body phase-space. Since the input for the C ijk function is the full n + 1 phase-phase information, the neural network will also model the phase-space mappings usually introduced in the factorisation formula. A schematic diagram illustrating our ansatz is given in Figure 1. The sum over {ijk} denotes the sum over relevant permutations of the external outgoing legs. More detail on this is given in Section 2.4.2. The representation (2.2) is not unique but through appropriate training, the neural network takes advantage of the right ingredients to model the divergent soft and collinear behaviour of the matrix elements. This form of the ansatz allows the neural network to avoid fitting a rapidly varying function over the phase-space, leaving the Catani-Seymour dipoles to reproduce the correct singular behaviour, meaning a single neural network can interpolate a now relatively smooth function over the phase-space.

Data generation
For all multiplicities, phase-space is sampled uniformly using the RAMBO algorithm [34] with a centre-of-mass energy √ s com = 1000 GeV. Phase-space points are subsequently Figure 1. Schematic diagram of our neural network architecture. We have a densely-connected neural network with inputs phase-space points, p, and recoil factors, y ij,k , propagated through hidden layers to the output layer which outputs C ijk . These coefficients are combined with their corresponding spin-averaged dipoles as in (2.2) to produce an approximation of the matrix element. Diagram of neural network generated with the aid of [33].
clustered using FastJet [35,36] with the e + e − k t algorithm [37]. Global phase-space cuts are applied according to the criterion y cut ≤ y ij where y ij are the Mandelstam invariants normalised by s com . Jets are clustered exclusively where d cut was supplied to FastJet. We took d cut = max(2 × y cut , 0.01 × s com ). We explore three different values of the global phase-space cut parameter, y cut = [0.01, 0.001, 0.0001], to demonstrate the ability of the factorisation-aware neural network to effectively interpolate in more and more singular regions of phase-space.
The generated phase-space points are fed to the NJet package [38] to calculate colour and helicity summed tree-level matrix elements. All external legs have been considered to be massless. The strong coupling constant has been set to α s = 0.118, and the electromagnetic coupling constant has been set to α e = 1.0/132.5070. The mass of the Z-boson is taken to be m Z = 91.188 GeV. These parameters and those not listed, are consistent with those in the Standard Model mode of MadGraph5_aMC@NLO [39].
The phase-space points generated form the basis of our inputs to the neural network with the matrix elements as our fitting targets. As with most machine learning applications, we need to demonstrate that our neural network emulator has managed to generalise outside of the training dataset. We do this by firstly testing on an independent testing dataset that is never exposed to the network during training, and secondly by predicting on random trajectories in phase-space. Generation of phase-space trajectories is described in Appendix C. We believe that accurate predictions on random phase-space trajectories demonstrates the ability of the neural network to extrapolate to never before seen data that is of a different nature to both the training and testing datasets.

Neural network emulator
We construct our emulator using a densely-connected neural network built using the Keras API [40] with the TensorFlow back-end [41] with GPU support. A simple model architecture such as a densely-connected neural network allows for quicker training and inference compared to more complicated setups.

Inputs and outputs
Inputs to neural network As mentioned in Section 2.3, phase-space points form the basis of the inputs to our neural network. We input the 4-momenta of each outgoing parton as an (n j × 4) array with each component of its 4-momenta standardised to zero mean and unit variance across the training dataset. Although it would be feasible to omit the energy component of the momenta or leave one outgoing parton out of the training due to our datasets being generated with a fixed centre-of-mass energy, we find that keeping all momenta information to improve the network's performance. With the acceleration of training of neural networks on GPUs, the slowdown in keeping all outgoing momenta components is negligible.
Along with the 4-momenta, we also include the recoil factors as input for relevant permutations of {ijk}. We take the natural logarithm of y ij,k before standardising them to a zero mean and unit variance across the training dataset. We find that the addition of these recoil factors significantly improves performance of our neural network emulator during training and testing. This can be attributed to the fact that the coefficients C ijk rely heavily on the mapped momenta in n-body phase-space, where y ij,k is usually required when performing the momentum mapping, see e.g. [32]. Following our ansatz (2.2), we must provide spin-averaged Catani-Seymour dipoles to the neural network. These dipoles are computed for all phase-space points before training begins, but they are not passed directly to the neural network as input features. Instead, they are only included in our custom loss function which will be explained in the next subsection.
Accounting for spin-correlation in g → gg In addition to the Catani-Seymour dipoles, we include other functions to account for the spin-correlations of g → gg and g → qq splittings which are present in the factorisation formula but averaged out in the spinaveraged dipoles. This effect becomes relevant when there are two or more gluons in the final state. We seek to capture this behaviour by introducing a pair of terms of the form in the fitting ansatz for each gluon pair. The coefficients S ij and C ij are fitted by the neural network along with the dipole coefficients. The angle φ ij is the azimuthal angle of the decay particles in the plane perpendicular to the parent particle momentum. The procedure to obtain this angle is described in Appendix A.
Outputs of neural network Denoting the raw output of our neural networks as c ijk , they are transformed to C ijk according to where S coef = S pred /S dipole . S pred is the prediction scale, taken to be the minimum of the matrix elements in the training set and S dipole is the dipole scale, taken to be the mean of all dipoles in the training set. These scaled coefficients are then multiplied with their corresponding Catani-Seymour dipole D ij,k , and then summed to produce an estimation of the matrix element. The targets are the matrix elements corresponding to the phase-space point inputs in the training dataset. We transform the targets according to to reduce the orders of magnitude the matrix elements span. This performs a similar transformation to taking the natural logarithm except that it remains a valid transformation for negative arguments. We require this transformation to allow negative values as arguments because the coefficients C ijk will not be restricted to only positive values, meaning that during training there is a possibility for the outputs of the network to go negative. By reducing the span of the targets, the neural network is able to more effectively pick out patterns across the entire training dataset rather than a smaller region, helping it to generalise. This technique is employed in other studies, see for example [42,43]. We would like to stress that we do not expect, and have not observed, negative predictions for the matrix element from the neural network, as this would be unphysical. y is finally standardised to a zero mean and unit variance.
Training, validation, and testing datasets In Ref. [27], the authors used 500k training samples to train their models. For a fair comparison of our respective methods, we follow their methodology closely by constructing our models to be as close as possible to theirs and generate training and testing data by using code from their project repository [44] 1 . Our model architectures will not be identical due to the difference in our methods, details of our model architecture are given in Section 2.4.2. Results of this comparison are presented in Section 3.1. While training on 500k samples gives acceptable performance for the total cross-section, per-point accuracy is lacking. We find that increasing the size of the training dataset drastically improves the per-point prediction accuracy. Neural networks have been shown to scale well with large datasets [45] and given that they only need to be trained one time, it is useful to provide neural networks which have been pre-trained with maximum accuracy in mind. In addition to improving the per-point accuracy, we aim to overcome the problem of extrapolating to more singular regions. It is well known that neural networks in general do not extrapolate well [46,47], but with our factorisation-aware model we show that letting the models learn about the infrared structure of QCD alleviates this problem. 1 for our main results we generate all data using our own code
To demonstrate the scaling performance of our model, we present our main results with models trained on more training samples where details of data generation are given in Section 2.3. For each multiplicity and global phase-space cut we generate a dataset consisting of 60 million phase-space points and their corresponding recoil factors, dipoles, and matrix elements. We then split this dataset into training, validation and testing datasets in a ratio of 4:1:1, meaning we have 40 million training samples, 10 million validation, and 10 million testing samples. The validation dataset is used to monitor model performance after each epoch of training, whereas the testing dataset is used as an out-of-sample check of the model's performance after training is complete.

Architecture
We have a fixed base to our neural network architecture that is used for all processes considered in this work, with variations in the number of nodes in the input and output layers due to the change in number of outgoing partons. Our base neural network consists of one input layer, eight hidden layers consisting of (64, 128, 256, 512, 768, 386, 128, 64) nodes, and one output layer.
The number of nodes in the input and output layers scales with the number of jets in the final state. The input layer has (n j × 4) + n rel nodes, and the output layer has n rel + (2 × n φ ) nodes, where n rel is the number of relevant permutations and n φ denotes the number of φ ij angles. Relevant permutations, {ijk}, are the set of permutations for which their corresponding dipole could have the possibility to have a meaningful contribution to the matrix element for a given process. They are a subset of all the possible permutations for a given multiplicity, P (n j , 3). Relevant permutations exclude any permutation where a quark or anti-quark are emitted (i.e. j = q orq), as low energy quarks do not give rise to singularities in our processes of interest. Furthermore, we can remove degenerate permutations where swapping i = g and j = g has no effect as they have identical Catani-Seymour dipoles, e.g. D 34,1 = D 43,1 , so we only keep D 34,1 . The omission of these redundant permutations speeds up training of the neural networks as we have fewer inputs, fewer dot products to compute in the loss function, as well as speeding up inference due to there being fewer dipoles to compute. For reference, we list the number of input and output nodes for each multiplicity we consider in Table 1.
The neural network weights are initialised according to the 'Glorot uniform' distribution as described in [48]. We use the tanh activation function for all nodes in the hidden layers, and have a linear activation function for nodes in the output layer. Initial learning rate is set to 0.001 and training mini-batch size is set to 4096. During training, we reduce the learning rate by a factor of 0.7 whenever there is no improvement in validation loss for 20 epochs. We use the Keras callback ReduceLROnPlateau to achieve this. Model training is terminated after the validation loss does not improve after 40 epochs of training using the EarlyStopping callback in Keras. We find that reducing the learning rate during training helps the model to converge to more optimal parameter sets. Since there are periods during training where the validation loss stagnates, reducing the learning rate helps to reach minima which otherwise wouldn't be accessible due to a too large learning rate. There is a possibility to reduce the learning rate too rapidly causing the model to have suboptimal optimisation, but this is countered by the high patience we set for the ReduceLROnPlateau callback.
These choices of hyperparameters were not results of extensive scanning of parameter space and were chosen heuristically. Hyperparameters can be tuned more optimally using more sophisticated methods, but they all rely on training a large number of models which is computationally expensive and time-consuming.
Custom loss function To assess the model's performance when training we need to compare the network predictions with the targets. Our metric for the model's regression performance is the mean squared error (MSE) where y i is the target for the i-th sample out of N samples, and p( x i ; θ) is the corresponding prediction obtained by combining the dipole factors and the azimuthal dependency terms multiplied by their NN-learned coefficients. In order to correct for the scale difference, we need to apply the transformation (2.6) on the neural network prediction first, with the same S pred that was used for scaling the targets. We choose the mean squared error to measure the network's performance because it is sensitive to outliers in the target distribution. This is useful because even though we have taken measures to reduce the span of the targets, the distribution still contains a tail towards larger values which correspond to soft and collinear configurations. These points have large contributions to the cross-section when integrating over phase-space, meaning it is important that we accurately predict these points. It is also convenient that the mean squared error tends to learn the mean of the target distribution which, in our case, corresponds to the cross-section. In addition to the MSE, we introduce a regularisation term to penalise non-sparse representations of the matrix element. We know that in soft and collinear limits there will be dominant dipoles that have large contributions to the matrix element and there will be other dipoles with minimal contribution. We try to suppress the coefficients associated with these minimally contributing dipoles with the penalty term where the sum over i replaces the sum over {ijk} for brevity, and j D −2 j is the sum over all dipoles for a phase-space point, acting as a normalisation factor. J is a tunable parameter that scales the importance of L pen versus L MSE . We found that models perform the best when L pen < L MSE , so J is tuned accordingly. It is possible for the product C i D i to be large due to D i alone, so to penalise this we regularise the product rather than just the coefficient, since it is the product that contributes to the matrix element.
The form of L pen is reminiscent of the usual L1 regularisation which promotes sparse models. Regularisation is usually included to prevent overfitting by making the model make decisions on the most important features, reducing other features to zero. In our case we would like the neural network to learn about the universal factorisation property in QCD by making it choose a minimal amount of dipoles to represent the matrix element in singular regions. In addition to preventing overfitting, (2.8) helps the neural network to extrapolate to more soft and collinear regions as it has learnt to choose which dipoles are relevant in specific configurations. For non-singular configurations, the neural network is free to interpolate as there is not a clear set of dipoles that dominate, meaning L pen is small.
Combining the regression loss term and the regularisation term, our expression for the total loss is which is minimised through mini-batch gradient descent [49] with the Adam optimiser [50] to find optimal parameters θ for the neural network.
Ensemble of models Due to the random initialisation of weights in the neural network, and the fact that the optimisation procedure is carried out on mini-batches of the full training dataset, every neural network trained will be similar but non-identical, even with identical model architecture. This is partly because the loss surface is unlikely to be a completely smooth surface with a single global minimum, instead it is likely to contain multiple local minima, meaning it would be optimistic to believe that a single neural network is able to find the most optimal set of parameters. Given that we don't expect a single network to perform optimally, we train a number of models and aggregate their predictions to create an ensemble of models. Ensembling models is a well-known technique within machine learning, for a review see [51]. Each model in the ensemble is initialised with different weights according to the 'Glorot uniform' distribution, and trained on the same but randomly shuffled dataset, resulting in models that have been exposed to different distributions of the training data. After sufficient training, each model will have a distinct set of parameters that have similar predictive power. The prediction for the matrix element is then the mean of the outputs of all models in the ensemble 2 . Taking the mean will give a more accurate and robust prediction as averaging over the different models will reduce variance due to over/underfitting in the training phase. We choose to have 20 models in our ensembles because we begin to see diminishing returns in the accuracy of per-point predictions after this. Another advantage of training an ensemble of models is that we have a measure of uncertainty, due to the neural networks, on the model predictions by calculating the standard error of the mean. That is we take the standard deviation of predictions across the ensemble and divide by the square root of number of models in the ensemble. This would not be possible with just a single model.
By choosing to build an ensemble, there is a performance impact because we have to spend more resources on training, and inference is slower due to having to predict on all models in the ensemble. However, we believe that having a more robust prediction with a measure of uncertainty outweigh these negatives. Additionally, each model only needs to be trained once, and slowdown during inference is alleviated with GPUs.

Results
In this section we present results for our matrix element emulator for e + e − annihilation into up to 5-jets. We first compare results obtained with our method to the tree-level results of Ref. [27], then proceed to exhibit results from larger NNs that have been exposed to larger training datasets to demonstrate the full scaling performance of our model. We then further demonstrate our method's capability to generalise to unseen regions of phase-space by assessing the prediction accuracy on random phase-space trajectories that venture well outside of the phase-space region used for the training.

Comparison with previous work
We compare our method to methods for tree-level matrix elements emulation from Ref. [27]. The authors presented two methods: 'single' and 'ensemble' models. A 'single' models indicates that there is one neural network trained across the entire phase-space, and an 'ensemble' model indicates that there is a group of neural networks trained together with weighting functions that focus individual networks on a specific divergent region of phasespace. In order to conduct a fair comparison, we have made attempts to follow their training methodology closely by constructing the neural networks at the centre of our model with a similar structure (e.g. same hidden layer structure, same activation functions, same random initial weight distribution), training methodology (same EarlyStopping criteria, same initial learning rate), and have generated training and testing datasets using code available on the authors' project repository with the relevant cuts.
We compare the distribution of errors for the matrix elements predictions on 3 million phase-space points. The distribution of errors is crucial because it informs us of the performance of our emulators at the matrix element level. It is well known that a neural network optimised with the mean squared error loss function has tendencies to learn the mean of the target distribution [52], meaning the quality of the cross-section prediction can potentially belie the point-to-point accuracy of emulators. In Figure 2, we plot the distribution of errors for matrix element predictions where our method is labelled 'Dipole NN'. We compare against the 'single' and 'ensemble' methods by training and testing using the corresponding datasets. For more detailed information on the differences please refer to [27]. Note that the height of the peak for the dipole histograms are not illustrated in the figure as it would not fit on the current axes but that is not important for this comparison.  Error distribution compared to Figure 3 in Ref. [27], where data to reproduce the histograms were provided by the authors. We plot the log ratio of the matrix element as predicted by the neural network ensemble and the value from NJet on the main axes for comparison. The blue and orange dipole histograms representing our method are cut off at the top on the main axes, but the most important feature is the narrowness of the peak centred around the ideal value. The insets show the detailed distribution of our result on a linear scale.
We can see that the prediction-to-truth ratio distribution for our method is much narrower and consistently peaked around the ideal accuracy, indicating our model performs better on a per-point basis for all multiplicities. Even with this reduced NN size we can see that incorporating the known divergent structure explicitly in the model gives better results, as it uses the NN representation to learn a function that is more suitably approximated by a NN. For example, even though the three jet matrix element has a fairly trivial analytical structure, a standard fitting approach using a NN typically struggles to reproduce divergences. In our approach the NN only needs to emulate a non-singular modulation on top of the main divergent behaviour and is therefore more suited to the task.

Main results
Here we present our main results which are obtained using the larger NNs described in Section 2.4.2 along with larger training datasets described in Section 2.3. In Figure 3, we  show the error distributions on 10 million matrix element predictions for each multiplicity and global phase-space cut.
Predicting on a large number of phase-space points allows us to explore singular regions with higher statistics. The ratios (right) clearly highlight the symmetry of the errors with Gaussian-like distributions tightly centred around 0. We have also included the absolute percentage difference distribution for more easily interpretable errors where the bulk of predictions are below the 0.1% error level. With increasing multiplicity, the fitting gets more challenging due to the rise in the number of singular regions in phase-space and the dimensionality of the phase-space, which can be seen in the decrease in accuracy as we increase multiplicity. Although the errors do increase, practically all matrix element predictions are below the 1% error level even for the most challenging scenario. Relaxing the global phase-space cut for the training and testing set is also expected to decrease model performance as allowing more singular regions of phase-space stretches the span of the target distribution making it difficult to fit with a single neural network. Our method manages to retain good performance while global phase-space cuts have been relaxed by a factor of 100 with only a small decrease in accuracy as illustrated in the left column of Figure 3. This is because most of the span is accounted for by the dipole factors, while the coefficients themselves vary less. In the 3-jet case there is negligible difference between the different phase-space cuts while the 4 and 5 jet cases see less than a factor of 10 difference in the peaks of the absolute percentage difference distributions going from y cut = 0.01 to y cut = 0.0001.
By increasing the size of the training datasets we aim to expose the neural network to more samples of the phase-space, thereby increasing accuracy on predictions on as much of the phase-space as possible. Along with increasing the number of training samples, the neural network architecture has been expanded to include more hidden nodes and hidden layers. The extra hidden nodes and layers introduces more parameters into the model allowing the neural network to utilise the additional data. We found that to get good performance, we had to balance the size of the training dataset used and the size of the network. i.e. a small network is not expected to capture all the variations in a large dataset as easily as a larger network would, due to the smaller number of parameters available to the network. Of course, we also had to consider more physical constraints such as the time spent on training the neural networks which limits both the size of training datasets and architecture. In Figure 4, we show the improvements in accuracy of our main NNs compared to the smaller NNs from Section 3.1. Although the training and testing data for the smaller NNs are not identical due to differences in code used to generate the sets 3 , for this comparison it suffices to show that the larger NNs are orders of magnitude more accurate than the smaller NNs. Improvements in per-point accuracy translate to improved total cross-section predictions. In Figure 5 we show the percentage differences of the NN cross-section predicted compared to those from NJet. There is a similar trend of errors increasing with increasing multiplicity and more inclusive phase-space cuts. All total cross-section predictions are well below 0.1% error. There is a small systematic offset of the neural network cross-section compared to the NJet cross-section that becomes apparent under closer inspection. This was discussed in Ref. [27] and we provide an additional explanation in Appendix B.
We also show in Figure 5 the estimated statistical Monte Carlo integration relative error for comparison. The fact that the accumulated error on the matrix element is much lower than the MC error on the cross-section opens up the possibility to use the knowledge 3 although RAMBO is used for phase-space generation in both works, the selection criteria is different (JADE algorithm vs kt algorithm)  the network has gathered on the matrix element to augment the dataset to reduce the statistical error. Using such an augmentation technique would introduce a new systematic error on the prediction related to the accumulated network interpolation/extrapolation error, which would have to be balanced with the reduction in the MC integration error. Figure 5 suggests that the dataset could be augmented in such a way by a large factor before reaching a minimal overall uncertainty. This opportunity might not seem very useful for this particular example of leading order matrix elements where evaluations are relatively cheap computationally, but if a similar degree of accuracy in the emulation can be obtained for higher order matrix elements, this procedure could reduce the resource cost of matrix element calculation significantly. We defer the study of this augmentation method to future work.
Since we retain good performance by relaxing the global phase-space cut, we carry out a simple test of generalisability by using the 5-jet y cut = 0.0001 model to infer on the two datasets with harsher cuts. This is shown in Figure 6 where we see that accuracy is comparable to the reference (blue) in both cases. In the case of y cut = 0.01 (left), the model trained with more of the phase-space reduces errors in the right-hand tail of the  distribution. This proves that enlarging the training phase-space can be done without having a large detrimental effect on the overall accuracy, and can significantly reduce the number of large prediction errors.

Random trajectories
As another test of our emulator we assess the accuracy of our predictions on random phasespace trajectories. These random phase-space trajectories are generated by connecting two random points in phase-space continuously without excluding any region of phase-space. This presents an interesting and challenging test of the interpolation and extrapolation abilities of the NNs as some parts of the trajectories may lie outside of the phase-space region of the training datasets. We show the results for 5-jet trajectories as this is the highest multiplicity we considered, predictions on lower multiplicities are better-behaved. We investigated 50 different random trajectories 4 . For the discussion in this article we chose one that contains many interesting features, namely the matrix elements span many orders of magnitudes and there are distinct peaks in the trajectory. In Figure 7 we show the predictions by the three 5-jet models trained on data with different global phase-space cuts for this trajectory. Left column shows the actual matrix element prediction and right column shows the ratios of the prediction to NJet. We analysed the predictions for the 50 random trajectories and measured the fraction of their length where the accuracy falls within given intervals. Table 2 shows the result for the regions of phase-space where training data was available, and those falling beyond the data available to the model. NN predictions are depicted as coloured scatter plots where the colour indicates the value of the minimum s ij between any pair of final-state particles at that phase-space point. To more easily visualise the extrapolation performance of the NNs we highlight the regions where the minimum s ij goes below the global phase-space cut applied to the training set the models were trained on, for each cut made. The regions of the plots where this occurs have been coloured in red. With these trajectories being completely randomly selected in phase-space there is a possibility for there to be doubly singular points or worse. To check for this we used FastJet to cluster the phase-space points in the same way we did for data generation, see Section 2.3. The pink regions indicate points which have two separate single unresolved limits, we label this configuration 'Single+single'. The purple regions indicate points which have a double unresolved limit (i.e. three particles in one jet), we label this configuration 'Double'. The blue regions indicate points which have both a double unresolved limit and a separate single unresolved limit, we label this configuration 'Double+single'. We do not include the quark-anti-quark invariant in defining these regions as there is no associated infrared divergence. The pink, purple, and blue bands indicates regions of points which would have been discarded for our training and testing datasets.
Accuracy is high when the minimum s ij of the trajectory is not below any y cut , i.e. when the NN prediction curve is blue. This is demonstrating that the NNs are interpolating well. Performance generally declines in the red, pink, purple, and blue regions, which is not unexpected as the NNs are extrapolating. Given that this trajectory has regions which go 4  more collinear than any points the networks have been exposed to before, we would expect the networks which have been trained with the smallest y cut parameter to perform best. We see that this is the case as accuracy is acceptable in the y cut = 0.0001 models, including in the regions where the NN is extrapolating.
In summary, we have shown that the neural networks show acceptable performance on random phase-space trajectories which are of different nature to the datasets used to train and test the networks. Given that the general performance of the NNs of all three phase-space cuts are similar, it would make sense to use the models trained with the most inclusive phase-space cuts as it has been exposed to more of the complete phase-space.  Table 2. The performance of trajectory predictions separated for white, pink, purple, blue, and red regions. We present the percentage of points that lie outside 0.1%/1.0%/5.0% errors. Fraction of points indicates the percentage of points that lie in the region of interest, out of all phase-space points from the 50 trajectories we examine.

Conclusion
In this article we presented a new strategy to emulate matrix elements using a neural network. By leveraging the knowledge of the factorisation properties of the matrix elements our model is able to extrapolate well outside of its training range. We showed that using this method we obtain significantly improved per-point accuracy than obtained in previous works. We also showed that the per-point accuracy of the model is not significantly affected by the generation cut for the training, which means that it would be possible to train our emulator on very inclusive cuts, allowing them to be applied in a multitude of settings.
The accuracy of the emulation could allow users to augment the training dataset to reduce the MC error of cross-sections or distributions while using fewer computing resources compared to the original calculation. We leave the investigation of this aspect to further work.
Our method was demonstrated in this article using a tree-level process, but it could be generalised to higher order matrix elements by adapting the set of ingredients made available to the network for the interpolation. Specifically, we can include extra terms into our ansatz to help capture additional divergences at higher orders. We look forward to applying this method to such cases.

C Phase-space trajectories
We can see the RAMBO algorithm as a map from the unit hypercube in some high dimension into the n-particle phase-space. A flat distribution in the hypercube maps to a flat distribution in the multi-particle phase-space. To generate our phase-space trajectories we pick two points at random in the unit hypercube and map the line between them using the RAMBO mapping. As a result, each point on the resulting trajectory has equal probability density in phase-space. Any other phase-space generator that smoothly maps the unit hypercube to a multi-particle phase-space could replace RAMBO in this procedure and could lead to trajectories with very different characteristics. One could for example imagine a sophisticated algorithm that only maps points close to the boundary of the hypercube to soft or collinear configurations. Using such an algorithm would have trajectories that avoid configurations with many particles more collinear or soft than the end-points of the trajectories. We find that with RAMBO the trajectories tend not to avoid difficult phase-space configurations and are therefore a good test of the extrapolation properties of our method. Figure 8 shows the trajectory we chose in Section 3.3.