DeepXS: Fast approximation of MSSM electroweak cross sections at NLO

We present a deep and active learning solution for the fast prediction of particle production cross sections over a complicated, high-dimensional parameter space. As a proof of concept, we provide state-of-the-art predictions for the production of charginos and neutralinos at the Large Hadron Collider (LHC) at the next-to-leading order in the phenomenological MSSM-19, explicitly demonstrating the performance for $pp\to\tilde{\chi}^+_1\tilde{\chi}^-_1,$ $\tilde{\chi}^0_2\tilde{\chi}^0_2$ and $\tilde{\chi}^0_2\tilde{\chi}^\pm_1$; the approach will be extended to all SUSY electroweak pairs. We obtain mean absolute percentage errors well below $0.5\,\%$, lower than the uncertainty from scale and parton distribution function variations, allowing safe inference at the next-to-leading order. The inference time of $\mathcal{O}(\mu\rm{s})$ per evaluation improves on the Monte Carlo integration procedures available so far, $\mathcal{O}(\rm{min})$, by a factor of $\mathcal{O}(10^7)$.


Introduction
Dimensionality remains a curse for everyone who seeks the needle in a complex haystack. Despite all the achievements of data science so far, physicists often resort to simplified, lower-dimensional models to obtain a tractable problem [4,6]. This strategy prevents the scientific community from utilising all the available information to pin down the laws of nature. To overcome this very general issue, we investigate deep learning techniques as a potential solution, motivated by the successful application of neural networks to cross sections with a four-dimensional parameter space [14]. We find positive results for cross sections that depend on a 19-dimensional parameter space with highly complex structures.
One of the most widely studied beyond the Standard Model (BSM) theories remains supersymmetry (SUSY) [20,25,31,32]. The ever increasing sophistication of experimental analyses requires that theoretical tools match the precision requirements set by experiments. One of the key requirements is performing cross section calculations of BSM processes at least at next-to-leading-order (NLO) accuracy. This goal has been gradually reached over many years, for particles produced both by strong and by weak interactions, and the current state-of-the-art calculations also include resummed higher-order corrections [11,22]. Currently, for most applications it is possible and sufficient to calculate the production cross section in the next-to-leading-log approximation. However, such calculations are typically time consuming; e.g. it takes about three minutes for the computer program Prospino [10] to calculate the chargino pair production cross section, $pp \to \tilde{\chi}^+_1\tilde{\chi}^-_1$, at NLO. The computational time for Resummino [23] is similar at NLO, but taking higher-order corrections into account increases the time consumption 20-fold.
Many applications, for example global scans of the multidimensional parameter space of the Minimal Supersymmetric Standard Model (MSSM), see e.g. [7-9,15,28,39], demand a much faster method for the computation of NLO cross sections. In the case of strongly produced SUSY particles, this problem is addressed by the computer program NNLLfast [12], which offers an approximation of the relevant cross sections within a fraction of a second at next-to-next-to-leading logarithmic accuracy.
In this paper we present a novel approach that enables a fast approximation of cross sections in a high-dimensional parameter space and, as an example, demonstrate its applicability for chargino and neutralino production cross sections at NLO accuracy in the phenomenological MSSM-19 (pMSSM-19), where 19 denotes the number of model parameters. Charginos and neutralinos are the supersymmetric partners of the Standard Model gauge and Higgs bosons; in the following we use the umbrella term "electroweakinos". We employ a machine learning technique to approximate the cross sections calculated using Prospino. While the task might appear straightforward, several challenges have to be solved to obtain a tool that provides both speed and high accuracy. Firstly, the cross sections span up to 13 orders of magnitude, depending on the electroweakino masses and couplings. Secondly, the electroweakino sector is parametrized by four independent parameters in the SUSY Lagrangian and, in addition, the cross sections depend on the other SUSY particles, either at tree level (squarks) or at loop level (gluinos).
At the parton level, chargino and neutralino production occurs via s-channel exchange of gauge bosons and t-channel exchange of squarks. In the case of chargino pair production, this includes γ and Z exchange in the s-channel and left-handed (doublet) squark exchange in the t-channel. For neutralino pair production we have contributions from Z boson exchange and from both left- and right-handed squarks (at the loop level, photon exchange in the s-channel is also possible; however, this electroweak correction is not included in Prospino). Finally, the associated production of a chargino and a neutralino occurs via exchange of the W boson in the s-channel and left-handed squark exchange in the t-channel. At the loop level, when one considers SUSY-QCD contributions (i.e. the first order in the strong coupling $\alpha_s$), contributions involving gluinos appear. In Fig. 1 we show sample diagrams at the Born and loop level. For more details see Ref. [10]. In the final step, to calculate the production cross section in proton-proton collisions, the partonic cross section has to be convoluted with a parton distribution function (PDF), which parametrizes the proton in terms of its constituents, quarks and gluons. Thus the final result cannot be given in analytical form.
For the actual calculation of the cross section one needs to specify the final-state particles and their physical masses, the masses of the virtual particles (squarks and gluinos), and the mixing angles in the chargino and neutralino sectors. The chargino mixing is parametrized by two 2 × 2 unitary matrices U and V, while the neutralino mixing is parametrized by a 4 × 4 unitary matrix N. For each process only one specific row of the matrices is required, corresponding to the respective final-state particle. We summarize the number of required parameters for each process in Table 1. Note that in the following we do not explicitly impose the unitarity condition, which gives us the flexibility to extend the approach to other SUSY models beyond the MSSM. Thus, we need to construct representations for complicated functions whose effective parameter space can have up to 13 dimensions. A temperature plot showing the non-trivial K-factor landscape in only two of these dimensions is shown in Fig. 2 for $\tilde{\chi}^0_2\tilde{\chi}^+_1$. Here, we focus on the four most relevant processes at the LHC, i.e. the production of chargino pairs, $\tilde{\chi}^+_1\tilde{\chi}^-_1$, neutralino pairs, $\tilde{\chi}^0_2\tilde{\chi}^0_2$, and the associated production of a chargino and a neutralino, $\tilde{\chi}^0_2\tilde{\chi}^\pm_1$. The approach presented here can be extended to other electroweak processes and models, e.g. next-to-MSSM scenarios [21].

Methodology
In order to develop a code that can predict values of an otherwise computationally expensive function fast and reliably (the cross section in our case), we take the following approach. First we calculate values of the function at a large number of points: $10^7$ samples at the leading order (LO) and $\mathcal{O}(10^5)$ samples at NLO. The points are sampled randomly in a high-dimensional parameter space within given ranges. The data is then used to train a customised artificial neural network (ANN) that adopts deep learning techniques, stacking and an iterative ANN-based point selection procedure which picks points from a labeled pool of samples. The properly trained model is then able to provide accurate predictions of the cross section at a given parameter point. The performance of the resulting ANN is tested with $10^4$ samples that the deep network has never seen during training.

Data generation
The pMSSM-19 parameters are sampled with a flat prior within the ranges given in Table 2, see also Ref. [2]. Since sleptons do not affect the actual calculation of the cross section at any stage, they are assumed to be mass degenerate between left- and right-handed states for the first and second generations. These parameter sets are then passed to SPheno 3.3.8 [36,37] to calculate the spectrum with default settings. For further processing we accept the points which have: no tachyonic degrees of freedom; the lightest neutralino as the LSP; the first two generations of squarks heavier than 500 GeV, cf. [1]; the chargino $\tilde{\chi}^\pm_1$ heavier than 100 GeV, cf. [40]. They are then fed into Prospino 2.1, which calculates the cross section using CTEQ6 parton distribution functions (PDFs) [33,38]. Note that even though the scan is performed in terms of the soft SUSY breaking parameters, the actual input for the cross section calculations is defined via physical masses and mixing angles. Thus, the relevant masses and mixing angles from the spectrum, together with the corresponding LO cross section and/or K-factor, are systematically collected so that they can be used as training and validation data to optimise an ANN implementation. For all LO cross sections, we have created $10^7$ samples. For the K-factors, the number of generated samples varies between $1$ and $6 \times 10^5$, for reasons explained in the following. The NLO cross section can be written as a product of the K-factor and the LO cross section:

$\sigma_{\rm NLO} = K \cdot \sigma_{\rm LO}.$

Since most difficulties in the structure already appear at the leading order and the K-factor is a slowly varying function of the input parameters, we construct the NLO prediction by multiplying the predictions of the LO and K-factor regressors. This significantly decreases the computational cost by reducing the amount of necessary NLO data by two orders of magnitude.
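The factorised prediction above can be sketched in code. This is a minimal illustration, not the DeepXS implementation: both regressors are hypothetical stand-in functions (the real ones are trained Keras networks); only the composition $\sigma_{\rm NLO} = K \cdot \sigma_{\rm LO}$ mirrors the text.

```python
import numpy as np

# Hypothetical stand-ins for the two trained regressors; the real ones
# map masses and mixing-matrix entries to sigma_LO and K.
def predict_lo(x):          # toy leading-order cross section in pb
    return np.exp(-0.5 * np.sum(x**2, axis=1))

def predict_k_factor(x):    # toy slowly varying K-factor
    return 1.25 + 0.01 * x[:, 0]

def predict_nlo(x):
    """sigma_NLO = K * sigma_LO: combine the two cheap regressors."""
    return predict_k_factor(x) * predict_lo(x)

x = np.zeros((1, 4))
sigma_lo = predict_lo(x)
sigma_nlo = predict_nlo(x)
```

The point of the split is that only the cheap LO regressor needs $10^7$ labels, while the K-factor regressor gets by with two orders of magnitude fewer NLO samples.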
Because the starting point of the data generation is the pMSSM-19 parameter space, we must cover it appropriately and are therefore confronted with the curse of dimensionality. To tackle this, we manually restrict the parameter space by excluding all cross sections that are not relevant. We exploit the fact that the number of events N at the LHC is equal to the product of the integrated luminosity and the cross section,

$N = L_{\rm int} \cdot \sigma,$

and, assuming the final integrated luminosity of the LHC to be $L_{\rm int} = 3000\,{\rm fb}^{-1}$, we derive a lower bound for the cross section by demanding at least one event in the lifetime of the LHC. The resulting lower bound is $\sigma_{\rm min} = 3.3 \cdot 10^{-7}$ pb. The generated data is then processed by a deep learning pipeline that utilises the ANN-based point selection (NNPS) and stacking, i.e. a manually implemented logical connection of different, eventually specialised, predictors for the same task. Since we assume that most use cases will run a spectrum calculator anyway and SPheno barely consumes computational capacity, we only create a neural network representation for the mapping from the masses and mixing angles to the LO cross section and K-factor.
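The quoted lower bound follows from simple unit arithmetic; a minimal check:

```python
# Reproduce the sigma_min bound from N = L_int * sigma: demand at least one
# event at the assumed full integrated luminosity L_int = 3000 fb^-1.
L_INT_FB = 3000.0          # integrated luminosity in fb^-1
N_MIN = 1.0                # at least one event over the LHC lifetime
FB_TO_PB = 1e-3            # 1 fb = 10^-3 pb

sigma_min_pb = (N_MIN / L_INT_FB) * FB_TO_PB
print(f"{sigma_min_pb:.1e} pb")   # -> 3.3e-07 pb
```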

Optimising the representations
The deep learning techniques used here are employed via artificial neural networks implemented with Keras [19] and a TensorFlow [3] backend, trained on a GPU using CUDA [34] and cuDNN [18]. The pre- and post-processing of the input data, together with the neural network architecture and the machine learning model parameters, form the technical realization of the deep learning representation of the function σ = σ(pMSSM-19).
The input of the neural networks is taken from the SPheno output and consists of the electroweakino and squark masses for the LO cross sections, plus the gluino mass for the K-factor, as well as the relevant chargino and neutralino mixing matrix entries. It is preprocessed via the z-score normalisation: the inputs $x_i$ are transformed into $x_i' = (x_i - \mu(x))/\sigma_{\rm sd}(x)$, where $\mu(x)$ and $\sigma_{\rm sd}(x)$ are the mean and standard deviation of $x$. Whenever deemed useful, expert knowledge was applied and high-level features were formed; e.g. for the K-factor prediction, the mean of the squark masses was used, which corresponds to the calculation method employed in Prospino.
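A minimal sketch of the z-score step. It assumes only that the training-set mean and standard deviation are reused at inference time, which is standard practice for this normalisation:

```python
import numpy as np

def zscore(x, mu=None, sd=None):
    """z-score normalisation x' = (x - mu) / sigma_sd, feature-wise.

    mu and sd are computed on the training set and passed back in
    unchanged when normalising validation or inference inputs."""
    mu = x.mean(axis=0) if mu is None else mu
    sd = x.std(axis=0) if sd is None else sd
    return (x - mu) / sd, mu, sd

# toy feature matrix: rows = samples, columns = masses / mixing entries
x = np.array([[100.0, 0.9], [300.0, 0.1], [500.0, 0.5]])
xn, mu, sd = zscore(x)
# each column of xn now has mean 0 and standard deviation 1
```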
An ANN is a collection of artificial neurons along whose connections an input is propagated. During the propagation the input is transformed depending on the network architecture and the machine learning model parameter set θ characterising the ANN. The output is an estimate of the function value for the given input parameters. The set θ is initially drawn from a random distribution and learned via updates from a stochastic gradient descent-like optimisation algorithm that minimises a loss function measuring the deviation between model predictions and true (known) cross sections. In our case, the loss function is the mean absolute percentage error (MAPE). The chosen optimiser is ADAM [29] with default parameters, except for a learning rate schedule with initial and final learning rates $\alpha_i$ and $\alpha_f$, combined with EarlyStopping [13]. When an iteration has ended, i.e. when either the pre-defined maximum number of epochs is reached or EarlyStopping terminates the process, the learning rate is divided by 2, the weights giving the best validation loss so far are loaded into the architecture and the optimisation continues until $\alpha_f$ is reached. Due to the high computational cost of hyperparameter scans, i.e. many hours to days for a single hyperparameter point, and the fact that the hyperparameter space is high-dimensional with a mixture of integer and continuous dimensions, we choose a heuristic approach to determine the hyperparameters. For the different electroweakino production processes we therefore adopt different techniques to achieve MAPEs below 0.5 % and maximum errors below 10 %.
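The loss and the halving learning-rate schedule described above can be sketched as follows. The NumPy MAPE stands in for the Keras loss, and the final learning rate used here ($\alpha_f = 10^{-4}$) is an assumed value, since the text only quotes $\alpha_i$:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, the training loss used in the text."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def lr_schedule(alpha_i, alpha_f):
    """Yield the halving sequence alpha_i, alpha_i/2, ... down to alpha_f.

    In training, each rate is used for one iteration (max epochs or
    EarlyStopping), after which the best weights so far are restored."""
    alpha = alpha_i
    while alpha >= alpha_f:
        yield alpha
        alpha /= 2.0

# alpha_i = 8e-4 as quoted in the text; alpha_f = 1e-4 is assumed
rates = list(lr_schedule(8e-4, 1e-4))
```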
The $\sigma_{\rm LO}$ input is propagated through eight hidden layers with 100 neurons each and the selu [30] activation function, while for the K-factors only 32 neurons per layer are used. Note that the neural capacity of this network is low compared to state-of-the-art deep learning architectures [27]. An even lower capacity also delivered reasonable predictions, but the drawbacks were that (a) training took much longer until the best performance was reached and (b) the best performance itself was worse. We chose the architecture following the suggestions for self-normalizing neural networks [30], which were specifically developed to obtain state-of-the-art neural network models for regression and classification problems. One of their big advantages is that the self-normalisation keeps the gradients in the deeper layers at the same order of magnitude as in the first layers, which enables the ML model to learn more abstract features. The inputs are labeled with the corresponding cross section, in most cases pre-processed with a shifted logarithm, or, for the K-factor, divided by 2 or 4, depending on the pair. The loss function is the MAPE for the K-factors and a modification thereof for the LO cross sections that takes the pre-processing into account. In the default setup, the MAPE would minimise the error on $\log(\sigma)$, resulting in sub-optimal performance. Therefore, several custom loss functions have been implemented that are constructed such that the loss explicitly minimises the MAPE of the original values. The final data for the LO cross sections has been trained for 150 epochs and 7-10 iterations with $\alpha_i = 0.0008$, a patience of 50 and a batch size of 120. For $\tilde{\chi}^0_2\tilde{\chi}^\pm_1$ we extend our setup to include NNPS because of a recurring problem of outliers with large errors in problematic, often underpopulated, regions of the parameter space, even with $10^7$ training samples.
NNPS allowed us to achieve much better performance with only a fraction of the random samples. The NNPS setup was as follows: the initial training begins with $10^6$ ($\tilde{\chi}^0_2\tilde{\chi}^+_1$) or $1.5 \cdot 10^6$ ($\tilde{\chi}^0_2\tilde{\chi}^-_1$) samples and runs for a short amount of time, namely 40 epochs, 5 iterations and a batch size of 1000. The resulting neural network is then evaluated on the pool of the remaining $9 \cdot 10^6$ samples. The $10^5$ or $1.5 \cdot 10^5$ samples for which the neural network performed worst are iteratively added to the training set. This is done 10 and 15 times, respectively. The actively sampled training set is then used for a more thorough training, identical to the procedure used for $\tilde{\chi}^+_1\tilde{\chi}^-_1$ and $\tilde{\chi}^0_2\tilde{\chi}^0_2$. The evaluation of these networks is then investigated, and in both cases the performance is further enhanced by training additional neural networks that are specialised on a fraction of the target value range. For a specialised network covering target values in the range 0.001 to 0.2 pb for $\tilde{\chi}^0_2\tilde{\chi}^-_1$, we also z-score transformed the target values without taking the logarithm. The general and specialised networks are then stacked together by using the general network to predict whether a point is predicted better by the specialised network: if so, the prediction of the specialised network is returned; if not, the general network returns its prediction. The $\tilde{\chi}^0_2\tilde{\chi}^+_1$ K-factors have been treated similarly, while for the $\tilde{\chi}^0_2\tilde{\chi}^-_1$ K-factors we only used one neural network trained on random samples.
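The core of one NNPS iteration, selecting the worst-predicted pool points, can be sketched as below. The model and the labelled pool are toy stand-ins; only the selection logic mirrors the procedure described in the text:

```python
import numpy as np

def select_worst(model, pool_x, pool_y, n_add):
    """One NNPS step: score the labelled pool with the current model and
    return the indices of the n_add points with the largest absolute
    percentage error. These are moved into the training set."""
    err = np.abs((pool_y - model(pool_x)) / pool_y)
    return np.argsort(err)[-n_add:]

# toy setup: a deliberately bad constant "model" and a labelled pool
rng = np.random.default_rng(0)
pool_x = rng.normal(size=(1000, 3))
pool_y = np.exp(pool_x[:, 0])            # stand-in for sigma_LO labels
model = lambda x: np.ones(len(x))        # hypothetical under-trained network

worst = select_worst(model, pool_x, pool_y, n_add=100)
# in DeepXS these points are added to the training set, the network is
# retrained, and the loop repeats 10 or 15 times depending on the pair
```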

Results
In this section we present the accuracy of the tool, DeepXS, including statistical measures of its performance. We also discuss inference times and the subtleties of its validity. The testing of DeepXS was performed using $10^4$ pMSSM-19 points generated according to the same rules as the training samples. Table 3 shows the performance of DeepXS for cross sections $\sigma_{\rm NLO}$ larger than the threshold $\sigma_{\rm exp} = 6.6 \cdot 10^{-5}$ pb. This threshold corresponds to an integrated luminosity of 150 fb$^{-1}$, the data collected thus far by the LHC, assuming 10 produced events. Therefore, for current applications this threshold provides a very conservative estimate of the observable electroweakino production. The entries for 1σ, 2σ and 3σ denote the maximum error for 68.27 %, 95.45 % and 99.73 % of the samples. We use the intervals as defined for the normal distribution, motivated by the shape of the error distribution; we note, however, that it has fatter tails. With all MAPEs well below 0.5 % and a maximum 3σ-band error of 5.773 %, the error of the cross sections is clearly sub-dominant relative to the scale and PDF uncertainty for the large majority of the presented cases. Figure 3 demonstrates a large density of points around an error of $10^{-4}$-$10^{-2}$, which matches the precision of the Vegas integration, $5 \cdot 10^{-3}$, typically reported by Prospino. For $\tilde{\chi}^+_1\tilde{\chi}^-_1$, the maximum error on the $10^4$ test samples is ≈ 3 %, while for the other pairs it is $\mathcal{O}(10\,\%)$. We note that an uncertainty of this size is otherwise expected to arise from PDF and scale variation, which starts at 3-4 % for high cross sections and rises to $\mathcal{O}(10\,\%)$ for high masses. The largest error is observed for two samples with an error of ≈ 10 % for $\tilde{\chi}^0_2\tilde{\chi}^0_2$ and $\tilde{\chi}^0_2\tilde{\chi}^+_1$. The case of $\tilde{\chi}^0_2\tilde{\chi}^0_2$ will be improved with NNPS in the upcoming version of the tool.
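The 1σ/2σ/3σ entries described above are percentiles of the absolute-percentage-error distribution; a sketch with synthetic errors (the scatter level is illustrative only, not the DeepXS result):

```python
import numpy as np

def error_bands(y_true, y_pred):
    """MAPE plus the maximum error of the best 68.27/95.45/99.73 % of
    samples, i.e. the 1/2/3-sigma entries of Table 3."""
    ape = 100.0 * np.abs((y_true - y_pred) / y_true)
    bands = np.percentile(ape, [68.27, 95.45, 99.73])
    return ape.mean(), bands

rng = np.random.default_rng(1)
y_true = np.exp(rng.normal(size=10_000))
y_pred = y_true * (1 + 0.003 * rng.normal(size=10_000))  # ~0.3 % scatter
mape_val, (s1, s2, s3) = error_bands(y_true, y_pred)
# s1 < s2 < s3 by construction; the MAPE is well below 1 % for this scatter
```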
Note that the dimensionality of $\tilde{\chi}^0_2\tilde{\chi}^\pm_1$, d = 11 at LO and d = 12 at NLO, is much higher than for $\tilde{\chi}^0_2\tilde{\chi}^0_2$, with dimensionality 6 and 7, respectively. When training $\tilde{\chi}^0_2\tilde{\chi}^\pm_1$ on $10^7$ random samples, the predictions were much worse than they currently are for $\tilde{\chi}^0_2\tilde{\chi}^0_2$. We thus expect that NNPS will further improve $\tilde{\chi}^0_2\tilde{\chi}^0_2$ to the same level of precision we have achieved for the other pairs. That the overall accuracy for $\tilde{\chi}^0_2\tilde{\chi}^-_1$ is better than for $\tilde{\chi}^0_2\tilde{\chi}^+_1$ is due to the more thoroughly performed NNPS, which will thus be the standard for future work.
Below the threshold $\sigma_{\rm exp}$, our predictions also have a MAPE below 1 %, with a maximum of 0.81 % for $\tilde{\chi}^0_2\tilde{\chi}^0_2$, lower values of ≈ 0.3 % for the mixed pairs and ≈ 0.1 % for chargino pairs. Note, however, that although errors above 10 % are more frequent for $\sigma \le \sigma_{\rm exp}$, the PDF uncertainty is typically also high in the corresponding region of the parameter space. We observe that the neural networks predict the complicated cross section landscapes so well that the plots corresponding to the predictions and the Prospino calculations are indistinguishable by eye. Only the plot showing the relative errors reveals a handful of slight deviations of no more than $\mathcal{O}(10\,\%)$, consistent with Fig. 3. The plots were created using the same $10^4$ points as for Fig. 3.
DeepXS is interfaced to pySLHA [16] and can process SLHA2 files [5]. Additionally, the relevant parameters can be fed in via .csv and .txt files. When providing SLHA files, DeepXS needed 72.3 s to evaluate $10^4$ samples, or 7.23 ms per evaluation of $\tilde{\chi}^+_1\tilde{\chi}^-_1$ at LO and NLO, already making DeepXS $\mathcal{O}(10^4)$ times faster than Prospino. When SLHA files are the input, DeepXS tests whether $\tilde{\chi}^0_1$ is the LSP, whether the light chargino mass is above 100 GeV and whether the squark masses are above 500 GeV; a warning is given if any of these conditions is not fulfilled. When text files with an array are provided, the inference of $10^7$ $\tilde{\chi}^+_1\tilde{\chi}^-_1$ predictions at both LO and NLO took 261.51 s on an Intel i7-4790K CPU, or ≈ 26 μs per evaluation, making it ≈ 6.9 million times faster than Prospino. When predicting mixed pairs, each evaluation takes slightly longer due to the stacking and the necessity to infer from more than two neural networks. In all cases, warnings are given when the predicted cross section is lower than $\sigma_{\rm exp}$.
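The quoted per-evaluation time and speed-up factor can be checked from the numbers in the text, assuming the ≈ 3 min per Prospino run stated in the introduction:

```python
# Back-of-the-envelope check of the quoted speed-up.
total_s = 261.51                 # DeepXS: 10^7 LO+NLO chargino-pair predictions
n_eval = 10**7
prospino_s = 3 * 60              # ~3 min per Prospino NLO evaluation (from text)

per_eval_s = total_s / n_eval    # ~2.6e-5 s, i.e. ~26 microseconds
speedup = prospino_s / per_eval_s  # ~6.9 million
```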

Conclusions
We presented a method that for the first time allows a fast and highly accurate approximation of cross sections that depend on a high-dimensional and complex parameter space. As a first application, we developed a novel tool, DeepXS, that enables a fast approximation of NLO cross sections for pMSSM-19 electroweakinos. Besides the incorporation of expert knowledge, it employs stacked artificial neural networks supplemented by ANN-based point selection techniques to provide fast predictions based on the full NLO calculation using Prospino. Compared to Prospino, DeepXS is more than 4 and up to 7 orders of magnitude faster, while ensuring an accuracy of 1 % for more than 95 % of the test points. Training the neural networks takes $\mathcal{O}(\rm h)$ (≈ 1 h for the K-factors and ≈ 12 h for the leading order). Note that modifications of the underlying physics model do not require retraining from scratch. Instead, one can initialise the new ML model with the older optimum and minimise the loss function for the new case starting from there; in the machine learning literature this is a well-known and studied technique called transfer learning [41]. Should the precision requirements for supersymmetric cross sections at NLO evolve such that we need to eliminate the few remaining outliers with errors of $\mathcal{O}(10\,\%)$, we can do so by creating a larger pool of samples with an even more dedicated neural network point selection procedure. Additionally, we can make use of ensemble techniques [26] to boost the performance and, although computationally very expensive, use Bayesian techniques to optimise the hyperparameters of our neural network models. To enable an uncertainty estimate for individual points, future architectures to regress cross sections should include Monte Carlo Dropout [24]: in each layer a fixed fraction of the neurons is randomly deactivated during training and inference.
This procedure leads to varying predictions for a fixed input, allowing one to obtain a distribution with a mean and a standard deviation of the prediction per point. As shown in [24], this procedure converges towards a Bayesian posterior, enabling a meaningful comparison of the uncertainty of the prediction with the PDF uncertainty. Until this exists, one must rely on the (very conservative) error maps presented in Fig. 3 to estimate an uncertainty. The tool can be found in a GitHub repository [35], including examples that show the NNPS sampling strategy. Further development will include the completion of all electroweakino pairs, extensions of the MSSM, an estimation of scale and PDF uncertainties and a merge with BSM-AI [17].

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Funded by SCOAP3.