$\textsf{Xsec}$: the cross-section evaluation code

The evaluation of higher-order cross-sections is an important component in the search for new physics, both at hadron colliders and elsewhere. For most new physics processes of interest, total cross-sections are known at next-to-leading order (NLO) in the strong coupling $\alpha_s$, and often beyond, via either higher-order terms at fixed powers of $\alpha_s$, or multi-emission resummation. However, the computation of such higher-order cross-sections is prohibitively expensive, precluding efficient evaluation in parameter-space scans beyond two dimensions. Here we describe the software tool $\textsf{xsec}$, which allows for fast evaluation of cross-sections based on the use of machine-learning regression, using distributed Gaussian processes trained on a pre-generated sample of parameter points. This first version of the code provides all NLO Minimal Supersymmetric Standard Model strong-production cross-sections at the LHC, for individual flavour final states, evaluated in a fraction of a second. Moreover, it calculates regression errors, as well as estimates of errors from higher-order contributions, from uncertainties in the parton distribution functions, and from the value of $\alpha_s$. While we focus on a specific phenomenological model of supersymmetry, the method readily generalises to any process where it is possible to generate a sufficient training sample.


Introduction
The determination of cross-sections beyond leading order (LO) is typically very computationally expensive because of the evaluation of tensorial loop integrals. This is especially so for hadronic interactions, where the loop integrals must themselves be numerically integrated over the relevant parton distribution functions (PDFs). The computational cost of evaluation per parameter point restricts the usage of next-to-leading order (NLO) cross-sections to (simplified) new physics models with only one or two relevant parameters. However, higher-order contributions can be very important in many other models. This is especially true for strong interactions, where NLO contributions can be of comparable size to the LO contribution. The physics impact of this restriction is quite dramatic. In Fig. 1, reproduced from Ref. [1], we show as an example the significant differences between the limits resulting from a parameter scan of the Constrained Minimal Supersymmetric Standard Model (CMSSM) using LO rather than NLO cross-sections. This also highlights the importance of propagating theory uncertainties, e.g. from the PDFs and the scale dependence, through to the physical observables. As the uncertainties on LO hard-scattering cross-sections are typically very large, owing in part to the missing higher-order terms, such theory error propagation to the cross-sections can only really be considered representative from NLO accuracy onwards. Even at NLO, we can see in Fig. 1 that the impact is considerable, and needs to be taken into account when fitting models.

Fig. 1 (caption; figure not reproduced here): ... [2]. The solid white line indicates the 95% CL exclusion contour found with GAMBIT 1.0 [3] using LO cross-sections. The solid blue line shows the corresponding ATLAS 95% CL observed exclusion limit with NLO+NLL cross-sections, with dashed blue lines showing the $\pm 1\sigma$ theoretical cross-section uncertainty. Reproduced from Fig. 3 in Ref. [1].
In this work we present xsec, an attempt at a general solution to the speed problem irrespective of the type of cross-section being evaluated, the size of the parameter space of the model, and the underlying precision of the calculation. We achieve this by performing machine-learning regression on a pre-generated training dataset consisting of cross-sections sampled from a model. In this first version of xsec, we will be focusing on strong-production cross-sections in the MSSM, but the selection of cross-sections and models will be extended in future versions.
The increasing sparsity of training data in models with larger numbers of free parameters means that a lot of care must go into performing the regression, and that reliable regression errors can only be determined on a point-by-point basis. We choose to use Gaussian process (GP) regression [4], a highly flexible Bayesian method for modelling functions. It is not restricted to pre-determined functional shapes with a set number of parameters, and is instead directly informed by the data together with some prior understanding of the underlying correlation structure of the function that it is used to model. Moreover, a point-specific regression uncertainty follows naturally from the posterior predictive distribution, which is straightforward to compute in closed form.
Training GP models involves the inversion of a matrix whose dimension is set by the number of training points. This means that while GPs are a very powerful regression tool, single GPs scale very badly with increasing training data, and training becomes infeasible beyond $\mathcal{O}(10^4)$ data points. To overcome this difficulty we use a model with factorised and distributed training, where the data are split into manageable subsets, each assigned to a single GP. The regression prediction is subsequently determined by a combination of the predictions of the individual GPs [5,6].
Current codes that can evaluate a broad collection of MSSM cross-sections at NLO include Prospino 2.1 [7][8][9][10][11][12] and MadGraph 5 [13]. However, the evaluation time per parameter point on a modern CPU is in the range of minutes for a single final state. This does not include the calculation of scale, $\alpha_s$, or PDF errors, which increase the evaluation time by orders of magnitude. As a test, the calculation of all strong-production cross-sections for a single parameter point with Prospino, including the evaluation of scale, $\alpha_s$, and PDF errors, and with Prospino heavily optimised using the ifort compiler, takes around two and a half hours on a single Intel Xeon Gold 6138 (Skylake) core running at 2.0 GHz.
Fast evaluation of strong-production cross-sections in the MSSM at NLO, with next-to-leading (NLL) and next-to-next-to-leading (NNLL) logarithmic resummation, already exists in the form of the NLL-fast and NNLL-fast codes [14][15][16][17][18][19][20][21][22][23][24], which add NLL and NNLL corrections to existing NLO results from Prospino. However, these results are restricted to interpolation in a two-dimensional subspace spanned by the gluino mass and a common squark mass, and results are given as a sum over outgoing squark flavours.
In addition to the processes found in Prospino, cross-sections for gaugino and slepton pair production, as well as gaugino-gluino production, can be evaluated at NLO+NLL precision with the Resummino code [25][26][27][28][29][30][31][32][33]. Here, the total running time for all chargino and neutralino production processes at NLO (again on a single 2.0 GHz Intel Xeon Gold 6138 Skylake core) is around three hours, with no evaluation of scale, $\alpha_s$, or PDF errors. Combined NLO+NLL precision takes four days.
Recently, the DeepXS package [34] appeared, which performs a fast evaluation of neutralino and chargino pair production cross-sections using neural networks. It is currently limited to a small set of the most phenomenologically relevant processes in the MSSM-19, and does not include uncertainties arising from the PDFs or the value of $\alpha_s$.
The xsec 1.0 code is designed to reproduce all the NLO results of Prospino 2.1 for strong production in the MSSM in a small fraction of the time, including scale errors, and with the addition of PDF and $\alpha_s$ errors based on a modern PDF set. It does not include the NNLL or even NLL resummation found in NNLL-fast and NLL-fast; however, unlike those codes, it provides reliable results for non-degenerate squark masses, within the inherent limitations of the training set generated with Prospino. We intend to extend the code to include NLL resummation in future releases.
Unlike NNLL-fast, xsec performs a separate evaluation of the cross-section for all distinct flavour combinations of the first two generations of squarks. The xsec code also treats sbottom and stop pair-production processes separately, but this has limited impact, as the two cross-sections are the same at LO (for the same masses) because Prospino - and therefore our training set - is limited to light-quark initial states. A complete list of available processes can be found in Table 1.
The code is written in Python with heavy use of the NumPy [35] package for numerical calculations, and is compatible with both Python 2 and 3. It can be installed using the pip package manager and run as a set of functions from a self-contained module. We also provide interfaces through SLHA files [36] and a command-line tool.
The rest of this paper is structured as follows. We begin in Sec. 2 with an introduction to the GP regression framework that we employ. In Sec. 3 we describe our GP training regime, including how we compute the required training sets of NLO cross-sections. In Sec. 4, we perform a thorough validation of the results from xsec, with a comparison to results from existing codes. Section 5 covers the structure of the code and its interfaces, and we conclude in Sec. 6.

Gaussian process regression
The basic objective in regression is to estimate an unknown function value f (x) at some new x point, given that we know the function values at some other x points. A Bayesian approach to this task is provided by Gaussian process regression, in which we express our degree of belief about any set of function values as a joint Gaussian pdf. We underline here that this pdf should be understood in a purely Bayesian sense -it does not imply any randomness in the true function we are approximating.
We begin by defining some notation and terminology, indicating in italics the terminology commonly used in the GP and machine learning literature. Each input point $x$ has $m$ components (features), which in our case will correspond to masses and mixing angles from the MSSM squark and gluino sector. We denote by $x_*$ the new input point (test point) for which we will estimate the unknown, true function value $f_* \equiv f(x_*)$, here an NLO production cross-section. We let $x_i$ with $i = 1, \ldots, n$ denote the $n$ input points (training points) at which we know the function values; this collection of inputs and known function values is referred to as the training set. The complete set of input components in our training set can be expressed as an $n \times m$ matrix $X = [x_1, \ldots, x_n]^T$. Similarly, the complete set of known function values can be collected in a vector $f = [f_1, \ldots, f_n]^T$. Thus, our training set can also be expressed as $D = \{X, f\}$.
The starting point for GP regression is the formulation of a joint Gaussian prior pdf² which formally describes our degree of belief for possible function values at both the training points $X$ and the test point $x_*$, before we look at the training data. This prior is chosen indirectly by choosing a mean function $m(\cdot)$ and a covariance function or kernel $k(\cdot,\cdot)$, defined to specify the following expectation values for arbitrary input points:
$$m(x) = \mathrm{E}[f(x)], \qquad (2)$$
$$k(x, x') = \mathrm{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]. \qquad (3)$$
²We always treat the input $x$ points in a dataset $D$ as known. Thus, the pdf $p(D)$ should be understood as the pdf $p(f_1, \ldots, f_n|x_1, \ldots, x_n) = p(f|X)$, and similarly for other pdfs involving $D$.
We note that while the mean function and kernel are defined as functions of inputs in $x$ space, the function values represent mean and covariance values in $f$ space. Our joint Gaussian prior can then be expressed as
$$p(D, f_*|x_*) = \mathcal{N}\!\left(\begin{bmatrix} m(X) \\ m(x_*) \end{bmatrix}, \begin{bmatrix} \Sigma & k(X, x_*) \\ k(x_*, X) & k(x_*, x_*) \end{bmatrix}\right),$$
where $\Sigma \equiv k(X, X)$ is the $n \times n$ matrix of prior covariances among the training points, $k(X, x_*) = k(x_*, X)^T$ is the vector of prior covariances between the training points and the test point, and $m(X) = [m(x_1), \ldots, m(x_n)]^T$. The choice and optimisation of the kernel and mean function constitute the main challenge in GP regression, and we will discuss these aspects in detail in the next sections.
Our goal is to obtain a predictive posterior pdf for the unknown function value $f_*$ at $x_*$. From the fully specified GP prior we can now find this simply by "looking at" the training data $f$, i.e. by deriving from the GP prior $p(D, f_*|x_*)$ the conditional pdf
$$p(f_*|D, x_*) = \mathcal{N}(f_*;\, \mu_*, \sigma_*^2).$$
The mean and variance of this univariate Gaussian can be expressed in closed form as
$$\mu_* = m(x_*) + k(x_*, X)\,\Sigma^{-1}\,[f - m(X)],$$
$$\sigma_*^2 = k(x_*, x_*) - k(x_*, X)\,\Sigma^{-1}\,k(X, x_*).$$
The prediction $\mu_*$ for $f_*$ is thus simply the prior mean $m(x_*)$ plus a shift given by a weighted sum of the shifts of the known function values from their corresponding prior means, $f - m(X)$. The weights are proportional to the covariances between the prediction at $x_*$ and the known function values at the training points $X$, as set by the kernel $k(x_*, X)$. The prediction variance $\sigma_*^2$ is given as the prior variance $k(x_*, x_*)$ reduced by a term representing the additional information provided by the training data about the function value at $x_*$. This naturally depends only on the kernel. We will refer to the width $\sigma_*$ simply as the regression error or GP prediction error, keeping in mind that it should be interpreted in a Bayesian manner.
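The closed-form posterior above amounts to a few lines of linear algebra. The following sketch (all function and parameter names are illustrative, not the xsec API) assumes a kernel function that returns the matrix of prior covariances between two sets of input points:

```python
import numpy as np

def gp_posterior(x_star, X, f, kernel, mean=lambda A: np.zeros(len(A))):
    """Noise-free GP posterior mean and variance at a single test point x_star.

    kernel(A, B) returns the matrix of prior covariances k(a, b) between the
    rows of A and B; mean(A) returns the prior mean at each row of A.
    """
    Sigma = kernel(X, X)                          # prior covariance of training data
    k_star = kernel(x_star[None, :], X).ravel()   # covariances k(x_*, X)
    shift = np.linalg.solve(Sigma, f - mean(X))   # Sigma^{-1} [f - m(X)]
    mu = mean(x_star[None, :])[0] + k_star @ shift
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] \
        - k_star @ np.linalg.solve(Sigma, k_star)
    return mu, var
```

At a training point, the posterior mean reproduces the known function value exactly and the posterior variance collapses to zero, as expected in the noise-free case.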

Kernel choice and optimisation
Choosing the kernel, Eq. (3), is the main modelling step in GP regression. It effectively determines what types of functional structure the GP will be able to capture. In particular, it encodes the smoothness and the periodicity (if applicable) of the function that is being modelled, as it controls the expected correlation between function values at two different points. The choice of prior mean function, Eq. (2), is typically much less important, as we discuss at the end of this section.
The question of the optimal kernel choice is covered in more detail in Refs. [4,37]. The squared-exponential kernel,
$$k(x, x') = \sigma_f^2 \exp\!\left(-\frac{|x - x'|^2}{2\ell^2}\right), \qquad (12)$$
is the standard choice. It results in an exponentially decreasing correlation as the Euclidean distance between two input points increases with respect to a length-scale hyperparameter $\ell$. The signal variance $\sigma_f^2$ is a hyperparameter containing information about the amplitude of the modelled function. This is a universal kernel [38], which means that it is in principle capable of approximating any continuous function given enough data. The infinite differentiability and exponential behaviour of this kernel typically result in a very smooth posterior mean.
However, for our purposes, the squared-exponential kernel has some problems. Its sensitivity to changes in the function means that the length scale is usually determined by the smallest 'wiggle' in the function [37]. We hence consider also the Matérn kernel family: like the squared-exponential, these are universal and stationary, i.e., only functions of the relative positions of the two input points, but additionally incorporate a smoothness hyperparameter $\nu$, following the basic form
$$k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right), \qquad r \equiv |x - x'|, \qquad (13)$$
where $\Gamma(\nu)$ is the gamma function and $K_\nu$ is a modified Bessel function of the second kind. For the modelling of cross-section functions, we adopt the Matérn kernel class on the basis of its superior performance. This has followed significant testing and cross-validation across a number of different problems [39][40][41]. During testing we found $\nu = \frac{3}{2}$ to be optimal for our purposes, in which case Eq. (13) simplifies to
$$k(x, x') = \left(1 + \frac{\sqrt{3}\,r}{\ell}\right)\exp\!\left(-\frac{\sqrt{3}\,r}{\ell}\right). \qquad (14)$$
To account for the fact that some directions in the input space of masses and mixing angles may have more impact on the cross-section values than others, we use an anisotropic, multiplicative Matérn kernel,
$$k(x, x') = \sigma_f^2 \prod_{d=1}^{m}\left(1 + \frac{\sqrt{3}\,|x^{(d)} - x'^{(d)}|}{\ell_d}\right)\exp\!\left(-\frac{\sqrt{3}\,|x^{(d)} - x'^{(d)}|}{\ell_d}\right), \qquad (15)$$
where we have also included a signal variance hyperparameter $\sigma_f^2$, similar to the one in Eq. (12). Here $x^{(d)}$ denotes the $d$th component of the input vector $x$, and $\ell$, with components $\ell_d$, is a vector containing one length scale per $x$ component. The product over the dimensions of the parameter space results in points only being strongly correlated if, in each dimension, their distance is small with respect to the relevant length scale.
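The anisotropic, multiplicative Matérn-3/2 kernel described above is straightforward to vectorise with NumPy. The sketch below is illustrative rather than the actual xsec implementation; it builds the covariance matrix one dimension at a time:

```python
import numpy as np

def matern32_aniso(A, B, lengthscales, sigma_f2=1.0):
    """Anisotropic multiplicative Matern-3/2 kernel with one length scale
    per input dimension and a signal variance sigma_f2 (illustrative sketch)."""
    K = np.full((len(A), len(B)), sigma_f2)
    for d, ell in enumerate(lengthscales):
        r = np.abs(A[:, d, None] - B[None, :, d]) / ell   # scaled distance in dim d
        K *= (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
    return K
```

Because the per-dimension factors multiply, two points are strongly correlated only when they are close in every dimension relative to that dimension's length scale.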
So far we have focused on the "noise-free" case, in which the training targets $f$ are the exact values of the true function at the training points. In this case the predictive posterior $p(f_*|D, x_*)$ collapses to a delta function when $x_*$ equals a training point. In theory this is a reasonable approach, since what we seek is a surrogate model for an expensive, but precise and deterministic numerical computation. In practice, however, allowing for some uncertainty also at the training points typically results in a more well-behaved and stable regression model. The main reason for this is that the additional wiggle-room in the modelling can ease the challenging matrix numerics of GP regression, as we will discuss in some detail in Sect. 2.3. We therefore add a "white-noise" term,
$$k_{\text{noise}}(x, x') = \sigma^2\,\delta_{xx'}, \qquad (16)$$
to our kernel, where $\sigma^2$ is the hyperparameter that sets the amount of "noise". The effect of this term is simply to add $\sigma^2$ along the diagonal of the covariance matrix $\Sigma$, as well as to the prior variance at the test point, $k(x_*, x_*)$. It is known as homoscedastic noise, as it is the same for all data points. In GP terminology, including this additional variance term corresponds to going from the noise-free case to a scenario with noisy training data. The targets are then considered measurements $y_i \equiv y(x_i) = f(x_i) + \epsilon_i$, with the noise $\epsilon_i$, introduced in the process of performing the $i$th measurement, modelled by a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. However, we remind the reader that in our case this Gaussian pdf represents an adopted effective Bayesian degree of belief in the accuracy of the training data, rather than an expression of actual random noise.
Conceptually we should then make the substitution $f \to y$ in our definitions from Sect. 2.1. Our training set becomes $D = \{X, y\}$, with $y = [y_1, \ldots, y_n]^T$, and the GP prior becomes a joint pdf for $y$ and $y_*$:
$$p(D, y_*|x_*) = p(y, y_*|X, x_*). \qquad (17)$$
The prior mean function and kernel now specify expectation values in $y$ space,
$$m(x) = \mathrm{E}[y(x)], \qquad (18)$$
$$k(x, x') + \sigma^2\,\delta_{xx'} = \mathrm{E}\big[(y(x) - m(x))(y(x') - m(x'))\big], \qquad (19)$$
where we note that $\mathrm{E}[y(x)] = \mathrm{E}[f(x)]$ since the Gaussian noise term has zero mean. Likewise, the predictive posterior pdf becomes
$$p(y_*|D, x_*) = \mathcal{N}(y_*;\, \mu_*, \sigma_*^2), \qquad (20)$$
with mean and variance
$$\mu_* = m(x_*) + k(x_*, X)\,\Sigma^{-1}\,[y - m(X)], \qquad (21)$$
$$\sigma_*^2 = k(x_*, x_*) + \sigma^2 - k(x_*, X)\,\Sigma^{-1}\,k(X, x_*), \qquad (22)$$
where $\Sigma$ now includes the noise contribution $\sigma^2$ along its diagonal. Our complete kernel is then given by
$$k(x, x') = \sigma_f^2 \prod_{d=1}^{m}\left(1 + \frac{\sqrt{3}\,|x^{(d)} - x'^{(d)}|}{\ell_d}\right)\exp\!\left(-\frac{\sqrt{3}\,|x^{(d)} - x'^{(d)}|}{\ell_d}\right) + \sigma^2\,\delta_{xx'}. \qquad (23)$$
Fixing $\nu = \frac{3}{2}$, as discussed above, we are left with the set $\theta = \{\sigma_f^2, \ell, \sigma^2\}$ of undetermined hyperparameters. To be fully Bayesian, one would introduce a prior pdf $p(\theta)$ for the hyperparameters and obtain the GP posterior $p(y_*|D, x_*)$ by marginalising over $\theta$,
$$p(y_*|D, x_*) = \int p(y_*, \theta|D, x_*)\,\mathrm{d}\theta = \int p(y_*|\theta, D, x_*)\,p(\theta|D)\,\mathrm{d}\theta \propto \int p(y_*|\theta, D, x_*)\,p(D|\theta)\,p(\theta)\,\mathrm{d}\theta. \qquad (24)$$
In our high-dimensional case with large datasets, such integration would come at a steep computational expense, even with MCMC methods. We therefore follow the common approach of using a point estimate for the hyperparameters, found by maximising the log-likelihood function [4]
$$\log p(D|\theta) = \log p(y|X, \theta) = -\frac{1}{2}\,[y - m(X)]^T\,\Sigma^{-1}\,[y - m(X)] - \frac{1}{2}\log|\Sigma| - \frac{n}{2}\log 2\pi. \qquad (25)$$
Finding an adequate set of hyperparameters constitutes the model training step in the GP approach. It is complicated by the fact that each optimisation step requires the computation of the inverse and determinant of the $n \times n$ covariance matrix $\Sigma$, which scales poorly with the number of training points $n$. To increase speed and numerical stability, $\Sigma$ is generally not directly inverted in practice; its Cholesky decomposition is used instead. In an attempt to avoid local optima, we employ the SciPy implementation of the differential evolution method [42,43], rather than performing a gradient-based search.
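A minimal version of this training step, with the hyperparameters optimised in log space by SciPy's differential evolution and the covariance matrix handled through its Cholesky decomposition, might look as follows. This is an isotropic Matérn-3/2 toy model with illustrative names and bounds, not the xsec training code:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import differential_evolution

def matern32(A, B, ell):
    """Isotropic Matern-3/2 kernel (toy stand-in for the full anisotropic one)."""
    r = np.sqrt(((A[:, None, :] - B[None, :, :])**2).sum(-1)) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def neg_log_likelihood(log_theta, X, y):
    """-log p(y|X, theta) for theta = (sigma_f^2, ell, sigma^2), parametrised
    in log space so the search stays positive. Cholesky replaces inversion."""
    sf2, ell, s2 = np.exp(log_theta)
    Sigma = sf2 * matern32(X, X, ell) + s2 * np.eye(len(X))
    c, low = cho_factor(Sigma, lower=True)
    resid = y - y.mean()                     # constant (sample-mean) prior mean
    alpha = cho_solve((c, low), resid)       # Sigma^{-1} [y - m(X)]
    logdet = 2.0 * np.log(np.diag(c)).sum()  # log|Sigma| from the Cholesky factor
    return 0.5 * (resid @ alpha + logdet + len(y) * np.log(2.0 * np.pi))

# Global search over the hyperparameters with differential evolution
X = np.linspace(0.0, 5.0, 25)[:, None]
y = np.sin(X).ravel()
result = differential_evolution(neg_log_likelihood, bounds=[(-6, 4)] * 3,
                                args=(X, y), seed=1, maxiter=40, tol=1e-6)
```

The population-based search trades speed for robustness against the local optima that plague gradient ascent on the marginal likelihood.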
Recent work has demonstrated that the theoretical prediction error $\sigma_*^2$ in Eq. (22) systematically underestimates the mean-squared prediction error when the hyperparameters are learned from the data [44]. As proposed there, we account for the uncertainty on the point estimate of the hyperparameters by adding a correction term to $\sigma_*^2$, derived from the Hybrid Cramér-Rao Bound. In our case, with a constant prior mean function, this extra term amounts to
$$\frac{\big[1 - k(x_*, X)\,\Sigma^{-1}\,\mathbf{1}\big]^2}{\mathbf{1}^T\,\Sigma^{-1}\,\mathbf{1}}, \qquad (26)$$
where $\mathbf{1} \equiv [1, \ldots, 1]^T$. In particular, this increases the prediction error at test points far from the training data.
Compared to the choice of kernel, the choice of the prior mean function, Eq. (18), is typically less important. Following conditioning on a sufficiently large training set, the prior gets overpowered and the posterior mean is primarily influenced by the training data through the second term in Eq. (21). For this reason the prior mean function is commonly taken to be zero everywhere. Nevertheless, it is sensible to incorporate our knowledge of the mean, and we therefore use the sample mean of the target values y as a prior mean function that is constant in x.

Regularisation of the covariance matrix
A practical challenge when training GPs is to ensure numerical stability when inverting the covariance matrix $\Sigma$. The precision of the result is controlled by the condition number $\kappa$ of $\Sigma$, which can be considered a measure of the sensitivity of the inversion to round-off error. It is computed as the ratio $\kappa = \lambda_{\max}/\lambda_{\min}$ between the largest and smallest eigenvalues of $\Sigma$, and becomes infinite for a singular matrix. The loss of numerical precision at high $\kappa$ becomes most obvious when the predictive variance, computed according to Eq. (22), evaluates to a negative number. In order to prevent this problem, it is essential to understand how to control $\kappa$.
When the target values of training points are strongly correlated, their corresponding rows and columns in $\Sigma$ are nearly identical. This leads to eigenvalues close to zero and a very large condition number. It has been shown that in the worst case, $\kappa$ can grow linearly with the number of training points and quadratically with the signal-to-noise ratio $\mathrm{SNR} = \sigma_f/\sigma$ [45].
Increasing the noise level improves numerical stability, as a larger diagonal contribution $\sigma^2$ to $\Sigma$ enhances the difference between otherwise similar rows and columns. Therefore, we add a term to the log-likelihood in Eq. (25) that penalises hyperparameter choices with extremely high signal-to-noise ratios, as suggested in Ref. [45]. Our objective function for training GPs then becomes
$$\log p(D|\theta) - \left(\frac{\mathrm{SNR}}{\mathrm{SNR}_{\max}}\right)^{b}, \qquad (27)$$
with a large exponent $b$. The large exponent guarantees that situations where $\mathrm{SNR} > \mathrm{SNR}_{\max}$ are the only ones where the penalty term has a significant effect. We use $\mathrm{SNR}_{\max} = 10^4$. In some cases the likelihood penalty in Eq. (27) does not decrease the condition number sufficiently to stabilise the inversion. However, choosing a lower overall value for $\mathrm{SNR}_{\max}$ dilutes the information in the training data to an extent that it can sometimes be fitted by noise, even when unnecessary. We therefore check the condition number after the optimisation with the penalty term, and proceed to increase the homoscedastic noise $\sigma^2$, for the inversion step only, until the condition number drops below a reasonable value $\kappa_{\max}$ [46]. We set $\kappa_{\max} = 10^9$, roughly corresponding to a maximal loss of nine digits of accuracy from the total of 16 in a 64-bit double-precision floating-point number. These measures may seem to deteriorate the performance of our regression model, but they are necessary to ensure numerical stability. The underlying reason is that we have essentially noiseless data, and are hitting the limits of floating-point precision in the process of calculating the GP predictions. In comparison to the scale and PDF uncertainties on the cross-sections, the resulting regression errors nevertheless remain small, as we demonstrate in Sec. 4.
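The inversion-step noise inflation can be sketched as a loop that grows a diagonal jitter until the condition number is acceptable. This is illustrative only; xsec's actual schedule for increasing $\sigma^2$ may differ:

```python
import numpy as np

def stabilise(Sigma, kappa_max=1e9, growth=1.5):
    """Add diagonal noise to Sigma until cond(Sigma) < kappa_max.

    Used only for the inversion step; the hyperparameters are not refitted.
    The starting jitter and geometric growth factor are illustrative choices.
    """
    jitter = 0.0
    eye = np.eye(len(Sigma))
    while np.linalg.cond(Sigma + jitter * eye) > kappa_max:
        # start from a tiny fraction of the mean diagonal, then grow geometrically
        jitter = max(jitter * growth, 1e-12 * np.trace(Sigma) / len(Sigma))
    return Sigma + jitter * eye
```

Since the jitter only ever adds to the diagonal, the returned matrix remains symmetric and its eigenvalues are shifted upwards, which is exactly what bounds the condition number.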

Distributed Gaussian processes and prediction aggregation
With $n$ training points, the complexity of the matrix inversion operations in Eqs. (21) and (22) scales as $\mathcal{O}(n^3)$, making standard GP regression unsuitable for problems that require large training sets. To overcome this challenge we construct a regression model based on distributed Gaussian processes (DGPs) [5]: we partition the total training set $D$ into $d$ manageable subsets $D_i$, and for each $D_i$ we train a new GP $M_i$. These GPs are referred to as experts. The prediction from our regression model is obtained by aggregating the predictions from the individual experts. For this prediction aggregation we follow the approach known as the Generalised Robust Bayesian Committee Machine (GRBCM) [6], whose main steps we summarise below.
First we construct a data subset $D_1 \equiv D_c$, randomly chosen from $D$ without replacement, which will be used to train a single communication expert $M_c$. Next, we partition the remaining data into subsets $\{D_i\}_{i=2}^{d}$, each of which will serve to train one expert $M_i$. Following Refs. [5,6], all experts are then trained simultaneously, such that they share a common set of hyperparameters. The GRBCM approach places no restrictions on how to partition the data to form the subsets $\{D_i\}_{i=2}^{d}$. Compared to using a simple random partition, we have noticed minor improvements with a disjoint partition, where the data are split into local subsets based on the mass parameter with the smallest length-scale hyperparameter.
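A sketch of this data split, with a random communication set and a disjoint partition of the remainder sorted along the most sensitive mass direction (function and parameter names are illustrative, not the xsec API):

```python
import numpy as np

def grbcm_partition(X, lengthscales, n_experts, n_comm, seed=0):
    """Split the indices of X into a communication set plus n_experts disjoint
    local subsets, contiguous along the dimension with the smallest length scale."""
    rng = np.random.default_rng(seed)
    comm = rng.choice(len(X), size=n_comm, replace=False)   # communication set D_c
    rest = np.setdiff1d(np.arange(len(X)), comm)
    d = int(np.argmin(lengthscales))            # most sensitive input direction
    order = rest[np.argsort(X[rest, d])]        # sort remaining points along it
    return comm, np.array_split(order, n_experts)
```

Sorting before splitting makes each expert responsible for a local slab of the most rapidly varying mass parameter, mimicking the disjoint partition described above.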
The special role of the communication expert $M_c$ becomes evident at the prediction stage. For each of the experts $\{M_i\}_{i=2}^{d}$, we construct an improved expert $M_{+i}$ by replacing the corresponding dataset $D_i$ with the extended set $D_{+i} = \{D_i, D_c\}$. That is, for prediction the communication dataset $D_c$ is shared by all the experts $M_{+i}$. The communication expert $M_c$ serves as a common baseline to which the experts $M_{+i}$ can be compared. In the final combination, the prediction from expert $M_{+i}$ will be weighted according to the differential entropy difference between its predictive distribution and that of $M_c$.
The central approximation that allows for computational gains in DGPs and related approaches is the assumption that the individual experts can be treated as independent, which corresponds to approximating the kernel matrix of the combined problem, i.e. without partition into experts, as block-diagonal. In the GRBCM approach, this approximation is expressed as the conditional independence assumption
$$p(D_i|D_j, D_c, y_*) \simeq p(D_i|D_c, y_*), \quad i \neq j.$$
That is, when the information contained in the communication set $D_c$ is known, we assume that the predictive distribution for points in subset $D_i$ should not be strongly influenced by the additional information contained in subset $D_j$.
Using Bayes' theorem and the above independence assumption, the exact predictive distribution $p(y_*|D, x_*)$ can now be approximated as
$$p(y_*|D, x_*) \propto p(D|y_*, x_*)\,p(y_*|x_*) \approx \frac{\prod_{i=2}^{d} p^{\beta_i}(D_{+i}|y_*, x_*)}{p^{\beta_1}(D_c|y_*, x_*)}\,p(y_*|x_*), \qquad (29)$$
where we have introduced the weights $\beta_i$ for the predictions from different experts, and defined $\beta_1 \equiv -1 + \sum_{i=2}^{d}\beta_i$. By applying Bayes' theorem again, we can express our approximation for $p(y_*|D, x_*)$ in terms of the corresponding predictive distributions from the individual experts, $p_{+i}(y_*|D_{+i}, x_*)$ and $p_c(y_*|D_c, x_*)$. Leaving out normalisation factors, the distribution for the aggregated prediction becomes
$$p(y_*|D, x_*) \approx \frac{\prod_{i=2}^{d} p_{+i}^{\beta_i}(y_*|D_{+i}, x_*)}{p_c^{\beta_1}(y_*|D_c, x_*)}, \qquad (30)$$
with mean $\mu_{\mathrm{DGP}}$ and variance $\sigma^2_{\mathrm{DGP}}$ at $x_*$ given by
$$\mu_{\mathrm{DGP}} = \sigma^2_{\mathrm{DGP}}\left[\sum_{i=2}^{d}\beta_i\,\sigma^{-2}_{+i}(x_*)\,\mu_{+i}(x_*) - \beta_1\,\sigma^{-2}_c(x_*)\,\mu_c(x_*)\right], \qquad (31)$$
$$\sigma^{-2}_{\mathrm{DGP}} = \sum_{i=2}^{d}\beta_i\,\sigma^{-2}_{+i}(x_*) - \beta_1\,\sigma^{-2}_c(x_*). \qquad (32)$$
Following Ref. [6], we set the weights $\beta_i$ to
$$\beta_i = \begin{cases} 1, & i = 2, \\[2pt] \frac{1}{2}\big[\log\sigma^2_c(x_*) - \log\sigma^2_{+i}(x_*)\big], & i = 3, \ldots, d. \end{cases} \qquad (33)$$
The reason for assigning weight $\beta_2 = 1$ to expert $M_{+2}$ is that the approximation in Eq. (29) is exact for $i = 2$ with $\beta_2 = 1$. For each remaining expert $M_{+i\geq 3}$, the weight is taken to be the difference in differential entropy between the baseline predictive distribution of the communication expert, $p_c(y_*|D_c, x_*)$, and that of the given expert, $p_{+i}(y_*|D_{+i}, x_*)$. Thus, if an expert $M_{+i}$ provides little additional predictive power over $M_c$, its relative influence on the aggregated prediction is low.
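Given the per-expert Gaussian predictions, the GRBCM combination reduces to a precision-weighted sum. A sketch with illustrative names, following the entropy-difference weights described above:

```python
import numpy as np

def grbcm_aggregate(mu_c, var_c, mu_exp, var_exp):
    """Aggregate GRBCM predictions at one test point.

    mu_c, var_c: prediction of the communication expert M_c.
    mu_exp, var_exp: predictions of the experts M_{+i}, i = 2, ..., d.
    """
    mu_exp = np.asarray(mu_exp, dtype=float)
    var_exp = np.asarray(var_exp, dtype=float)
    beta = 0.5 * (np.log(var_c) - np.log(var_exp))  # differential-entropy weights
    beta[0] = 1.0                                   # beta_2 = 1 (exact case)
    prec = (beta / var_exp).sum() - (beta.sum() - 1.0) / var_c
    mu = ((beta * mu_exp / var_exp).sum()
          - (beta.sum() - 1.0) * mu_c / var_c) / prec
    return mu, 1.0 / prec
```

An expert whose predictive distribution is indistinguishable from the communication expert's receives weight zero and drops out of the combination, while a confident expert (small variance) dominates it.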
Requiring the experts to share a common set of hyperparameters effectively disfavours overfitting of individual experts. Moreover, the risk of overfitting is alleviated by the fact that after training, each expert is extended with the communication dataset D c that it did not see during training, and its weight in the prediction aggregation is regularised through the comparison to the communication expert.
The GRBCM split of the dataset into $d$ experts reduces the complexity of training from $\mathcal{O}(n^3)$ to $\mathcal{O}\big(d\,(n/d)^3\big) = \mathcal{O}(n^3 d^{-2})$. The memory, storage space, and evaluation time all depend directly on the size of the covariance matrix, and scale as $\mathcal{O}(n^2)$ for a regular GP, but as $\mathcal{O}(n^2/d)$ in the GRBCM approach.

Sample generation
We generate the inputs for our training data calculations by sampling the physical gluino and squark masses and (for third-generation squarks) the angles describing mass mixing between gauge eigenstates. The only other parameters involved in the production of gluinos and squarks at NLO in QCD are the strong coupling $\alpha_s$ and the SM quark masses. The cross-sections depend on $\alpha_s$ both through the matrix element and the PDF. To capture the cross-section variation due to the uncertainty on $\alpha_s$, we generate separate input points with $\alpha_s$ set to 0.1180 (central value), 0.1165 ($1\sigma$ lower value) and 0.1195 ($1\sigma$ upper value), using the corresponding PDF sets, and train separate GPs on the ratio of cross-sections obtained with the central and $\pm 1\sigma$ values. For the SM masses we use fixed values for the bottom and top quark masses, and assume the other quark masses to be zero.
In the sample generation we do not simply sample over a regular grid of parameter values, for three reasons:
1. Grid sampling is inefficient when one parameter is more important than the others: too many evaluations are spent on varying the less influential parameters whilst keeping the important one at a fixed value.
2. The curse of dimensionality renders this sampling technique infeasible for processes that depend on more than three or four parameters.
3. The complexity of sampling and cross-section calculation for the large number of processes that we consider (in terms of final-state squark flavours) means that it is more efficient to evaluate multiple cross-sections for every parameter combination than to generate separate samples for each final state.
We sample individual baseline masses for the gluino, $m_{\tilde g}$, for the first- and second-generation (gauge eigenstate) squarks, $m_{\tilde u_L}$, $m_{\tilde d_L}$, $m_{\tilde c_L}$, $m_{\tilde s_L}$, $m_{\tilde u_R}$, $m_{\tilde d_R}$, $m_{\tilde c_R}$, and $m_{\tilde s_R}$, and for the third-generation (mass eigenstate) squarks $m_{\tilde b_1}$, $m_{\tilde t_1}$, $m_{\tilde b_2}$, and $m_{\tilde t_2}$. We do this in two different ways: either drawing from a uniform distribution on the interval [50, 3500] GeV, or from a hybrid distribution, uniform on the interval [50, 150] GeV and logarithmic on the interval [150, 3500] GeV. We order the third-generation squarks in mass after sampling, so that $\tilde t_1$ and $\tilde b_1$ are by definition the lightest. We intentionally choose these sampling ranges to be slightly beyond our final claimed region of validity for the regression - typically 200-3000 GeV - in order to try to avoid large regression errors at the edges. The regions where our regression has been validated more than cover the ranges of masses of interest for the LHC. We sample the cosines of the sbottom and stop mixing angles, $\cos\theta_{\tilde b}$ and $\cos\theta_{\tilde t}$, uniformly on the interval $[-1, 1]$.
On top of our baseline sampling, we employ further sub-sampling of the particle masses in order to properly include cross-section resonances within our training set. This ensures that our training dataset is more densely sampled where the cross-sections of interest vary the most. To do this, we generate and then combine five different training sets:
i) all masses sampled from the uniform prior;
ii) all masses sampled from the hybrid prior;
iii) the uniform prior for both the mass of the gluino and a common squark mass scale, with a Gaussian prior of width 50 GeV for the difference between the common squark mass scale and the masses of the individual squarks;
iv) the uniform prior for the mass of the gluino, the hybrid prior for a common squark mass scale, and the same Gaussian prior as in iii) to draw values for the squark masses around the common scale;
v) the uniform prior for a common mass scale, and the same Gaussian prior as in iii) to draw a gluino mass and the individual squark masses around the common scale.
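The hybrid mass prior and the degenerate-mass sub-samples can be sketched as follows. The 50/50 split between the two hybrid pieces and all function names are illustrative choices, not taken from the xsec generation scripts:

```python
import numpy as np

rng = np.random.default_rng(42)

def hybrid_mass(n, low=50.0, mid=150.0, high=3500.0):
    """Hybrid prior: uniform on [50, 150] GeV, log-uniform on [150, 3500] GeV."""
    pick_uniform = rng.random(n) < 0.5
    uniform_part = rng.uniform(low, mid, n)
    log_part = np.exp(rng.uniform(np.log(mid), np.log(high), n))
    return np.where(pick_uniform, uniform_part, log_part)

def near_degenerate(n, n_squarks=8, width=50.0):
    """Sample iii): uniform gluino mass and common squark scale, with the
    individual squark masses Gaussian-scattered around the common scale."""
    m_gluino = rng.uniform(50.0, 3500.0, n)
    m_common = rng.uniform(50.0, 3500.0, n)
    m_squark = m_common[:, None] + rng.normal(0.0, width, (n, n_squarks))
    return m_gluino, m_squark
```

Concatenating draws from samplers like these yields the combined training set: dense coverage at low masses from the logarithmic piece, broad coverage at high masses from the uniform piece, and clusters of near-degenerate spectra where interference effects matter.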
By joining these samples, we are able to achieve both good sampling of low masses (where cross-sections are large) via the logarithmic mass prior, and acceptable sampling of large masses via the uniform prior. Because R-parity is assumed in the MSSM, there are no new s-channel resonances in the mass range where we train our GPs, and SM resonances are too light to have any impact. Therefore, we do not have to worry directly about sampling densely in the region of resonances. However, at LO there are potential effects for gluino pair production near threshold from destructive interference between diagrams with s-channel gluons and those with t-channel squarks. For squark production, the chirality of the final-state squarks affects the contribution from t-channel gluino exchange, so that, for example, in squark-squark production with equal chiralities the matrix element has a zero at $m_{\tilde g} \to 0$ and a maximum around $m_{\tilde g} \approx \bar m_{\tilde q}$, where $\bar m_{\tilde q}$ is the average mass of the first- and second-generation squarks. Both these effects can lead to non-monotonic behaviour of the cross-section as a function of the masses (bumps and dips). In Sec. 4 we shall see this particularly for gluino pair production. This necessitates the separate samples with (near-)degenerate masses, where such effects are larger.
Even though all non-degenerate squark masses enter into the LO cross-sections, to the precision of our training sample (see Sec. 3.2), only two mass parameters enter into the NLO corrections to gluino and first/second-generation squark production: the gluino mass $m_{\tilde g}$ and the averaged first and second-generation squark mass $m_{\tilde q}$. Therefore, any additional interference structures present in the NLO corrections to these processes must be visible in the $(m_{\tilde g}, m_{\tilde q})$ slice of the parameter space. For stop/sbottom production, such structures should be similarly visible in the $(m_{\tilde g}, m_{\tilde t_1/\tilde b_1})$ slice. As a result, we shall spend some time below in validation (Sec. 4) looking at gluino and first/second-generation squark production cross-sections in terms of $m_{\tilde g}$ and $m_{\tilde q}$, and third-generation squark production in terms of $m_{\tilde g}$ and $m_{\tilde t_1}$. Table 1 gives the list of cross-sections available from xsec, and their parameter dependencies.

The reader may wonder at this point why we train our GPs on the total cross-section in terms of all the non-degenerate masses, rather than simply on the NLO corrections (in terms of just $m_{\tilde g}$ and $m_{\tilde q}$ for gluino/first/second-generation squark production, supplemented with the three relevant third-generation parameters each for stop and sbottom production). One reason is that we have designed xsec to be more general than Prospino; in future releases we intend to make use of training data from other tools able to move beyond the degenerate-squark-mass approximation. The other reason is that we feel it is more convenient for a user to simply obtain full LO+NLO (and future +NLL+NNLL+...) cross-sections from xsec, rather than needing to install the correct LO cross-section calculator and PDF set, call them, and then combine the results. Similarly, training on and returning the full cross-section will make simultaneously including cross-sections from different calculators (chosen e.g. as most appropriate for different theories or processes) far more straightforward.

Table 1: List of all available cross-sections in xsec (final state and variables). Explicit first and second-generation squarks $\tilde q_i$ can be $\tilde u_L$, $\tilde d_L$, $\tilde c_L$, $\tilde s_L$, $\tilde u_R$, $\tilde d_R$, $\tilde c_R$ or $\tilde s_R$, whilst $m_{\tilde q}$ denotes the average over all eight of these masses. Where the charge-conjugate process is distinct from the original, xsec returns the sum of the cross-section for the process and its conjugate (as indicated; for $\tilde q_i \tilde q_j^*$, the conjugate is only distinct when $i \neq j$).

Calculation of NLO training cross-sections
We use Prospino 2.1 to generate cross-sections for our training samples. This calculates, amongst other things, NLO cross-sections for strong production processes in the MSSM for proton-proton collisions at a given centre-of-mass (CoM) energy, and for a choice of renormalisation/factorisation scales. We have modified the code to set $\alpha_s$ accordingly, and to accept generic PDF sets from LHAPDF 6.2 [47]. For the current version of xsec we have used the PDF4LHC15_nlo_30_pdfas symmetric Hessian NLO PDF set, with 30 eigenvector members and two members with varied strong coupling $\alpha_s(m_Z) = 0.1180 \pm 0.0015$ [48].
Prospino performs the PDF integral of the partonic process using the VEGAS [49] importance sampling algorithm for Monte Carlo integration. The convergence criterion leaves some numerical noise in the result, typically of the order of $10^{-3}$ relative to the central cross-section value.
In order to obtain K-factors for gluino and first/second-generation squark production, Prospino first calculates the LO and NLO cross-sections using a single squark mass, obtained as the average over the masses of all first and second-generation squarks, i.e. eight in total; this mass is employed even for any internal third-generation squark propagators. The ratio of the NLO to the LO result gives the K-factor for the process in question. Prospino then recomputes the cross-section at LO, without the assumption of an average mass, and the corresponding NLO value is found by multiplying this LO result by the K-factor calculated for the average squark mass. The calculation proceeds similarly for stop and sbottom production, except that the third-generation squark masses are kept non-degenerate for all steps of the calculation, meaning that first and second-generation masses are still averaged in the K-factor calculation. The final cross-sections should thus be viewed as an approximation to a fully non-degenerate squark mass NLO calculation. The effect of this assumption was investigated in Ref. [50] and found to be relatively small in most parts of parameter space.
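The averaged-mass K-factor procedure can be sketched as follows. The two toy cross-section functions are purely illustrative stand-ins for Prospino's LO and NLO calculations; only the structure of the procedure reflects the description above.

```python
def avg(masses):
    return sum(masses) / len(masses)

def sigma_lo(m_gluino, squark_masses):
    # Hypothetical stand-in for the LO cross-section with non-degenerate masses.
    return 1.0e4 / (m_gluino + avg(squark_masses)) ** 2

def sigma_nlo_degenerate(m_gluino, m_avg):
    # Hypothetical stand-in for the NLO cross-section at a single averaged mass.
    return 1.3e4 / (m_gluino + m_avg) ** 2

def sigma_nlo(m_gluino, squark_masses):
    m_avg = avg(squark_masses)
    # Step 1: K-factor from the degenerate-mass LO and NLO calculations.
    k = sigma_nlo_degenerate(m_gluino, m_avg) / sigma_lo(m_gluino, [m_avg] * 8)
    # Step 2: LO recomputed with the non-degenerate masses, scaled by K.
    return k * sigma_lo(m_gluino, squark_masses)
```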
In total, Prospino allows 141 final states containing gluinos and different flavour squarks. However, due to charge conjugation relations and NLO QCD identities in the cross-section (relating certain combinations of left and right-handed final-state squarks under the exchange of masses), only 49 distinct final states are needed to represent all processes calculated by Prospino in our training sample. These relationships are handled internally in xsec, and therefore need not directly concern the user. A list of the available cross-sections from the user's side can be found in Table 1.
For every parameter point and every process in the training sample, we calculate 34 different cross-section values. These include: i) a central value, ii) 30 values using the PDF eigenvectors, iii) one value where α s is taken to its lower value, iv) one where α s is taken to its upper value, v) a value where the renormalisation/factorisation scale is doubled, and, vi) a value where the renormalisation/factorisation scale is halved.
From the values in i) and ii) we follow the PDF4LHC guidelines to calculate the symmetric PDF uncertainty on the central value, understood to be a 68% confidence level bound. Similarly, we follow the PDF4LHC prescription to calculate the 68% confidence level bound from varying $\alpha_s$ alone, using the results from iii) and iv). We do not add the PDF and $\alpha_s$ errors, but instead train different DGPs for the two errors. These errors should be added in quadrature after evaluation if the user wishes to obtain a single 68% confidence bound incorporating both effects.
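For a symmetric Hessian set, the prescription amounts to adding the member deviations in quadrature. A minimal sketch (function names illustrative):

```python
import math

def pdf_uncertainty(sigma_central, sigma_members):
    # Symmetric Hessian PDF error: quadrature sum of the deviations of the
    # eigenvector members from the central value (68% CL).
    return math.sqrt(sum((s - sigma_central) ** 2 for s in sigma_members))

def alphas_uncertainty(sigma_up, sigma_down):
    # 68% CL error from the two varied-alpha_s members.
    return 0.5 * abs(sigma_up - sigma_down)

def combined_uncertainty(delta_pdf, delta_alphas):
    # Combine PDF and alpha_s errors in quadrature after evaluation.
    return math.hypot(delta_pdf, delta_alphas)
```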
Further, we follow the standard lore of estimating an uncertainty associated with missing higher-order corrections by calculating the spread in cross-section values under scale variation coming from v) and vi). In xsec this is referred to as the scale uncertainty. Choosing exactly how to interpret this asymmetric uncertainty as a probability distribution (flat, Gaussian or otherwise), and whether/how to combine it with the (Gaussian) $\alpha_s$ and PDF uncertainties, is left to the user.
In total this leaves us with seven cross-sections: the central value and the two-sided PDF, $\alpha_s$ and scale uncertainties.

Training implementation details
For each final state, we train one large DGP with multiple experts on the central cross-section value, and five smaller regular GPs: one each on the upper and lower values arising from renormalisation/factorisation scale variation, one each on the cross-sections for the variation of $\alpha_s$ (and corresponding PDFs) to the upper and lower limits of its 68% confidence interval (these values are later symmetrised), and a single GP on the symmetric 68% confidence level PDF variation.
To improve the training, we simplify the target model. The span of some ten orders of magnitude in cross-sections across the sampled parameter space makes it numerically challenging to train the DGPs on the raw cross-section numbers. Before training, we therefore scale out part of the dependence on the final-state masses by first multiplying the central input cross-section by the square of the (average) final-state mass, and then taking the logarithm of the result. For the remaining five 'error' values, we train on the logarithm of the ratio to the central cross-section, such that after the reverse transformation, the final predictions for the ratios are strictly positive. As noted in Sec. 2.2, we choose a constant prior mean equal to the sample average of the transformed target values, which results in GPs effectively modelling the deviation from this mean value. We also rescale all input parameters to the interval [0, 1] before training, to improve numerical stability and precision.
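The target and feature transformations can be summarised in code; the function names and exact inverse mapping below are illustrative of the description above, not the xsec internals.

```python
import math

def transform_target(sigma, m_final):
    # Scale out the leading final-state mass dependence, then take the log:
    # y = log(sigma * m^2).
    return math.log(sigma * m_final ** 2)

def inverse_transform_target(y, m_final):
    # Reversed automatically at evaluation time.
    return math.exp(y) / m_final ** 2

def rescale_feature(m, m_min=100.0, m_max=3500.0):
    # Map input masses to [0, 1] for numerical stability.
    return (m - m_min) / (m_max - m_min)
```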
The transformations applied during training are automatically recorded and reversed at the time of evaluation. As the GPs are trained on the logarithm of the (mass-rescaled) cross-section, the resulting predictive distribution for the absolute cross-section is a log-normal distribution. Due to the positive skew of this distribution, we base the central cross-section estimate on the median. Specifically, let $\mathcal{N}(\mu_\mathrm{DGP}(x_*), \sigma^2_\mathrm{DGP}(x_*))$ denote the aggregated DGP predictive distribution for the log-transformed cross-section at point $x_*$, after accounting for the constant prior shift and the mass scaling applied to the training data. The central cross-section value $y_\textsf{xsec}$ returned by xsec is then the median
$$y_\textsf{xsec} = e^{\mu_\mathrm{DGP}},$$
with associated asymmetric regression uncertainties $\Delta^-_\textsf{xsec}$ and $\Delta^+_\textsf{xsec}$. These are constructed from the bounds of the $1\sigma$ credible interval $\exp(\mu_\mathrm{DGP} \pm \sigma_\mathrm{DGP})$:
$$\Delta^+_\textsf{xsec} = e^{\mu_\mathrm{DGP} + \sigma_\mathrm{DGP}} - y_\textsf{xsec}, \qquad \Delta^-_\textsf{xsec} = y_\textsf{xsec} - e^{\mu_\mathrm{DGP} - \sigma_\mathrm{DGP}}.$$
More generally, the range $y_\textsf{xsec} \pm m\Delta^\pm_\textsf{xsec}$ corresponds to an $n\sigma$ credible interval, where $n$ is given by
$$n = \pm\frac{1}{\sigma_\mathrm{DGP}} \ln\!\left(1 \pm \frac{m\Delta^\pm_\textsf{xsec}}{y_\textsf{xsec}}\right),$$
so that $n = m$ to first order in $m\Delta^\pm_\textsf{xsec}/y_\textsf{xsec}$ and $\Delta^\pm_\textsf{xsec}/y_\textsf{xsec}$.
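In code, the central value and asymmetric regression errors follow directly from the log-space mean and standard deviation. This is a sketch of the relations just described, not the xsec implementation.

```python
import math

def central_and_errors(mu, sigma):
    # Median of the log-normal predictive distribution, and the asymmetric
    # errors from the 1-sigma credible interval exp(mu +/- sigma).
    y = math.exp(mu)
    delta_minus = y - math.exp(mu - sigma)
    delta_plus = math.exp(mu + sigma) - y
    return y, delta_minus, delta_plus

def n_sigma(m, y, delta_plus, sigma):
    # Credible level n of the range [y, y + m*delta_plus]; n -> m for
    # small relative errors.
    return math.log(1.0 + m * delta_plus / y) / sigma
```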
We train the GPs with the union of the five training sets discussed in Sec. 3.1. In order to choose the optimal training parameters (relative sizes of the five different training samples, total number of training points, and experts per process), we have carried out a long programme of testing and cross-validation, in which we optimised training parameters separately for each process type, e.g. separately for gluino pair production and gluino-squark production. Our goal was to achieve good performance in terms of the regression error estimated by the DGP, while keeping disk size, memory footprint, training and evaluation time down to acceptable levels.

Validation
To validate the xsec cross-section results we generate three main test sets: $D_\text{test}$, $D_\text{test-tb}$ and $D_\text{MSSM-24}$. The set $D_\text{test}$ is used for testing all cross-sections except for the stop/sbottom pair-production processes, which are tested with the set $D_\text{test-tb}$. The sets $D_\text{test}$ and $D_\text{test-tb}$ contain 10 000 and 5 000 points, respectively, and in both cases the input points are sampled in the same way as for the corresponding training sets. The third set, $D_\text{MSSM-24}$, contains 19 000 points. We use this in the validation of all cross-sections. Here we draw the input points from the MSSM-24, as defined at $Q = 1$ TeV, using uniform priors for the MSSM parameters. Our definition of the MSSM-24 follows that in Ref. [3], except that we parameterise the Higgs sector using the higgsino mass parameter $\mu$ and the tree-level mass of the $A^0$ boson, $m_{A^0}$, instead of the soft-breaking Higgs mass parameters. We use SoftSUSY 4.0 to calculate the physical mass spectrum from the MSSM input parameters [51,52]. In addition to the three main test sets, we also generate a number of process-specific two-dimensional parameter grids for further validation.
As part of the validation we will discuss two simple error measures and their distributions for our sets of test points. The first measure is the relative error,
$$\epsilon(x_*) = \frac{y_\textsf{xsec}(x_*) - y_\mathrm{prosp}(x_*)}{y_\mathrm{prosp}(x_*)},$$
which measures the relative deviation of the xsec result $y_\textsf{xsec}$ from the true Prospino value $y_\mathrm{prosp}$. By assuming some sampling prior $\pi(x_*)$ for the test points $x_*$ and looking at the resulting distribution of $\epsilon(x_*)$ values, we can get a global picture of how well xsec performs for the different supersymmetric production processes included in xsec.
One of the strengths of GP regression is that the method directly provides point-wise uncertainty estimates, here in terms of $\Delta^-_\textsf{xsec}(x_*)$ and $\Delta^+_\textsf{xsec}(x_*)$, which are based on the width $\sigma_\mathrm{DGP}(x_*)$ of the DGP predictive distribution (see Eq. 37). As discussed in Sec. 2, the point-wise regression uncertainty is connected to a Bayesian degree of belief regarding the unknown function value at $x_*$, given the particular training set $D$ and the modelling choices made in constructing and optimising the kernel.
An interesting question is then how this point-wise uncertainty compares to the actual deviation between $y_\textsf{xsec}$ and the true Prospino value $y_\mathrm{prosp}$ across the input feature space. That is, we should also investigate the distribution of the standardised residual
$$z(x_*) = \frac{y_\mathrm{prosp}(x_*) - y_\textsf{xsec}(x_*)}{\Delta^\pm_\textsf{xsec}(x_*)}$$
in our sets of test points. Here the notation $\Delta^\pm_\textsf{xsec}$ is shorthand for using $\Delta^+_\textsf{xsec}$ when $y_\mathrm{prosp} - y_\textsf{xsec} > 0$ and $\Delta^-_\textsf{xsec}$ when $y_\mathrm{prosp} - y_\textsf{xsec} < 0$. By studying the distribution of $z(x_*)$ values for our test sets, and comparing to the unit normal distribution, we will obtain a global picture of the extent to which the point-wise regression errors are conservative compared to the true deviations.
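The two validation measures can be written down directly; this is a sketch following the verbal definitions of the relative error and the standardised residual above.

```python
def relative_error(y_xsec, y_prosp):
    # Relative deviation of the xsec prediction from the Prospino value.
    return (y_xsec - y_prosp) / y_prosp

def residual(y_xsec, y_prosp, delta_minus, delta_plus):
    # Standardised residual: Delta+ is used when the Prospino value lies
    # above the prediction, Delta- when it lies below.
    diff = y_prosp - y_xsec
    return diff / (delta_plus if diff > 0 else delta_minus)
```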
In cases where the main source of GP regression uncertainty is actual random noise in the training data, and we learn this noise level from the data by including a white-noise term, Eq. (16), in the GP kernel, we expect the residual in Eq. (40) to be distributed as $\mathcal{N}(0, 1)$. We can understand this from a simple example with single-component input points $x$. Assume that the target data are noisy measurements of the form $y(x) = f(x) + \delta$, where the noise $\delta$ is distributed as $p(\delta) = \mathcal{N}(0, \sigma^2_\mathrm{noise})$. Let the GP posterior predictive distribution for $y(x_*)$ be given by $\mathcal{N}(\mu_\mathrm{GP}(x_*), \sigma^2_\mathrm{GP}(x_*))$. If the noise is the dominant uncertainty we have $\sigma_\mathrm{GP}(x_*) \approx \sigma_\mathrm{noise}$, and we can express the residual as
$$z(x_*) = \frac{y(x_*) - \mu_\mathrm{GP}(x_*)}{\sigma_\mathrm{GP}(x_*)} \approx \frac{\delta}{\sigma_\mathrm{noise}} + \frac{f(x_*) - \mu_\mathrm{GP}(x_*)}{\sigma_\mathrm{GP}(x_*)}.$$
Given the generative model for the data, the first term will follow an $\mathcal{N}(0, 1)$ distribution. The second term is the number of standard deviations (as measured by the GP's own uncertainty) by which the GP prediction $\mu_\mathrm{GP}$ differs from the true value of the underlying function $f$. While this term can in general not be expected to be normally distributed, the assumption of noise-dominated uncertainty implies that its contribution to the residual is $\ll 1$, and hence that the $z$ distribution is close to $\mathcal{N}(0, 1)$.
On the other hand, if the true noise level is tiny and some other source of uncertainty dominates $\sigma_\mathrm{GP}$, i.e. if $y \approx f$ and $\sigma_\mathrm{GP} \gg \sigma_\mathrm{noise}$, we get
$$z(x_*) \approx \frac{f(x_*) - \mu_\mathrm{GP}(x_*)}{\sigma_\mathrm{GP}(x_*)}.$$
Thus, in this limit it is the generally small, and potentially non-Gaussian, second term from Eq. (41) that dominates the residual. The latter scenario is most similar to the case we have for the xsec residual in Eq. (40). The actual noise level in the Prospino training data is very small, as typically is the additional error in Eq. (26) accounting for uncertainty in the hyperparameter choice. Assuming some reasonably uninformative sampling prior $\pi(x_*)$, this means that for most points the widths $\Delta^\pm_\textsf{xsec}(x_*)$ are dominated by the homoscedastic error contribution that we include to stabilise the numerics (see Sec. 2.3). As this is a global error contribution, we can expect the resulting regression error for most test points to be larger than the actual error, $y_\mathrm{prosp}(x_*) - y_\textsf{xsec}(x_*)$. We therefore in general expect the $z(x_*)$ distributions to be narrow compared to $\mathcal{N}(0, 1)$, and not necessarily Gaussian. For comparison, we will include a graph of $\mathcal{N}(0, 1)$ in all our plots of residual distributions.
In the coming subsections, much of our focus will be on the regression errors and the related relative error and residual z. However, we remind the reader that the regression errors that we find are typically far subdominant to the cross-section uncertainties coming from the scale and PDF errors.

Gluino pair production
The gluino pair-production cross-section to NLO in QCD depends on the gluino mass and all the squark masses. Naturally, the gluino mass is the dominant feature of the DGP after training. The mass-averaging approximation that Prospino uses (see Sec. 3.2) means that the average first/second-generation squark mass $m_{\tilde q}$ is a strong predictor of the NLO contribution. We therefore provide it to the GPs as a separate feature, i.e. as if it were part of the vector of parameters $x$. The importance of the individual first and second-generation squark masses roughly follows the PDF contributions from their corresponding quarks, due to LO t-channel squark exchange diagrams.
Given that the mass range of parameters (features) in the training samples is [100, 3500] GeV, we validate the cross-section on the sub-interval [200, 3000] GeV for both gluino and squark masses, where the cross-section regression has solid support from training data and the regression error is small.
In Fig. 2 we compare the gluino pair-production cross-sections predicted by xsec, presented as a function of the gluino mass, with values taken directly from Prospino (but not in the training set). For this comparison we fix the squark masses to a common value of 1 TeV. We also show the associated xsec-predicted uncertainty from the renormalisation scale and the PDFs as bands. The uncertainties from the regression provided by xsec and from α s are too small to be visible on the logarithmic scale. However, below the plot we show the residual between the xsec prediction and the Prospino value (Eq. 40), as a multiple of the xsec regression uncertainty. We observe good agreement overall with the Prospino result. As expected, the scale error dominates at low gluino masses, and the PDF error at high masses.
A particular phenomenon occurs in gluino pair production due to destructive interference between LO diagrams when $m_{\tilde g} \approx m_{\tilde q}$, resulting in a vanishing partonic cross-section at threshold [7]. This manifests as a significant dip in the total pair-production cross-section when one or more of the squark masses become degenerate with the gluino mass. We show that xsec reproduces this behaviour to very good precision in Fig. 3.
Here the gluino mass is fixed to 1 TeV while the squark masses are run together as a common squark mass $m_{\tilde q}$. This figure also clearly demonstrates how subdominant the regression error is compared to the scale and PDF errors. Finally, Fig. 3 compares the results of xsec to the corresponding NLO result from NNLL-fast based on the same PDF set. We observe a slight systematic difference at the ∼1% level. This is at the level of the interpolation error quoted by NNLL-fast.
The same gluino pair-production cross-section as a function of a common squark mass for a selection of different gluino masses can be found in Fig. 4, showing that xsec reproduces the feature across the whole assumed range of validity for the regression. As expected, the regions in which the cross-section changes most rapidly are the most difficult to predict, but we see that with one exception the prediction is always within 2σ of the Prospino value, over a large number of test points.
In Figs. 5 and 6, we show the potential importance of being able to deal with non-degenerate squark masses in xsec. Here we vary the $\tilde u_L$ and $\tilde d_R$ mass alone, respectively, with all other squarks fixed at 1 TeV, and the gluino at 1.5 TeV for the $\tilde u_L$ plot and 2 TeV for the $\tilde d_R$ plot. While less pronounced, qualitatively the same dip feature from the t-channel interference discussed above can be seen here as well. In this example, the NNLL-fast NLO result fails to reproduce the feature, as expected from its inherent assumption of degenerate squark masses.

Figure 7 shows the distributions of the relative error (left) and the residual (right) for the test sets $D_\text{test}$ and $D_\text{MSSM-24}$. All distributions are normalised to unity, and the y-axis range is set to show all bins with non-zero values. In both sets the relative error is well below 5% for most points, and in fact there is only a single point, in the $D_\text{MSSM-24}$ set, with an error greater than 5%. From the comparison to the unit normal distribution we see that, as expected, the xsec regression error is somewhat larger than the true error, but with no apparent bias. The xsec prediction is also robust under a change of the test sample to the MSSM-24.
It is also instructive to perform two-dimensional grid scans of mass planes to show the relative error and residual as a function of two of the features at a time. Two examples of this are found in Fig. 8, where we show the result in the planes of $(m_{\tilde g}, m_{\tilde q})$ and $(m_{\tilde g}, m_{\tilde d_R})$. We see that the relative errors and residuals are correlated in the mass planes, as should be expected from Gaussian processes when the dominant uncertainty is not due to random noise in the training data, but rather due to the lack of information in regions where the function is changing quickly. We also note that the regression uncertainty is largest when $m_{\tilde g} \approx m_{\tilde q}$. This shows that the destructive interference dip seen in Figs. 3-5 is the part of this cross-section function that is the most challenging to capture in the regression. Further improvements could be made by adding extra training points, but only at significant cost to the evaluation speed, which seems unwarranted given how small the regression errors are compared to the other errors. Naive counting of the number of bins in the residual plots (right panels) above the 1, 2 and 3σ levels indicates that the quoted xsec regression error is in general conservative compared to the actual error (i.e. fewer bins show large residuals than expected from Gaussian statistics).

Pair production of gluinos with first or second-generation squarks
The data that we employ for training xsec 1.0 are limited by the fact that Prospino assumes flavour conservation and neglects heavy quarks in the proton PDFs. It therefore offers gluino-squark pair-production cross-sections only for processes with first and/or second-generation squarks in the final state. However, cross-sections for gluino-squark production with sbottoms or stops in the final state are expected to be very small. We also note that Prospino returns the cross-section for the sum over charge-conjugate final states, i.e. $\tilde g\tilde q_i + \tilde g\tilde q_i^*$, making it pointless to train xsec separately on the two final states. In addition to these limitations, at NLO in QCD the numerical value of the cross-section is identical for left- and right-handed squark final states, as long as their masses are identical. Although we use this fact internally in xsec to reduce the total file size of the DGPs, the user can freely request any first or second-generation final-state squark.
The sizes of the gluino-squark production cross-sections are naturally dominated by the gluino and final-state squark masses. Because of flavour conservation, to the level of approximation used in Prospino's K-factor calculation, the only additional property of the model parameter space (i.e. feature) used by xsec is the average squark mass $m_{\tilde q}$. The range of sparticle masses over which we assume this cross-section evaluation to be valid is the same as for gluino pair production, i.e. [200, 3000] GeV.
In Fig. 9 we show the predicted gluino-squark production cross-sections as a function of the gluino mass (left) and the individual $\tilde q_L$ masses (right). For the individual squark masses we keep all other masses at 1 TeV and change only the mass of the final-state squark. Also shown are the predicted regression, PDF and scale errors, and the residual between the xsec predictions and the Prospino values calculated for the same parameters. We see that xsec reliably predicts the contribution from individual squark final-state flavours. This is even the case in the region of very low final-state squark masses, which tests xsec on the arguably strange scenario in which the particular final-state squark is much lighter than the average mass of the first and second-generation squarks.
For the $\tilde g\tilde d_L$ and $\tilde g\tilde u_L$ processes, the scale error is the dominant uncertainty across the full mass range in both the gluino mass and the final-state squark mass. For the $\tilde g\tilde s_L$ and $\tilde g\tilde c_L$ processes it is generally the PDF error that dominates the uncertainty, except at low gluino masses, where the scale error is more important. The fact that the regression error residuals comparing the xsec and Prospino values appear correlated between the four processes shown is due to the same training sample being used for all processes, causing the distances to the nearest, most influential training points to be the same in all cases.

In Fig. 10 we show the distributions of the relative error between the xsec prediction and the corresponding Prospino results, as well as the residual, for each individual flavour final state. We use the same two test sets, $D_\text{test}$ and $D_\text{MSSM-24}$, as for the gluino pair-production cross-section in Fig. 7. All the distributions are normalised to unity.
The relative errors and residuals are similar across all squark flavours. There are no obvious differences in performance with the two sets of test points. The relative error distributions show that for almost all points in the test sets, the true regression error is below 10%, and xsec tends to overestimate the Prospino cross-section by a few percent. Comparing the residual and N (0, 1) distributions, we see that the predicted xsec regression uncertainty is conservative; indeed, notably more so than for the gluino pair-production cross-section (Fig. 7).
We can also compare the relative error across mass planes, as shown in Fig. 11 separately for all $\tilde g\tilde q_L$ processes, in terms of the gluino mass and the average first and second-generation squark mass. Here the final-state squark mass for each plotted cross-section is set equal to the average squark mass. We see that the regression error is below 8% across this plane for all four of the $\tilde g\tilde d_L$, $\tilde g\tilde u_L$, $\tilde g\tilde s_L$ and $\tilde g\tilde c_L$ production cross-sections.

First and second-generation squark-anti-squark pair production
We now look at the production cross-sections for squark-anti-squark pairs $\tilde q_{L/R}\tilde q^*_{L/R}$. If the flavours of the two squarks are different, the process is assumed to include the charge-conjugate state. In this section we discuss only final states with first and second-generation (anti-)squarks, where the final-state flavour may come from first and second-generation (anti-)quarks sampled from the proton.
Within the limitations set by the training data from Prospino, the LO first and second-generation squark-anti-squark cross-sections depend on the masses of the gluino and the final-state squark(s), and the NLO corrections depend further on the mean mass of all first and second-generation squarks. Again, we validate the cross-section on the sub-interval [200, 3000] GeV of the training data (in both gluino and squark mass). For squark-anti-squark production, one should keep in mind that masses below the lower end of this range may be affected by resonant production through Z and W bosons, and while xsec's reported regression error increases below 200 GeV, it cannot take these resonances into account, as they are not included in its training data. The cross-sections reported by xsec must thus be seen as wholly unreliable for squark masses below 50 GeV.
In NLO QCD, cross-sections for the two sets of process pairs ($\tilde q_L\tilde q_L^*$, $\tilde q_R\tilde q_R^*$) and ($\tilde q_L\tilde q_R^*$, $\tilde q_R\tilde q_L^*$) differ within each set only by an exchange of the appropriate squark masses. Removing also charge-conjugate states, an initial number of 64 independent processes $\tilde q_{L/R}\tilde q^*_{L/R}$ therefore reduces to 20 unique cross-sections. To save training time and user disk space, we reuse the DGPs for the identical processes in xsec simply by employing symbolic links and mapping the masses accordingly. This is, however, invisible to the user.
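The reuse of DGPs for NLO-identical processes amounts to a lookup from each process to a canonical one, possibly with the squark mass features exchanged. A hypothetical sketch, where the process labels and mapping table are illustrative, not the actual xsec file layout:

```python
# Each entry: process -> (canonical process, whether to swap the two masses).
CANONICAL = {
    ("uL", "uL*"): (("uL", "uL*"), False),
    ("uR", "uR*"): (("uL", "uL*"), True),   # identical after mass exchange
    ("uL", "uR*"): (("uL", "uR*"), False),
    ("uR", "uL*"): (("uL", "uR*"), True),   # identical after mass exchange
}

def canonical_features(process, m1, m2):
    """Return the canonical process whose trained DGP should be evaluated,
    together with the (possibly exchanged) mass features."""
    canon, swap = CANONICAL[process]
    return canon, ((m2, m1) if swap else (m1, m2))
```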
In Fig. 12 we show the predicted first and second-generation squark-anti-squark production cross-sections as a function of the mean first and second-generation squark mass $m_{\tilde q}$ (left) and the gluino mass (right). We show results for a selection of sub-processes, and for the total cross-section (black, rescaled for readability). All other masses are kept at 1 TeV. We see that xsec reliably predicts the contribution from individual squark final states, although at high squark masses the PDF error (dotted line) for some processes is consistent with zero cross-section. We also see that xsec correctly captures the contribution of the gluino t-channel diagram, which controls the cross-section when the final-state squarks have different chirality, leading to the peak in Fig. 12 (right) for L-R combinations when $m_{\tilde g} \approx m_{\tilde q}$.
In Fig. 13 we show the distributions of the relative regression error and residual (Eqs. (39) and (40)) for the points in the test sets $D_\text{test}$ and $D_\text{MSSM-24}$. We have normalised all distributions to unity. The comparison to the unit normal distribution included in Fig. 13 (right column) shows that for both test sets, and all processes, the xsec regression error is conservative with respect to the true error.
The relative regression error in Fig. 13 (left column) is below 10% for all processes for the vast majority of test points. There is again a slight tendency for xsec to overestimate the Prospino cross-section values, in particular for the $D_\text{MSSM-24}$ test set. As this set has individual flat priors for all the squark (soft) masses, which only a subset of $D_\text{test}$ has, we expect it to be more challenging to reproduce, as its points are more likely to lie on the outskirts of the validation region.
The relative error distributions are narrowest for the processes producing a squark-anti-squark pair of the same type, i.e. for the four $\tilde q_L\tilde q_L^*$ processes and the corresponding $\tilde q_R\tilde q_R^*$ processes (not shown). These cross-sections are easier to model, as the final state involves only a single mass parameter. The $\tilde s_L\tilde s_L^*$ and $\tilde c_L\tilde c_L^*$ processes have particularly small relative errors. This is likely due to the smallness of the proton PDFs for the s and c quarks, which effectively makes the gluino t-channel diagram irrelevant and thus further simplifies the parameter dependence.

First and second-generation squark pair production
This section looks at the validation of DGPs trained to predict cross-sections for squark-squark pair production, $\tilde q_{L/R}\tilde q^{(\prime)}_{L/R}$. As should be clear from the notation, the flavours of the pair may be identical or different, and all four combinations of squark handedness are treated as separate processes. The processes are always assumed to include the charge-conjugate state. Again, we discuss only final states with first and second-generation squarks, where at LO the final-state flavour comes from first and second-generation quarks in the proton.
As for squark-anti-squark production, within the limitations set by the design of Prospino, the LO first and second-generation squark-squark pair-production cross-sections depend on the mass(es) of the final-state squarks and the gluino mass, and the NLO corrections depend further on the mean mass of the first and second-generation squarks. We again validate our cross-sections on the sub-interval [200, 3000] GeV of the training data, for both gluino and squark masses.
If the squark masses are interchanged appropriately, the cross-sections for the process pairs ($\tilde q_L\tilde q'_L$, $\tilde q_R\tilde q'_R$) are identical in Prospino, as are those for the pairs ($\tilde q_L\tilde q'_R$, $\tilde q_R\tilde q'_L$). The 64 independent processes $\tilde q_{L/R}\tilde q^{(\prime)}_{L/R}$ can therefore be reduced to 20 unique cross-sections. Again we use symbolic links to reuse DGPs for processes with identical cross-sections.
In Fig. 14 we show the predicted first and second-generation squark-squark production cross-sections for a selection of sub-processes, as a function of $m_{\tilde q}$ and $m_{\tilde g}$. All other masses are kept at 1 TeV. For readability, the total cross-section is rescaled. We see that xsec reliably predicts the contribution from individual squark final-state flavours, and captures the contribution of the gluino t-channel diagram, which depends on the chirality of the final-state squarks, giving the qualitatively different behaviour of the cross-section as a function of the gluino mass seen in the right panel of Fig. 14.

In Fig. 15 we show the corresponding distributions of the relative error and residual between the xsec prediction and the Prospino central cross-section values in $D_\text{test}$ and $D_\text{MSSM-24}$, for each individual squark-squark process. The residuals indicate that the regression errors from xsec across all squark-squark processes are generally conservative compared to the actual difference between the true and predicted cross-sections. The relative errors are below 10% for the large majority of test points, across all processes. There is some interesting process-dependent structure, however. For processes with two identical final-state squarks, e.g. $\tilde u_L\tilde u_L$ or $\tilde s_L\tilde s_L$, the relative error is quite small, due to the dependence of the cross-section on only three instead of four masses, making it easier to predict reliably. Processes with chirally-matched squarks, i.e. LL or RR final states, have smaller errors than LR final states. This is due to a difference in the LO matrix element: the LR final states depend on $(t - m_{\tilde q}^2)(u - m_{\tilde q}^2) - m_{\tilde q}^2 s$, whereas the LL and RR final states are proportional to $m_{\tilde g}^2 s$ [39]. This more complicated kinematic dependence means that the LR final states are harder to train and show larger relative errors.

Stop and sbottom pair production
Prospino includes PDF contributions only from light quarks and gluons. This means that production of third-generation squarks is purely s-channel in our training samples, so only flavour-neutral squark-anti-squark final states are available, i.e. there are no processes with third-generation squarks in gluino-squark or squark-squark production. However, the third-generation mass eigenstates used by Prospino include left-right mixing, quantified through the sbottom and stop mixing angles $\cos\theta_{\tilde{b}}$ and $\cos\theta_{\tilde{t}}$, where $\cos\theta_{\tilde{t}} = 1$ implies that $\tilde{t}_1, \tilde{t}_2 = \tilde{t}_L, \tilde{t}_R$, and similarly for sbottoms. Furthermore, Prospino does not calculate cross-sections for the production of mixed $\tilde{t}_1\tilde{t}_2^*$ + c.c. (or $\tilde{b}_1\tilde{b}_2^*$ + c.c.) states, as their cross-sections are of order $\alpha_s^4$, resulting in negligibly small rates [9]. As a result, xsec is limited to training on the third-generation processes $\tilde{b}_1\tilde{b}_1^*$, $\tilde{b}_2\tilde{b}_2^*$, $\tilde{t}_1\tilde{t}_1^*$ and $\tilde{t}_2\tilde{t}_2^*$.

For the stop and sbottom pair-production cross-sections, only the final-state mass enters at LO as a parameter (feature). At NLO in QCD, i.e. up to $\mathcal{O}(\alpha_s^3)$, the cross-sections depend on the stop or sbottom final-state mass, the gluino mass, the averaged first and second-generation squark mass, and the stop or sbottom mixing angle, in roughly descending order of importance. Unlike the processes involving first and second-generation squarks, the sbottom and stop cross-sections also depend on all the individual sbottom and stop masses through loops. However, for sbottom (stop) final states, the dependence on the stop (sbottom) masses was found to be very small, and we do not use them as features in the GPs.
For the NLO contributions involving heavy-quark loops, we have used top and bottom masses of $m_t = 172.0$ GeV and $m_b = 4.6$ GeV. However, we have checked that the effect of changing these within current experimental uncertainties is numerically irrelevant.
Due to the symmetry of the cross-section expressions at NLO, the $\tilde{t}_1\tilde{t}_1^*$ and $\tilde{t}_2\tilde{t}_2^*$ cross-sections are related by an appropriate interchange of the masses and mixing angle, and similarly for sbottoms. There is also very little difference between the sbottom and stop cross-sections for the same masses. Nevertheless, xsec contains separate DGPs for all four non-zero stop and sbottom pair-production processes included in Prospino, in preparation for future extensions. Here we will focus our validation tests on the $\tilde{t}_1\tilde{t}_1^*$ cross-sections, and only discuss the other cross-sections when there are significant differences. We validate the xsec output for the third-generation squarks over a slightly larger mass range than for the other processes ([100, 3000] GeV), as there is still considerable interest in light stops.
In Fig. 16, we show the stop pair-production cross-section predicted by xsec as a function of the stop mass, which is by far the dominant parameter in determining the cross-section. We also show the predicted scale and PDF uncertainties, and compare to values taken directly from Prospino. In generating the plot, we have set all stop and sbottom masses degenerate, $\cos\theta_{\tilde{t}} = 1$, and all other masses to 1 TeV. At low stop masses the scale error dominates, while at high masses it is the PDF error, as expected. Throughout the validation range the regression error is subdominant.
The gluino mass can be important for contributions that appear at NLO. In Fig. 17 we show the dependence of the $\tilde{t}_1\tilde{t}_1^*$ cross-section on the gluino mass, adopting a stop mass of $m_{\tilde{t}_1} = 800$ GeV, and 1 TeV for all other masses. Clearly, xsec fully captures the variation in the cross-section due to the gluino contribution at low gluino masses, although this dependence is rather small compared to the scale, PDF, and even $\alpha_s$ uncertainties, which we also show. The Prospino result seems to have some numerical jitter at high gluino masses, although the effect lies well within the regression uncertainty band.
In Fig. 18, we show the stop pair-production cross-section for $m_{\tilde{t}_1} = m_{\tilde{t}_2} = 800$ GeV as a function of the average of the first and second-generation squark masses. To maximise the potential effect from squark loops, the sbottom masses are fixed to the same value as the first and second-generation squark masses. We fix the mixing angle to $\cos\theta_{\tilde{t}} = 1$, while the gluino is decoupled at 3 TeV to remove the more dominant NLO contributions from gluino exchange. We see that the dependence of the cross-section on the other squark masses is so weak that the DGP is relatively insensitive to it. However, we again observe that the Prospino results show some jitter at high squark masses. After some investigation of the sampling, we found that this jitter increases as the gluino mass is increased, and we tentatively interpret it as a numerical problem in the gluino-decoupling regime in Prospino.
We present $\tilde{t}_1\tilde{t}_1^*$ and $\tilde{t}_2\tilde{t}_2^*$ production cross-sections as a function of the mixing angle $\theta_{\tilde{t}}$ in Fig. 19, for $m_{\tilde{t}_1} = m_{\tilde{t}_2} = 800$ GeV and all other sparticle masses set to 1 TeV. The dependence on the mixing angle is so small (of the order of 2%) that the limited resolution of the regression means the DGPs cannot currently capture this behaviour. The error is nonetheless well within that reported by xsec.
Looking at the stop and sbottom residual distributions in $\mathcal{D}_{\mathrm{test}\text{-}tb}$ and $\mathcal{D}_{\mathrm{MSSM}\text{-}24}$ (right panel of Fig. 20), we see that they have somewhat longer tails than the other process types. The most notable feature is at large negative z, indicating a slight tendency to overestimate the cross-section reported by Prospino. This is driven by points with large (> 2 TeV) gluino masses, where we earlier observed jitter in the training data. Overall, however, the regression errors returned by xsec for the stop and sbottom processes are conservative. The corresponding relative errors are below 10% for the vast majority of test points, with some tails. Finally, in Fig. 21, we show the relative error between the xsec-predicted $\tilde{t}_1\tilde{t}_1^*$ production cross-section and the Prospino value, in the planes of common stop mass and gluino mass (left), and stop mass and mixing angle (right). In these plots we set all other masses to 1 TeV, and in the left-hand plot we fix $\cos\theta_{\tilde{t}} = 1$.
We find that xsec predicts the values with better than 10% accuracy in the majority of this space. However, the left-hand panel again shows the problem with numerical jitter from Prospino at large gluino (and stop) masses. We also observe a small localised area of underestimated cross-sections (positive relative error) for $m_{\tilde{t}_1} \sim 1100$ GeV and $m_{\tilde{g}} \sim 1000$ GeV, seen in both plots. This region can also be found in Fig. 16 as a single point where the residual is at the 2σ level.

Speeding up cross-section evaluation
To achieve fast cross-section predictions, the xsec code ships with fully trained GP models. These contain all relevant information about the training points, such as their inverse covariance matrix and the optimised kernel hyperparameter values. The main program, written in Python, provides a simple user interface for predicting cross-sections and their uncertainties at a given set of input parameters.
With the time-consuming processes of sampling and training carried out beforehand, two important steps remain in order to obtain predictions for a new parameter point:
1. performing the matrix-vector multiplications given in Eqs. (21) and (22) to compute predictions from each individual GP expert;
2. weighting and combining the individual predictions according to the GRBCM prescription, leading to Eq. (31).
As the aggregation step requires no matrix algebra, the first step dominates the computational cost of prediction; its complexity scales as the square of the number of training points of the GP expert. The code relies on NumPy [35] functionality for these matrix operations. Significant speed gains are therefore possible if NumPy is properly linked to optimised numerical algebra libraries such as BLAS and LAPACK. Large performance differences exist between different BLAS implementations. Popular ones like Intel MKL, OpenBLAS and ATLAS support multi-threaded calculations, enabling NumPy to take advantage of multi-core machines and xsec to achieve the highest evaluation speeds. The NumPy package provides a function show_config() that lists the numerical libraries detected on the system it was built on.
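The per-point cost described above can be illustrated with a minimal single-expert GP prediction in NumPy, using a precomputed inverse covariance matrix. The kernel, data, and dimensions below are placeholders, not xsec's trained models (which use a Cholesky decomposition and different kernels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one GP expert's training data
X = rng.uniform(200.0, 3000.0, size=(50, 2))      # e.g. (gluino mass, squark mass)
y = np.log(1.0 / (X[:, 0] * X[:, 1]))             # placeholder target values

def rbf(A, B, length=500.0):
    """Squared-exponential kernel (xsec's actual kernels differ)."""
    d2 = (((A[:, None, :] - B[None, :, :]) / length) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))  # stored at training time
alpha = K_inv @ y                                         # also precomputable

def predict(x_new):
    """Per-point prediction: cost dominated by O(n) and O(n^2) products."""
    k_star = rbf(x_new[None, :], X)[0]
    mean = k_star @ alpha                  # Eq. (21)-style mean prediction
    var = 1.0 - k_star @ K_inv @ k_star    # variance; k(x, x) = 1 for this kernel
    return mean, var

mean, var = predict(np.array([1000.0, 800.0]))
```

The matrix-vector product `K_inv @ k_star` in the variance is the $O(n^2)$ operation that BLAS acceleration speeds up; the mean only needs the $O(n)$ product with the precomputed `alpha`.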
The xsec code is compatible with Python 2 and 3. While the main program consists of the six interdependent modules shown in Fig. 22, the intended usage requires no explicit knowledge of this underlying structure. All high-level functions of relevance to the user are immediately accessible upon installing and importing the xsec package, as detailed in the following sections.

Installation
The xsec program is available from the Python Package Index (PyPI) and can be installed immediately with the pip package manager, by executing pip install xsec in a terminal. Alternatively, the user may wish to access the code directly from its GitHub repository and install a local copy:

git clone https://github.com/jeriek/xsec.git
pip install ./xsec

Due to the size of the trained GP data files, these are not contained in the repository, nor provided directly by pip. Instead, they are available as separate binaries in the GitHub release. We provide a script for their automatic download and extraction into a directory of choice, called by running

xsec-download-gprocs -g gp_dir -t process_type

on the command line. Here, gp_dir is the chosen target directory; a new directory gprocs is created in the current working directory if this argument is omitted. The other optional argument, process_type, specifies the final-state type for which data will be downloaded: gg (gluino pair production), sg (1st/2nd gen. squark-gluino pair production), ss (1st/2nd gen. squark pair production), sb (1st/2nd gen. squark-anti-squark pair production), tb (3rd gen. squark-anti-squark pair production), or all (default).

Python interface
The primary envisaged use of xsec is as a component in a parameter scan code, where the user chooses the desired particle production processes and varies the input parameters. Appendix A contains a simple example script in Python where these parameters are set by hand. Examples of SLHA file input and usage within a loop over sparticle masses are available in the GitHub repository.
To ensure that any time-consuming computations are only run when strictly necessary, the prediction process is split into four steps:

Initialisation
During the installation (see Sec. 5.2), GP model files for all included processes are downloaded to a local directory gp_dir of choice. At the start of a Python script, xsec must first be pointed to that directory:

import xsec
xsec.init(data_dir="gp_dir")

Process selection
Subsequently, the centre-of-mass energy must be set (in GeV) through

xsec.set_energy(13000)

Currently, only the 13 TeV dataset is available; we expect to add further energies in the near future. Then, one or more sparticle production processes can be specified. A single process is identified by a tuple

process = (pid1, pid2)

containing the PDG codes of its final-state particles.
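For instance, using the standard PDG Monte Carlo numbering scheme (the codes below are the standard PDG assignments, with negative codes denoting anti-sparticles; consult the list of trained processes for what xsec actually provides), a process list might look like:

```python
# PDG Monte Carlo codes for some strongly produced sparticles
GLUINO = 1000021
SDOWN_L = 1000001   # ~d_L
SUP_L = 1000002     # ~u_L
STOP_1 = 1000006    # ~t_1

processes = [
    (GLUINO, GLUINO),     # gluino pair production
    (GLUINO, SDOWN_L),    # gluino-squark production
    (SUP_L, -SUP_L),      # squark-anti-squark production
    (STOP_1, -STOP_1),    # stop pair production
]
# Each entry has the (pid1, pid2) form expected by the loading function
```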
A list of such tuples is required as the argument to the loading function:

xsec.load_processes([process1, process2, ...])

This function call decompresses the GP model files (serialised into Python pickles to save space) for the specified processes, and loads them into memory.
This step only needs to happen once per process. It may take some time, as it involves a matrix multiplication to reconstruct the inverse covariance matrix of the training points from its Cholesky decomposition.
To show a list of all available trained processes, one can use

xsec.list_all_xsec_processes()

Parameter input
The next step is to set the values of all relevant parameters for the selected processes. The relevant parameters are shown in Table 1. There are three ways to enter the parameters: manually setting their values one by one, setting multiple values at once with a dictionary, or reading all parameters from an SLHA file describing the supersymmetric spectrum.
The returned errors are always relative to the central cross-section, and a minus sign in the print-out makes it easy to distinguish the lower from the upper error bounds. Note that currently in xsec the PDF and $\alpha_s$ errors are symmetric by definition, following the PDF4LHC recommendations [48], whereas the regression and scale errors are not. In order to return a symmetric $\alpha_s$ error, we average the outputs from our two GPs trained separately on the upper and lower errors. This can be changed in the code by more advanced users if desired (for example, to allow $\alpha_s$ to be treated more accurately as a nuisance parameter to be scanned over). The regression error is asymmetric because it comes from a log-normal distribution, while the lower (upper) scale error derives from the minimum (maximum) cross-section value obtained by doubling and halving the renormalisation/factorisation scale.
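A small numerical sketch of these conventions (all numbers and variable names are illustrative, not xsec internals): a GP predicting in log-space yields an asymmetric band after exponentiation, while the $\alpha_s$ error is symmetrised by averaging the two dedicated GPs:

```python
import math

# GP regression output in log-space (illustrative numbers, in log(pb))
mu, sigma = math.log(0.35), 0.04

central = math.exp(mu)
# Log-normal regression band: asymmetric relative errors in linear space
reg_lower = math.exp(mu - sigma) / central - 1.0   # negative bound
reg_upper = math.exp(mu + sigma) / central - 1.0   # positive bound, slightly larger

# Symmetrised alpha_s error: average the separately trained upper/lower GP outputs
alphas_up, alphas_down = 0.012, 0.010              # illustrative relative errors
alphas_err = 0.5 * (alphas_up + alphas_down)
```

Exponentiating $\mu \pm \sigma$ is what makes the upper relative error larger in magnitude than the lower one, matching the log-normal behaviour described above.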
The results from an evaluation for a given parameter point can be added to an existing SLHA file in the XSECTION block (as proposed at https://phystev.cnrs.fr/wiki/2013:groups:tools:slha):

results = xsec.eval_xsection()
xsec.write_slha("slha_path", results)

The XSECTION structure does not allow for storing the regression error, so this is omitted. Furthermore, we do not predict individual cross-section values for all the different members of the PDF set used. Thus, in order to provide the PDF error, we follow the PDF4LHC guidelines [48] and give the lower (upper) bound of the 68% confidence interval by incrementing the central PDF set index by 1 (2) in the XSECTION block.
Finally, it is recommended to run

xsec.finalise()

after all evaluations have been completed. This creates a BibTeX file in the current working directory, listing references to all original work that has been used to provide the requested results.

For advanced users, we include an option to write the decompressed GP files to disk and memory-map them for quicker access. This can reduce the memory load when many processes are requested simultaneously. The cache option can be activated by specifying

xsec.init(data_dir="gp_dir", use_cache=True, cache_dir="cache_dir")

in the initialisation step. The keyword specifying the cache directory is optional; by default a temporary directory with a random name is created. The cache directory is removed when finalise() is called. Another option to decrease the memory load when using xsec for many final states is to load the data for only a few processes at a time, make predictions, and clear the memory before loading new processes. The latter can be done with

xsec.unload_processes([process1, process2, ...])

If called without arguments, the data for all previously loaded processes will be cleared from memory.

Command-line interface
We also provide a command-line interface, where the user can supply a set of two final-state PDG codes and the values of the relevant features (parameters, including the averaged first and second-generation squark mass) for the corresponding cross-section. This is invoked as

xsec 13000 1000021 1000001 -p 1000 500 500 -g gp_dir

This example calculates the cross-section for $\tilde{g}\tilde{d}_L$ production at $\sqrt{s} = 13$ TeV, using the GP data directory gp_dir (see Sec. 5.2). The features in the call and their order (in this case $m_{\tilde{g}}$, $m_{\tilde{q}_i}$ and the averaged squark mass, all in GeV) can be gleaned from

xsec 13000 1000021 1000001 --show-input

Alternatively, the parameters can be read from an SLHA file located at slha_path:

xsec 13000 1000021 1000001 -r slha_path -g gp_dir

Executing xsec --help in a terminal provides more details on all possible inputs. We do not recommend using the command-line interface in time-sensitive applications requiring multiple cross-section evaluations for the same process. Every evaluation from the command line invokes the (inevitably slow) load_processes() step, whereas a lot of time can be saved by writing a Python script where this function is called only once.

Code structure
For the interested user we briefly describe the structure of the code and the content of the various modules. Our design objective was to keep the user interface intuitive and simple enough to be readily integrated within a larger code, even in another programming language via binding libraries.
The parameters module manages a global dictionary of parameter values, and has functions that allow for setting, getting and clearing those values. Furthermore, there is a function that checks the internal consistency of the entered values, ensuring that the mean squark mass and the centre-of-mass energy are set correctly. SLHA interface functions are also collected in this module, enabling one to read parameter values from an SLHA1 file and to write the cross-section results to an XSECTION block in that file.
The features module keeps track of the set of parameter values relevant to each specific production process, using a global dictionary and functions that access it.
The gploader module contains functions and global variables related to initialisation settings and to loading pre-trained GPs into memory from pickled files. When using the cache option described in Sec. 5.3, this module is responsible for managing the cache directory on the disk.
The implementation of the GP kernels resides within the kernels module. These are mostly reproduced from the scikit-learn v0.19.2 [54] source code under the New BSD License, apart from some new kernels and a function that reconstructs kernel function objects from the kernel parameters stored in the pickled GP files.
All functions responsible for GP predictions and for combining results from multiple GPs using the GRBCM prescription are collected in the evaluation module. Each production process and cross-section type has its own data directory, containing a transform module used by the evaluation module, and the relevant data files. The transform module reverses the specific transformations applied to the target values during training.
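As an illustration of what such a transform module does (xsec's actual transformations are process-specific and not reproduced here), suppose the targets had been trained as the base-10 logarithm of the cross-section:

```python
import math

def forward(sigma_pb):
    """Hypothetical training-time transform: regress on log10 of the cross-section."""
    return math.log10(sigma_pb)

def inverse(y):
    """What a transform module must supply: undo the transform on GP predictions."""
    return 10.0 ** y

# Round-trip check on an illustrative cross-section value (in pb)
sigma = 0.035
roundtrip = inverse(forward(sigma))
```

Keeping the forward and inverse maps paired per process is what lets the evaluation module stay agnostic about how each target was preprocessed.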
Internal helper functions are defined in the utils module. These are mostly related to the internal naming system for Gaussian process model files, but also include utility functions for outputting results to screen and file, as well as for collecting the BibTeX references relevant to the requested processes.

Conclusions
In this paper, we have presented a new phenomenology code, xsec, which allows for the fast evaluation of higher-order cross-sections through regression on pre-generated training datasets. The regression in xsec is based on Generalised Robust Bayesian Committee Machines, a method that employs distributed Gaussian processes. In addition to the central value and the related regression variance provided by the Gaussian processes, we have trained separate Gaussian processes on the scale, PDF and $\alpha_s$ errors, providing a complete ecosystem for the evaluation of cross-sections and their uncertainties.
The current version, xsec 1.0, includes all MSSM strong-production cross-sections at $\sqrt{s} = 13$ TeV at NLO precision in QCD, separated into individual squark-flavour final states, and allows for non-degenerate squark masses. We plan future updates of xsec that will extend the code both in terms of the included energies and processes, and with higher-order corrections. The method used here can also be extended to other models, so long as appropriate training sets can be generated. The production of supersymmetric particles in the MSSM serves as a good starting point, however, as the absence of s-channel resonances simplifies the training. We would like to emphasise that the xsec code described here is not a new calculation of the included cross-sections. It is a regression estimate based on a pre-generated sample of cross-sections taken from existing results. Thus, users of this code should also reference the original physics results on which the xsec output is based when using it in publications. We provide functionality within the code for easily identifying the relevant references.