Reweighting a parton shower using a neural network: the final-state case

The use of QCD calculations that include the resummation of soft-collinear logarithms via parton-shower algorithms is currently not possible in PDF fits due to the high computational cost of evaluating observables for each variation of the PDFs. Unfortunately the interpolation methods that are otherwise applied to overcome this issue are not readily generalised to all-order parton-shower contributions. Instead, we propose an approximation based on training a neural network to predict the effect of varying the input parameters of a parton shower on the cross section in a given observable bin, interpolating between the variations of a training data set. This first publication focuses on providing a proof-of-principle for the method, by varying the shower dependence on $\alpha_\text{S}$ for both a simplified shower model and a complete shower implementation for three different observables, the leading emission scale, the number of emissions and the Thrust event shape. The extension to the PDF dependence of the initial-state shower evolution that is needed for the application to PDF fits is left to a forthcoming publication.


Introduction
With the Large Hadron Collider (LHC) successfully undergoing its second run and the possible upgrade to the so-called High-Luminosity LHC, the produced collision datasets reach new levels of precision. This requires an ongoing effort to provide more precise theory predictions. A sizeable part of the overall theory uncertainty is often given by the degree to which we know the parton content of the incoming protons, parametrised by the parton density functions (PDF) [1][2][3][4], see also Ref. [5] and references therein for a recent review. In order to increase the precision of PDF determinations, it is clearly desirable to be able to include as many observables as possible in the fits. However the χ 2 minimization in these fits needs multiple re-evaluations of the underlying observables in order to converge. As a result there are strict constraints on the CPU time each re-evaluation costs: observables can be included in fits only if there is an efficient way to compute them as PDFs are varied. For instance, in nnpdf fits, all observables are written as convolutions of FK tables and PDFs at the initial scale as discussed in detail in Ref. [6]. Similarly, some results provided by Monte-Carlo event generators, e.g. NLOJet++ [7], MadGraph5_aMC@NLO [8], mcfm [9] or sherpa [10], can be projected onto an interpolation grid, which allows fast a-posteriori variations of the input parameters, because the sum over simulated events is replaced by a sum over a much smaller number of observable-dependent weights [11][12][13][14][15][16]. Unfortunately, as explained later in this work, this approach cannot be easily extended to capture the input-parameter dependences of the all-order predictions obtained by the parton-shower algorithms in Monte-Carlo generators.
Instead we present here an approximate approach to parametrise the parton-shower dependences in a way that allows for a fast, a-posteriori reweighting of the observable. Because we are dealing with binned observables, the output of the parton shower is the number of events in a given bin. In this context 'reweighting the observable' means finding the relative weight of each bin as the input parameters change, i.e. recomputing the value of the observable in each bin. Once the relative weight is known, the number of events in a bin can be easily recomputed by multiplying its original value by the new weight. A quick summary of the method is as follows. First, the calculation is repeated for a given set of input parameters. The variation of the result (in a given observable bin) across this set is then used to train a neural network (NN), effectively fitting the unknown functional form that encodes the dependences of the parton shower on the input parameters. This NN can then be used to obtain efficiently an interpolation of the observable for arbitrary values of the input parameters, so that it is suitable to use this methodology in studies that require fast a-posteriori variations.
NN techniques have been successfully applied or used exploratively in a number of topics in collider phenomenology, often with much more complex NN architectures than what we employ in this work. The topics include jet tagging/particle identification [17][18][19][20][21][22][23][24][25], event classification [26][27][28], phase-space integration [29], pile-up mitigation [30], simulating electromagnetic showers in a calorimeter [31], parameter space scans for New Physics searches [32], and of course PDF fitting [1]. Moreover, a deep NN has been proposed to mimic a parton shower algorithm [33]. However, this latter ansatz can not be applied to our goal of using all-order results in PDF fits, as it is applied on an event-by-event basis just as an ordinary parton-shower algorithm, whereas PDF fits require projections of the cross section on observables in order to achieve the fast evaluation times needed for the fit. The interplay of parton showers and NN has also been studied in [34], in which the authors investigate the role of parton-shower uncertainties and approximations in NN-based jet substructure analyses. In [35] an NN is used to determine the effective correction to a Sudakov form factor in a minlo calculation of single-top plus jet production.
In this exploratory study, we restrict ourselves for simplicity to final-state parton showers. While these are not dependent on PDFs, the problems that need to be addressed are the same that would appear in a PDF fit. This simplified setting allows us to test our ideas without getting bogged down in technicalities. The same approach can be readily generalised to include PDF-dependent initial-state emissions. In the latter case, having to deal with a much larger space of parameters entails a number of algorithmic issues that will be addressed in future studies.
In Sec. 2, we summarise the main steps in the algorithm for reweighting a parton shower on-the-fly, highlighting the fact that a straightforward generalisation of the interpolation approach used e.g. in Ref. [14] does not seem feasible. Understanding the limitations of that approach is particularly useful in order to motivate our choices on how to set up and train neural networks to reproduce this reweighting. We formulate our NN-based approach in Sec. 3 and describe the toy parton-shower implementation used for validating the approach. The validation method and its results are presented in Sec. 4. The approach is then further tested in Sec. 5 with a full shower implementation given by the default sherpa parton shower for the prediction of the Thrust event shape at a lepton-lepton collider. Finally, we give our conclusions in Sec. 6.

On-the-fly reweighting of parton-shower emissions
Parton-shower algorithms generate exclusive parton emissions starting from a simulated event of a given high-energy process. They are based on a re-formulation of the dglap evolution [36][37][38] using Sudakov form factors ∆ with an emission kernel K [39]. The Sudakov form factor gives the probability that no (resolvable) emission occur between two emission scales t low < t high : We will further specify t and K when we describe our simplified shower model in Sec. 3.1.
For now, we only need to know how parton-shower algorithms numerically generate emissions between t 0 , the starting scale of the shower usually given by a characteristic scale of the high-energy event simulated in fixed-order perturbation theory, and t IR , the infra-red cut-off scale of the parton shower, where the evolution would be typically handed over to a fragmentation algorithm to hadronise the low-energy partons.
The first step is to find the splitting scale t for the next emission. To achieve this, a random number could be used to sample the kernel K(t) of Eq. (1). However, doing this in a direct way requires K to be integrable and invertible. This is not the case for most parton-shower kernels. A way around this is the Sudakov Veto Algorithm [40][41][42][43][44][45]. Herein, one replaces K with a new kernelK, for which we know the integralK and its inverse, and which satisfiesK(t) ≥ K(t) for all t. The algorithm then goes as follows: 2. Set t →K −1 (log(R 1 ) +K(t)) with a random number R 1 , stop if t < t IR .
3. Set P acc → K(t)/K(t). If P acc > R 2 for a new random number R 2 , the emission is accepted.

Return to
Step 2.
The hit-or-miss Step 3 counter-balances sampling the "wrong" kernelK in Step 2. When the algorithm stops, we have a list of scales t acc i for the generated (i.e. accepted) emissions and a list of scales t rej i for the rejected emissions. We can call this the parton-shower history of the event in terms of t. 1 Let us ask the following question now: if we have a given shower history generated according to a kernel K, what are the relative probabilities to generate the same histories for a set of alternate kernels K k ? It turns out that these probabilities are given by [45] where we have defined the reweighting factors The acceptance probability P acc in the second form given for q k rej is defined as in Step 3 in the Sudakov Veto Algorithm.
In an event generation, storing the relative probabilities w k then allows to reconstruct the spread of an observable under the variations labelled by k in a more efficient way than if we would do separate re-runs of the simulation for each K k . This parton-shower reweighting is implemented in the three most commonly used parton-shower implementations [46][47][48]. It complements reweighting strategies for fixed-order calculations and allows a comprehensive reweighting of the perturbative parts of an event generation when the interplay between matrix elements and parton showering is properly accounted for in the reweighting [48].

The troubles with exact a-posteriori approaches
The applicability of the parton-shower reweighting described in the previous section is restricted to cases, where the required variations are known beforehand. In principle one could store all parton-shower history data needed to perform reweightings a-posteriori, but as each history will typically have O(10) accepted and O(100) rejected emissions, this is not practical, in particular because usually the reweighting will not only depend on the scales t, but also on additional kinematic variables of exclusive emissions, along with the emission channel (g → gg, g → qq, etc.), and possibly the Björken x for initial-state emissions to be able to evaluate PDF ratios that occur [48]. This is a different situation from the one we face when reweighting e.g. NLO and even NNLO calculations, where a much smaller number of values per event is required to facilitate an exact a-posteriori reweighting [49][50][51]. But even for fixed-order calculations there are applications for which an event-wise reweighting is not fast enough, as the number of events is large and possibly tens of thousands of variations are required. This is the case for instance for PDF fits, where the observables need to be constantly recomputed along the minimisation process. This is overcome in the case of fixed-order calculations by averaging over classes of events that fall into the same observable bin and reweight in the same way. By the projection onto the observable the individual event kinematics information can be discarded. The required information can be encoded using an interpolation grid in a reduced number of kinematic variables (typically the Björken x 1,2 and the factorisation scale µ 2 F , such that PDF variations can be done) [11][12][13][14][15][16].
Can we discard the individual parton-shower history information in a similar way? For example, we may want to reweight the strong coupling α S , given that K(t) ∝ α S (t). Unfortunately, in the Sudakov Veto Algorithm, we can not just factorise the ratios K k /K that occur in Eqs. (2) and even then the number of thus factorised ratios would strongly vary. Compare this to fixed-order events, where the dependence on α p S factorises trivially, and there is only a very limited set of powers p.
Can the combination of interpolation grids and the classification of parton-shower histories by their similarity with respect to the reweighting provide a way to reduce the amount of data needed? If we bin the t acc i in bins β acc and the t rej i in bins β rej , we can write an approximation for Eq. (2) (note that we drop the variation label k for now to improve readability): where t β is the value of t corresponding to the bin β, and w β is the number of emissions that fell into bin β. We could then jump to the conclusion that if we can find a way to classify similar parton-shower histories into classes c that behave similarly under variations, we could write Suppose we calculate the w β c for every t-histogram bin β during a pre-production run, and the relative proportions of events r c that feature a parton-shower history that is classified into c. We could then calculate the effect of the parton shower variation by replacing the nominal cross section σ with However, this ansatz has a critical flaw, which is the nature of the average on the left-hand side of Eq. (3). The arithmetic means on the right-hand side are in the exponent of the reweighting functions, and thus the left-hand side mean is a geometric mean. It follows that Unfortunately we need to know the right-hand side of this last equation, but for that we need to retain the individual shower history information. And hence this classification ansatz to discard that information fails.
Lacking a straightforward analytical way to solve the problem, we will therefore in the following present a proof-of-principle for using a simple neural network to find the bin-wise reweighting factors needed to vary the parton shower. The problem can be formulated in the following way. Let us assume that we want to reweight the cross section of an observable bin b as we vary a set of parameters that we will collectively denote θ. For each event generated in the Monte Carlo sample, which we label with an index i, there is a reweighting factor w b,i . As shown in Eq. (2), the event-wise reweighting factor w b,i is a functional of q acc (t; δθ), where we have written explicitly the dependence of the latter on the variation of the parameters, δθ. Different variations of the parameters yield different functions q acc (t; δθ), and hence different reweightings.
The key assumption underlying this study is that we can ignore the details of the shower history, and therefore the analytical expression for w b,i . Instead we model the dependence of the average reweighting factor for a given bin, w b , on the function q acc (t) using a neural network. After training on a discrete set of variations {δθ (k) } labelled by k ∈ T , we expect the NN to be capable of interpolating to the correct value of the reweighting factor for a generic variation. All dependencies should be sufficiently smooth that the NN can produce a satisfactory interpolation. The output of the NN can be validated against correct reweighting factors before using it in an application.
3 Neural-network approach to a-posteriori parton-shower reweighting

Simplified shower model
To formulate and validate our approach it is sufficient to use a simplified parton-shower model. First, let us further specify the kernel we use in the Sudakov form factor: The variable z gives the energy ratio between the mother parton a and its daughter b.
We define t to be such that z(z − 1)t is proportional to the transverse momentum squared of the emitted particle, p 2 T,b (with respect to a). Beyond that, the precise definition of t and the emission kinematics are not important for our purposes. However, together with the requirement t > t IR this allows us to specify what a resolvable emission is, by setting the integration limits for z using ε(t) = t IR /t. Thus, the kinematic limits on z ensure that the transverse momenta of all generated emissions is larger than t IR . This also takes care of the singularities that appear in some of the dglap splitting functions P ab = P a→bc at z = 0 and z = 1. In our simple shower, we only use the following three LO dglap splitting functions as they are listed in [52]: It remains to define the scale µ 2 (t) at which the strong coupling α S is evaluated. In the most simple shower model, one could just set it to µ 2 (t) = t. However, we do not want to move the needle too far in the direction of a simplified shower model. Instead, we keep some of the complications of currently used shower algorithms. Therefore, we set the scale to be µ 2 (t) = z(z − 1)t. This choice implicitly includes higher-order corrections in Eq. (4) [53].
We intend to sample over the now explicit z dependence and over the splitting channel a → bc (and over the different a in the parton cascade). This introduces adjustments to the Sudakov Veto Algorithm as it is described in 2.1, cf. [41]. For the reweighting, we use the integrands where z is not integrated out: and therefore we need to know (t, z) for every accepted or rejected splitting for the reweighting. In the following we will only consider reweighting as we vary the values of the strong coupling α S . Since everything else in the ratio K k /K cancels, the reweighting functions simplify: and we only need to know µ 2 for accepted emissions, and (µ 2 , P acc ) for rejected emissions.
To conclude the definition of our shower model, it remains to define the initial and cut-off conditions for the shower evolution. The starting scale t 0 would be usually determined by the high-energy event. However, we do not simulate that and instead randomly sample t 0 for each shower history from a Gaussian distribution with a mean value and standard deviation of 10 4 GeV 2 . The starting configuration is that of a single (final-state) quark line. The infra-red cut-off is chosen to be t IR = 1 GeV 2 .

Input and output data
As discussed at the end of Sect. 2.2 our goal is to replace the reweighting of single partonshower histories as described in Eqs. (5) with an a-posteriori reweighting of individual histogram bins after filling them with showered events. So, for a given variation of the parameters, δθ (k) , we need to predict the average over reweighting factors w k,i for histories of events i that fell into a given observable bin b. Let us designate this quantity as w k b . Then, in the limit of infinite statistics, we know that if with the nominal calculation we find that N b events fell into the bin b, then for the variation k we will find So if we use a neural net for this purpose, it is clear that w k b should be the value of our single output neuron, and we can train the neural net against this value (which can be obtained from an on-the-fly reweighting for the variation k, or from a separate re-run for this variation).
The input data is more ambiguous. It has to be a trade-off between precisely describing the variation and using as few data points as possible. Our input is a real vector of size N in obtained by sampling the acceptance function q k acc (µ 2 ) defined in Eq. (5b) at a set of discrete points µ 2 , where = 1, . . . , N in . We use a logarithmic distribution of µ 2 points for the sampling (i.e. with more points for lower scales), because the α S ratios we will be using change more quickly for lower scales. Note that q k acc also appears in the reweighting function for rejected events q k rej , cf. Eq. (5c), and therefore at least partly specifies how to vary rejected emissions.
For illustration purposes, we show visualisations of these input/output choices in Fig. 1.
The panel on the left shows the function q acc , the red lines correspond to the values of µ 2 at which we sample the function for the input vector. The shift in the observable in each bin is reported in the panel on the right, where the solid line shows the value of the observable for the nominal value of the parameters θ, and the dotted line shows the shift that is observed for a variation of such parameters. The NN output needs to reproduce the shift in the ratio of the nominal value and the variation value.

Neural-network architecture and training
For our neural-network implementation and training we we use the PyTorch python library [54]. The input data is fed into a layer that consists of N in = 60 neurons that are linear modules, i.e. they compute their output using a linear function y = mx + c from the input x, with the weight m and the bias c being trainable variables. As we have discussed in the previous section, the input x is given by one of N in function values of the q k acc function.
The input layer is fully connected to the next layer, which consists of 15 neurons that are rectifier linear units (ReLU). Given a value x from the input layer, they pass on the value y = max(0, x) to the output layer. It is this hidden layer that introduces a non-linearity to the network, which is surely required for the problem at hand, given Eqs. (5).
The ReLU layer is then fully connected to the output layer which is just a single neuron. It is a linear module like the ones in the input layer. Its output y gives the neural-network prediction for w k b , i.e. y = w k NN b . The training is done by minimizing the squared Euclidean distance between w k NN b and the training data w k b for a range of different variations k ∈ T , where T specifies the training data set. The loss function is defined as The optimisation step is performed using the Adam algorithm [55] as implemented in PyTorch, with a learning rate of 10 −4 . The learning passes are performed until either a preset optimisation target (Euclidean distance ≤ 10 −5 ) or a maximum number of passes (10 6 ) is reached. On a 2.8 GHz Core i7 processor this caps the CPU time for a training at approximately 30 s, although depending on the training data and on the random initialization, it can also take only a few seconds. If necessary, the training time could be further reduced by using the GPU acceleration supported by PyTorch.
Every 5000 optimisation steps it is checked if the loss function has reduced by at least 0.1 %. If this is not the case for five subsequent checks, the training is cancelled, assuming that a stable minimum has been found. If this minimum is however associated with a loss function value that is greater than 1 %, the whole training pass is discarded and repeated with a newly randomised set of NN weights. Enforcing this threshold ensures that we do not accept trainings that have become stuck in a non-global minimum.

Validation procedure and training data
In this section we compare neural-network predictions w k NN b with the true reweighting factor w k b for a range of different variations k ∈ T and observable bins b (which are defined below). In doing that we will always exclude the data point that we want to predict from the trainings.
First, we generate 10 6 parton-shower histories with the simplified model described in Sec. 3.1 and bin them one-by-one into a bin b for each observable. For each observable bin b and variation k, we calculate w k b . These values will be the output data sample for the training/comparison of our neural networks.
As the input data sample, we calculate N in = 60 function values for q k acc (µ 2 ) for each variation k ∈ T , distributed logarithmically between the lowest scale min(µ 2 ) = t IR /4 reachable by the shower and µ t 0 + 2σ t 0 , i.e. including large µ 2 values up two standard deviations σ t 0 from the mean of the starting scale distribution µ t 0 .
The general training procedure described in Sec. 3.3 is amended for the validation as follows. For predicting w k b for a given variation k and histogram bin b , we use all the input and output data for b and all k ∈ T except for k = k . Note that this means that we repeat the training for each k ∈ T (because we do not want to include the data for k in the validation). In an application on the other hand, only one neural network for each observable bin b would need to be trained to interpolate between the k.
Lastly, we repeat this N trainings = 10 times, each with randomly initialised neural net weights. Our neural net prediction for (k , b ) is then given by the mean of the outputs of these 10 neural nets given the values for q k acc as input. As an uncertainty, we give the standard deviation for these 10 values. Now let us define the training data set T . The variations used in the training should be diverse enough to allow the neural net to predict other variations accurately. We include two classes of variations of α S : where α S (µ 2 | α S (m 2 Z ) = a) is defined to be the strong coupling value at µ 2 given that the value at the Z-boson mass, α S (m 2 Z ), is set to a. For convenience we retrieve all α S values we use from nnpdf 3.1 sets [1] interfaced using lhapdf [56]. This also dictates our choice of a = α S (m Z ) variations, since we use the values available in that nnpdf release. The individual values of s and a used in the validation and the corresponding q acc functions are shown in Fig. 2.

Results
We discuss the results for the validation strategy laid down in this section so far for histogram bins of two observables: the scale t lead of the first emission in a partonshower history, and the number of accepted emissions N em in a history. Note that these observables are not physical (as we use no jet algorithm), but merely serve here to test our approach.

Leading emission scale
The first observable is the scale of the hardest emissions t lead , binned into a histogram of 8 logarithmically distributed bins between 1 and 3 · 10 4 GeV 2 . The t lead distribution for the nominal α S values and a subset of variations is shown in Fig. 3. The ratios between the NN result for the reweighting factor w k NN b and the true value w k b are shown in Fig. 4, for all variations k and observable bins b. For the interpretation of this figure, note that we actually display ratios of ratios, i.e. the deviation between reweighting factors, that are itself just ratios of the nominal and the varied cross section. To calculate how far we are off relative to the absolute value of the cross section, one would need to multiply the true reweighting factor with the ratio in Fig. 4.
It is clear from the plots that the NN results are within 1 % of the true reweighting factor, and the uncertainty of the prediction is ±3 % or less. Exceptions only appear for the s = 0.25 variation at small values of t lead . Note that s = 0.25 is an extremal variation, and hence in our validation procedure (where the to-be-predicted point is not included in the training data set) the NN needs to extrapolate when predicting this reweighting factor. The training data set should therefore go beyond the variations that are to be expected in an application, such that all NN predictions are guaranteed to be interpolations.
We have also tested how the uncertainties of the prediction behave when we initialise the NN weights with the same random numbers for each repetition of the training procedure. In that case, the spread becomes negligible compared to the results in Fig. 4. This means that the spread is not due to the numerical precision (floating point precision), but due to the random initialisation. Hence, in order to get even more precise predictions, the NN architecture and/or training procedure would have to be modified.

Number of emissions
We also test our approach with the number of emissions, in 8 bins between 0 and 7 emissions. As for the t lead , we show the N em histogram with all variations in Fig. 5, and the ratios for the reweighting factors given by the neural network and the training data in Fig. 6.
As for the t lead case we find that the neural network predictions are within 1 % of the true reweighting factor, with an uncertainty that only for some cases exceeds 3 %. The exceptions occur in the region for variations that enhance emission probabilities (small s and/or large a), in particular for bins with smaller N em values.

Further tests
In Fig. 7a (7b) we present the absolute reweighting factors for a low-and a high-t lead (N em ) bin. Here we also include simultaneous variations (i.e. both s = 1.0 and a = 0.118), the training data set is defined by all pairs (a, s) ∈ A ⊗ S with the α S (m 2 Z ) values A = {0.108, 0.110, 0.112, 0.116, 0.117, 0.118, 0.119, 0.120, 0.122, 0.124} and the scale factors S = {0.25, 0.5, 1.0, 1.5, 2.0, 4.0}. The ratio of the NN result over the true reweighting factor is shown as a projection below the absolute reweighting factors. To calculate this, we use the same validation method as before, i.e. the predicted reweighting factor is left out in the training of the NN that is used for this point. We find that the deviations are within 2 % for simultaneous variations, except for the variation a = 0.124, s = 0.25, for which emission probabilities are maximally enhanced and q acc is most non-linear. This findings are also true for the bins that are not shown in the figure.
Finally, Fig. 8 shows the behaviour of the neural-network prediction when reducing the number of neurons. For this study, we return to our original training data set T that does not include simultaneous variations. We compare our previous choice of N in = 60 with using N in = 40 and N in = 5. The hidden layer always has N in /4 neurons (which is rounded down to 1 for the N in = 5 case). Again, we only show two representative bins for both observables for the sake of brevity. Although N in = 40 only shows a minor degradation with respect to N in = 60 it features a substantially increased uncertainty for the low-N em bin and low values of the scale factor s. For N in = 5 we find significant deviations from unity, although even in this case only the low-N em bin features a deviation that is larger than 5 % (other than the extremal variations). These findings suggest that N in = 60 can probably be reduced while preserving enough precision and accuracy. However, we intend to use the validated architecture in an example more similar to potential applications in the following section. For that, we use a full parton-shower implementation, for which we foresee a stronger variation between the q acc (µ 2 ) values. Hence we keep N in = 60 as our baseline architecture.

A real-world example: varying Sherpa shower predictions for Thrust
We now study if the toy shower results from the previous sections can be transferred to a setup with a complete parton shower implementation and a real observable, namely the event shape observable Thrust for the process e + e − → 2 or more jets, simulated at a centre-of-mass energy of 91.2 GeV.
We generate Monte-Carlo events for this process using the Sherpa event generator [10] and its Catani-Seymour Shower implementation (css) [57]. Non-perturbative effects (such as fragmentation and multiple interactions) and electroweak corrections are disabled. The e + e − → qq matrix element is evaluated at leading-order in the couplings. The perturbative order of the running strong coupling is set to include up to two loops. The   shower starting scale is set to the Z mass, i.e. µ Q = m 2 Z = (91.28 GeV) 2 . The events are analysed using the Rivet analysis framework [58].
The set of variations used for the training is listed in Tab. 1. Instead of leaving out single training values as we did for the validation, we use the full training data set T and compare the predictions later with several variations that are not part of T . The N in = 60 function values for q acc used as the input data are written out from within the css coupling implementation. The output data is given by generating the Thrust distribution for each variation and then calculating the ratio to the central value (s = 1.0, a = 0.118) for each bin. Some of the bins for lower Thrust values have a sizeable Monte-Carlo error. To take this into account, we train 20 neural network replicas per bin, and for each training we generate a new w k b replica over k. For each replica we vary the w k b values according to their central value and uncertainty, assuming a Gaussian distribution.
In Fig. 9, we show the LO+parton-shower prediction for Thrust and a selection of variations for both the scale factor s and for a = α S (m 2 Z ), and compare it with data by the aleph collaboration [59]. All reweighting factors for a representative selection of bins are shown in Fig. 10, along with the prediction for the entire variation ranges by the NN. This prediction reproduces the reweighting factors that were used to train the networks (black) and also the factors that are shown as control points (red). The uncertainties of the prediction follow the Monte-Carlo errors of the training reweighting factors. We also train a second set of NN, where we omit for each training pass a random selection of 7 variations in the training (but keeping the most extremal variations). The resulting band (green) also reproduces the data.
Note that each NN corresponds to one row in Fig. 10 (i.e. to one observable bin), and therefore predicts the different functional forms for both the a and the s variations. The facts that these functions are non-linear and that their forms depend on the variation type and observable bin suggest that an ordinary fit with a fixed parametrisation is not suitable for the task.

Conclusions
Parton-shower calculations are currently not included in PDF fits, because of the CPU time needed to re-evaluate the parton shower event-by-event for new input parameters, The black ones are used to train a first set of neural networks (blue band). A random sample omitting 7 points for each variation type is used to train a second set of neural networks (green band). The most extreme variations are always kept in the training set. The uncertainties corresponds to the Monte-Carlo error of the training data as described in the text. and in particular as the PDFs are changing in the fitting process.
In this paper we suggested to use neural networks to encode the dependences of the cross section in a given observable bin on the parton-shower input parameters. We showed that the ansatz is working when applied to variations of the strong coupling (and its input scale) used in the shower splittings, both for a simplified shower model and for the full shower implementation in sherpa. The observables we tested are the leading emission scale, the number of emissions and the Thrust event shape.
This successful proof of principle makes us confident that it is worth exploring the method further, to study more observables and variation types but in particular to generalise it to take into account the PDF dependence of initial-state shower splittings. This will surely require a more advanced neural-net architecture or at least considerably more neurons and training data points, because the dependence of the splitting kernels on the PDFs is more complicated, and because varying PDF sets can not be done in just one dimension. It is nonetheless worthwhile trying to implement a fast reweighting procedure for these processes, in order to extend the range of data that can be used in PDF fits.