1 Introduction

Modern machine learning classifiers hold great promise for increasing the sensitivity of high energy physics data analyses [1,2,3,4,5,6,7,8]. Typically, a classifier is trained using simulated data and then the number of events passing a fixed threshold on the classifier output in data and in simulation is counted. A comparison between these counts is then used to estimate model parameters such as masses, couplings, and new physics cross sections. Theoretical and experimental uncertainties on the final result are accounted for by varying an aspect of the simulation and recomputing the predicted count using the nominal classifier. The uncertainties from the simulation model used for training affect the optimality of the classifier itself [9], but typically do not cause a bias and can be accounted for [10] by using parameterized classifiers [11, 12].

A variety of techniques have been proposed to render a classifier independent of a given feature [13,14,15,16,17,18,19,20,21,22,23,24]. This has become an essential tool for resonance searches, where thresholds on the classifier output must not sculpt bumps in a given spectrum so that the Standard Model background can be estimated using sideband fits. The same methodology has also been proposed to reduce the impact of systematic uncertainties on classifier-based inference [25,26,27,28]. If such a classifier does not depend on a particular nuisance parameter, then the count computed when the parameter is varied will be the same as the nominal value. This means that the uncertainty on the parameter(s) of interest will appear to be reduced.

In the case that the systematic uncertainty is decomposed into its most fundamental components, each with a clear statistical interpretation, the above would be the end of the story. The systematic uncertainty can be reduced through decorrelation and this would be useful if the classification performance does not rely strongly on the value of the nuisance parameters (otherwise, it may be better to profile instead [10]). However, theory uncertainties almost never satisfy these conditions. These uncertainties are the result of approximations when performing calculations and are also due to parameter freedom in phenomenological models that are needed when first-principles calculations are not possible. The canonical examples for these two types of uncertainties are perturbative uncertainties from series truncation and fragmentation modeling. For the former, calculations are truncated at a fixed order in perturbation theory and the result depends on unphysical scales. These scales are varied typically by factors of two in order to determine the uncertainty. Since the scales can be varied continuously, we refer to these as ‘continuous uncertainties’. Fragmentation modeling uncertainties are often evaluated by comparing two different models, such as the string model [29, 30] in the Pythia [31, 32] parton shower Monte Carlo (PSMC) and the cluster model [33, 34] in the Herwig [35, 36] PSMC. These variations are then interpreted as a one standard deviation uncertainty and combined with other sources of uncertainty in a final statistical analysis. Since only two variations of fragmentation modeling are usually available, we refer to these as ‘two-point uncertainties’.

Continuous and two-point variations are ad hoc techniques commonly used in the particle physics community to have some handle over these difficult-to-estimate uncertainties. Generating multiple simulations from different fragmentation models allows us to probe two points in an under-explored theory space of fragmentation models. The difference between these two models provides only a rough estimate of how different nature may be to either of them. Varying unphysical scales would not change the observed physics if the full calculations could be performed. The sensitivity of our simulations to these scale variations therefore provides a rough estimate of the uncertainty associated with truncating the calculations at lower order. While the numerical value of uncertainties coming from statistically interpretable origins is well trusted, the kind of theoretical uncertainties discussed above only provide a rough estimate. This is in contrast to experimental nuisance parameters (that give rise to experimental uncertainties), including the jet energy scale. Such nuisance parameters are constrained using calibration datasets. The statistical uncertainty of the control region becomes a systematic uncertainty for the experimental nuisance parameters. This justifies treating the corresponding nuisance parameters as (approximate) Gaussian random variables. A detailed discussion of the origin and validity of theory uncertainties is outside the scope of this paper.

We examine the interplay of decorrelation with theory uncertainties. In particular, we will show that constructing a classifier that is independent of a given theory nuisance parameter does not mean that the theory uncertainty is zero. Instead, it means that the only handle to determine the theory uncertainty is eliminated. Figure 1 illustrates the intuition behind why this might be the case. As concrete examples, we study fragmentation modeling in the context of classifying Lorentz-boosted W boson jets from QCD jets and factorization scale variations in the context of classifying t-channel single top quark events from W+jets events.

Fig. 1

An illustration of the potential impact of training a classifier to be decorrelated to two-point uncertainties. The distance between Pythia and Herwig is treated as the uncertainty. Left: Without decorrelation, the uncertainty covers nature even if nature does not lie on the line connecting Pythia and Herwig. Right: The distance between Pythia and Herwig is reduced due to the decorrelation requirement, resulting in a smaller estimate of the uncertainty, which no longer covers nature. These diagrams are meant only to be intuitive illustrations

This paper is organized as follows. Section 2 briefly introduces existing decorrelation techniques. Numerical examples of both two-point and continuous uncertainties are provided in Sect. 3. The paper ends with conclusions and outlook in Sect. 4.

2 Decorrelation techniques

Let \(x\in {\mathbb {R}}^n\) be the features used for classification. Suppose that there is a feature \(m\in {\mathbb {R}}\) that we want to be decorrelated from a classifier \(f(x):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\). One can achieve this decorrelation by minimizing the following loss functional L:

$$\begin{aligned} L[f(x)]&\,{=}\sum _{i\in S}L_{\text {class}}(f(x_i),1){+}\sum _{i\in B} w(m_i) L_{\text {class}}(f(x_i),0)\nonumber \\&\quad {+}\,\lambda \,\sum _{i\in B}L_{\text {decor}}(f(x_i),m_i), \end{aligned}$$

where S and B represent signal and background events, respectively. The loss \(L_{\text {class}}\) is the classifier loss and is often the binary cross entropy loss \(L_{\text {class}}(f(x),y) = -\left[ y\log (f(x))+(1-y)\log (1-f(x))\right] \). The function w(m) represents a weighting function and \(\lambda \) represents a hyperparameter that controls the strength of the decorrelation. Finally, \(L_{\text {decor}}\) is a term that penalizes any dependence between f and m. This last term in Eq. (2.1) is schematic as the decorrelation penalty often acts at the level of batches of events and not individual examples. Standard classification corresponds to \(w(m)=1\) and \(\lambda =0\). Decorrelation approaches include:

  • Planing [37, 38]: \(\lambda =0\) and \(w(m_i)\approx p_S(m)/p_B(m)\) so that the marginal distribution of m is non-discriminatory after the reweighting.

  • Adversaries [16, 25, 26, 28]: \(w(m)=1\), \(\lambda <0\), and \(L_{\text {decor}}\) is the loss of a second neural network (adversary) that takes f(x) as input and tries to learn some properties of m.

  • Distance Correlation (DisCo) [19, 23]: \(w(m)=1, \lambda >0\), and the last term in Eq. (2.1) is the distance correlation [39,40,41,42] between f(x) and m for the background.

  • Flatness [21]: \(w(m)=1\), \(\lambda >0\), and \(L_{\text {decor}}=\sum _m b_m \int |F_m(s)-F(s)|^2\, \mathrm{d}s\) where the sum runs over mass bins, \(b_m\) is the fraction of candidates in bin m, F is the cumulative distribution function, and \(s=f(x)\) is the classifier output. This is generalized to Moment Decorrelation (MoDE) in Ref. [24] to allow for a given dependence of f on m.

In the examples below, we focus on the adversarial case as it is the most explored in the literature. However, the same ideas apply to all decorrelation methods.
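To make the DisCo penalty in the list above more concrete, the following is a minimal sketch of the sample distance correlation between classifier outputs and the feature m. It is written in pure Python for clarity, ignores event weights and batching, and is not the implementation used for the results in this paper; the function names are illustrative assumptions.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(scores, masses):
    """Sample distance correlation of two equal-length 1D sequences."""
    if len(scores) != len(masses):
        raise ValueError("inputs must have the same length")
    n = len(scores)
    a = _centered_distances(scores)
    b = _centered_distances(masses)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j]
                   for i in range(n) for j in range(n)) / n ** 2

    dcov2 = mean_product(a, b)                       # squared distance covariance
    dvar2 = mean_product(a, a) * mean_product(b, b)  # product of squared distance variances
    if dvar2 <= 0.0:
        return 0.0  # one of the inputs is constant
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2))


# Illustrative inputs only: a linear dependence gives a value close to 1.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_masses = [2.0 * s + 1.0 for s in example_scores]
print(distance_correlation(example_scores, example_masses))
```

The statistic vanishes when one input is constant and approaches one for a linear dependence, so adding it to the loss with \(\lambda >0\) pushes f(x) towards independence from m on the background sample.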

3 Numerical examples

All neural networks are implemented using Keras [43] with the Tensorflow backend [44] and optimized with Adam [45].

3.1 Two-point uncertainty: fragmentation modeling

General purpose event generators use perturbation theory when they can and phenomenological models to describe non-perturbative effects such as hadronization. The standard procedure for estimating the uncertainty due to the model choice is to compare the predictions from two different models. This uncertainty is typically largest when the analysis strategy exploits subtle correlations in the high-dimensional radiation pattern. For example, tagging the origin of high \(p_T\) jets is a widely-studied scenario [46,47,48] for machine learning whereby the detailed jet substructure can be used for classification. In this section, we study Lorentz-boosted W boson tagging, where the signal is hadronically decaying, high \(p_T\) W bosons and the background is generic quark and gluon jets. A single large-radius jet is often sufficient to capture most of the W boson decay products and its two-prong substructure is distinct from typical quark and gluon jets.

Samples were generated with MadGraph5_aMC@NLO 2.7.3 [49] for modeling pp collisions at \(\sqrt{s}\) = 13 TeV. The NNPDF23_nlo_as_0118 [50] parton distribution function is used. The hard-scattering events are passed to Pythia 8.303 [32] to simulate the parton shower and hadronization, using the default settings. Herwig 7.2.2 [35] with angularly-ordered showers and Sherpa 2.2.2 [51, 52] with default settings are also used to model the parton shower and hadronization. The jets are clustered using Pyjet [53, 54] with the anti-\(k_t\) [55] algorithm and radius parameter \(R = 1.2\).

A set of high-level jet substructure features is used to distinguish W jets from QCD jets. These features are illustrated in Fig. 2 and briefly described in the following. The kinematics are probed with the jet mass and transverse momentum. Jet substructure observables include the N-subjettiness ratio \(\tau _{21}=\tau _2/\tau _1\) [56, 57], and the energy correlation function ratios \(D_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^3\) [58] and \(C_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^2\) [59], where \(e_i\) is the normalized sum over doublets (\(i=2\)) or triplets (\(i=3\)) of constituents inside jets, weighted by the product of the constituent transverse momenta and pairwise angular distances. For this analysis, we consider both \(\beta =1\) and \(\beta =2\).
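As an illustration of the energy correlation function ratios defined above, the sketch below computes \(C_2^{(\beta )}\) and \(D_2^{(\beta )}\) from a list of jet constituents. It assumes that each constituent is given as a (pT, eta, phi) tuple and that the sums are normalized by the scalar sum of constituent transverse momenta; both choices, and the function names, are assumptions made for this example rather than a description of the analysis code.

```python
import math
from itertools import combinations


def delta_r(a, b):
    """Angular distance in the (eta, phi) plane, with the phi difference wrapped."""
    dphi = (a[2] - b[2] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a[1] - b[1], dphi)


def ecf_ratios(constituents, beta=1.0):
    """Return (C2, D2) for a list of (pt, eta, phi) constituents."""
    pt_sum = sum(c[0] for c in constituents)
    # e2: sum over pairs of pT products times pairwise angular distance.
    e2 = sum(a[0] * b[0] * delta_r(a, b) ** beta
             for a, b in combinations(constituents, 2)) / pt_sum ** 2
    # e3: sum over triplets of pT products times the three pairwise distances.
    e3 = sum(a[0] * b[0] * c[0]
             * (delta_r(a, b) * delta_r(a, c) * delta_r(b, c)) ** beta
             for a, b, c in combinations(constituents, 3)) / pt_sum ** 3
    if e2 == 0.0:
        raise ValueError("e2 vanishes; the ratios are undefined")
    return e3 / e2 ** 2, e3 / e2 ** 3


# Toy three-constituent jet, for illustration only.
toy_jet = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
c2, d2 = ecf_ratios(toy_jet, beta=1.0)
print(c2, d2)
```

A jet with only two constituents has no triplets, so \(e_3=0\) and both ratios vanish, which is the limiting case of the low values expected for two-prong jets.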

As expected, the mass peaks near the W boson mass of 80 GeV [60] for the signal and has a broad distribution for the background. The signal peak is slightly higher than the W boson mass due to underlying event and other event contamination. This could be mitigated with grooming [61,62,63,64,65]. The jet \(p_T\) is not very discriminating by construction. The two-prong nature of the signal jets is quantified by low values of \(\tau _{21}, D_2\), and \(C_2\).

Fig. 2

The seven features used to train a classifier to distinguish boosted W boson jets from generic QCD jets

A classifier is trained using the seven features presented in Fig. 2 to distinguish W jets from QCD jets. The nominal classifier is trained using the Pythia simulation and is parameterized as a neural network with two hidden layers of 50 nodes each. Rectified Linear Unit (ReLU) activations are used for the intermediate layers and the final output is passed through a sigmoid function. The binary cross entropy loss is used for training with a batch size of 100 and for 20 epochs. About 1 million events are used for each generator, with 50% for training and 50% for testing. None of these parameters were optimized, although minor variations were found to have little impact on performance. The performance of this nominal classifier evaluated on Pythia, Herwig, and Sherpa is shown in Fig. 3. We focus on the region near 10–15% signal efficiency, which is a typical working point for LHC analyses. In this range, the background rejection (inverse QCD efficiency) is between a few hundred and a few thousand.
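The working point quoted above can be made concrete with a short sketch that computes the background rejection at a fixed signal efficiency from two lists of classifier outputs. The function name and the choice of threshold (the score of the signal event at the requested quantile) are assumptions for illustration only.

```python
def rejection_at_efficiency(signal_scores, background_scores, target_eff):
    """Inverse background efficiency at the threshold closest to target_eff."""
    ranked = sorted(signal_scores, reverse=True)
    # Number of signal events that should pass the threshold.
    n_pass = max(1, round(target_eff * len(ranked)))
    threshold = ranked[n_pass - 1]
    n_background = sum(1 for s in background_scores if s >= threshold)
    if n_background == 0:
        return float("inf")  # no background event passes in this sample
    return len(background_scores) / n_background


# Toy scores, for illustration only.
toy_signal = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_background = [0.85, 0.1, 0.2, 0.3]
print(rejection_at_efficiency(toy_signal, toy_background, 0.2))
```

With rejections in the hundreds to thousands, only a small number of background events pass the threshold, so the statistical precision of such an estimate is limited by the size of the background test sample.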

A second network is trained as part of an adversarial approach. This second network uses both Pythia and Herwig events and minimizes the following loss:

$$\begin{aligned} L[f,g]&=-\left( \sum _{i\in W} \log (f(x_i))+\sum _{i\in \text {QCD}}\log (1-f(x_i))\right) \nonumber \\&\quad + \lambda \left( \sum _{i\in \text {Pythia}} \log (g(f(x_i),y_i))\right. \nonumber \\&\quad \left. +\sum _{i\in \text {Herwig}}\log (1-g(f(x_i),y_i))\right) , \end{aligned}$$

where f is the classifier, g is the adversary, \(y_i=0\) for W jets and \(y_i=1\) for QCD jets. Furthermore, \(\lambda =10\). Note that unlike Eq. (2.1), Eq. (3.1) has the labels as part of the function for the adversary. This means that the labels for the classifier are given as an input feature to the adversary, which allows the adversary to potentially learn separate decision functions for W jets and QCD jets. The classifier network f has the same composition as the nominal classifier described above: two hidden layers with 50 nodes each. The adversary has five hidden layers with 50 nodes each. As W jets are more different from QCD jets than Pythia jets are from Herwig jets, the adversary has a more difficult task, which is why g has a more complex architecture. It was found that adding the label \(y_i\) to g as well as multiplying the gradient for the adversary by 10 improved performance and stability. The minimax nature of the optimization in Eq. (3.1) is implemented by connecting the adversary to the classifier via a gradient reversal layer [66] that multiplies the gradient by a fixed negative constant during backpropagation. The classifier network is then extracted after training for 20 epochs. When \(\lambda =0\), the performance was found to be the same as for the nominal case.
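The effect of a gradient reversal layer on the parameter updates can be written out explicitly. The following is a schematic sketch that assumes plain gradient descent with learning rate \(\eta \) (the networks here are optimized with Adam), classifier parameters \(\theta _f\), adversary parameters \(\theta _g\), a classification loss \(L_{\text {class}}\), an adversary loss \(L_{\text {adv}}\), and a reversal strength \(\kappa >0\); all of these symbols are introduced only for this illustration. The layer \(R_\kappa \) acts as the identity in the forward pass and rescales the gradient in the backward pass:

$$\begin{aligned} R_\kappa (z)&=z,\qquad \frac{\partial R_\kappa }{\partial z}\equiv -\kappa \,{\mathbb {1}},\nonumber \\ \theta _g&\leftarrow \theta _g-\eta \,\frac{\partial L_{\text {adv}}}{\partial \theta _g},\nonumber \\ \theta _f&\leftarrow \theta _f-\eta \left( \frac{\partial L_{\text {class}}}{\partial \theta _f}-\kappa \,\frac{\partial L_{\text {adv}}}{\partial \theta _f}\right) . \end{aligned}$$

The adversary therefore descends its own loss, while the classifier simultaneously descends the classification loss and ascends the adversary loss, which realizes the minimax optimization in a single backward pass.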

Figure 3 shows that the performance of the adversarially trained classifier is worse than the nominal case. This drop in performance is the cost for building a classifier that is insensitive to fragmentation model variations. The difference between Pythia and Herwig for the nominal classifier is about 40% at 10% W efficiency while it is only about 20% for the adversarially trained network. The reduced difference may give the impression that the adversarially trained classifier has successfully learnt to be less sensitive to fragmentation model variations. However, the difference between Sherpa and Pythia is nearly the same for the nominal and the adversarially trained classifier. This means that the ‘true’ uncertainty would be significantly underestimated if only Pythia and Herwig were available. It is often the case in an LHC analysis that only two fragmentation models are available. While the choice of Sherpa as the independent third generator is arbitrary, it is simply used in this study as a third point in the under-examined theory space of fragmentation modeling (Fig. 1), in order to demonstrate that the difference in performance of the classifier on an independent third point (whether another generator or nature) may not be well decorrelated. The result demonstrates the danger of training decorrelation methods on the same two generators that are then used to also estimate the theory uncertainty.

A curious reader may wonder why Sherpa does not lie within the range spanned between Pythia and Herwig even for the nominal classifier (also alluded to in the illustration in Fig. 1). The uncertainty derived from these two generators of course does not restrict all other generators to lie within it: the difference is treated as a one standard deviation uncertainty, not as the maximum possible deviation. This study however reveals that applying decorrelation techniques would dramatically reduce the estimate of the uncertainty without necessarily reducing the differences to other generators or to nature.
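The comparison can be summarized by a pull: the difference between Pythia and the hold-out generator divided by the two-point uncertainty from Pythia and Herwig, as shown in the bottom panel of Fig. 3. The sketch below evaluates this quantity for hypothetical rejection values chosen only to mimic the qualitative pattern described in the text; the numbers, the sign convention, and the function name are assumptions and are not taken from the figure.

```python
def pull(nominal, alternative, holdout):
    """(nominal - holdout) divided by the two-point uncertainty |nominal - alternative|."""
    uncertainty = abs(nominal - alternative)
    if uncertainty == 0.0:
        raise ValueError("two-point uncertainty is zero; the pull is undefined")
    return (nominal - holdout) / uncertainty


# Hypothetical QCD rejections at a fixed W efficiency (illustrative only).
nominal_pull = pull(nominal=1000.0, alternative=600.0, holdout=700.0)
adversarial_pull = pull(nominal=500.0, alternative=400.0, holdout=350.0)
print(nominal_pull, adversarial_pull)  # 0.75 1.5
```

In this toy example the relative difference to the hold-out generator is unchanged at 30%, but the pull doubles because the two-point uncertainty in the denominator has been halved by the decorrelation.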

Fig. 3

The QCD rejection (inverse QCD efficiency) as a function of the W jet efficiency for classifiers applied to Pythia, Herwig, and Sherpa jets. The solid lines correspond to the nominal classifier trained with Pythia while the dotted lines correspond to the adversarial setup that uses both Pythia and Herwig (Sherpa is a hold-out dataset). The bottom panel shows the pull, which is the difference between Pythia and Sherpa divided by the uncertainty defined by the difference between Pythia and Herwig. While adversarial training reduces the difference in performance between Pythia and Herwig, the difference to Sherpa remains large, indicating that the true uncertainty will be underestimated if a third independent sample is unavailable

Fig. 4

The 12 features used to train a classifier to distinguish single top events from W+jets events

Fig. 5

The impact of factorization scale variations by a factor of 1/2 and 2, in increments of 0.1 (lighter colors are lower scales). In each case, histograms of the given observable with a particular factorization scale are normalized to unity and divided by the normalized, nominal histograms from Fig. 4. The single top NLO/LO differences are shown in grey, with the band representing the statistical uncertainty

3.2 Continuous uncertainty: higher-order corrections

The uncertainty from truncating the order of a perturbative calculation is typically estimated by varying the unphysical scales. Usually, there are renormalization scale and factorization scale uncertainties. For simplicity, we focus here on the factorization scale, which dictates the separation between long- and short-distance physics. The standard procedure is to set the factorization scale to the typical momentum transfer in the problem.

To study the impact of factorization scale variations, we consider measurements of t-channel single top quark production. One of the main backgrounds for this process is W+jets production and machine learning is already used by ATLAS [67] and CMS [68] to enhance the signal. The semileptonic channel is studied as it has a much smaller background than the all-hadronic channel. The final state is characterized by an isolated lepton, missing transverse momentum, and jets.

Events are simulated using MadGraph5_aMC@NLO (MG5_aMC) 3.1.1 [49] interfaced with Pythia 8.244 [32] for the parton shower and Delphes 3.4.2 [69,70,71] for detector simulations with the default CMS card. Particle flow candidates are used as inputs to jet clustering, implemented using FastJet 3.2.1 [54, 72] and the anti-\(k_t\) algorithm [55] with radius parameter \(R=0.5\). For simplicity, W bosons are forced to decay into muons and events are required to have at least one isolated and identified muon using the default reconstruction algorithm in Delphes. Usually, one uses the highest precision method possible and then scale variations give the uncertainty from the finite truncation of the perturbative series. In order to compare with the ‘true’ uncertainty, we artificially truncate the series early and then use the higher-order calculation as the reference uncertainty. In particular, the nominal simulation is performed at leading order (LO) in the strong coupling constant and then an additional sample for the t-channel process is simulated at next-to-leading order (NLO).

For the machine learning, events are represented by 12 numbers: the three-momentum of the muon, the four-momentum of the leading two jets, and the scalar sum of the transverse momenta of all jets (\(H_T\)). Momenta are specified by \(p_T\), \(\eta \), and \(\phi \). Histograms for each of the observables for single top t-channel and W+jets are shown in Fig. 4. The jet \(p_T\) spectra are harder for single top compared with W+jets and the muons (jets) tend to be more central (forward) for single top compared with W+jets.

The impact of factorization scale variations is shown in Fig. 5. All variations are normalized to unity, as the impact on the total cross section is not relevant for per-event classification performance. As expected, the variation for all \(\phi \) observables is negligible and the biggest variation occurs for the transverse momenta.
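The normalization described above can be sketched as follows: each histogram is scaled to unit area before the bin-by-bin ratio to the nominal histogram is taken, so that any overall change in the cross section drops out. The function name and the toy bin contents are assumptions for illustration.

```python
def normalized_ratio(varied_counts, nominal_counts):
    """Bin-by-bin ratio of two histograms after normalizing each to unit area."""
    if len(varied_counts) != len(nominal_counts):
        raise ValueError("histograms must have the same binning")
    total_varied = sum(varied_counts)
    total_nominal = sum(nominal_counts)
    ratios = []
    for varied, nominal in zip(varied_counts, nominal_counts):
        if nominal > 0:
            ratios.append((varied / total_varied) / (nominal / total_nominal))
        else:
            ratios.append(float("nan"))  # empty nominal bin: ratio undefined
    return ratios


# A pure change in normalization gives a ratio of one in every bin.
print(normalized_ratio([20.0, 40.0, 40.0], [10.0, 20.0, 20.0]))
# A change in shape survives the normalization.
print(normalized_ratio([30.0, 40.0, 30.0], [10.0, 20.0, 20.0]))
```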

The default performance for a classifier trained to distinguish single top events from W+jets events is shown in the top plot of Fig. 6. The W+jets rejection at a single top efficiency of 10% is about 75, with about 15% lower rejection when the single top is simulated at NLO. Similarly to the fragmentation modeling, an adversarial network is also trained to reduce the sensitivity to factorization scale variations. Since the scale variation is now continuous, the adversary is trained using the mean squared error:

$$\begin{aligned}&L[f,g]=-\sum _\mu \left[ \left( \sum _{i\in \text {LO single top}} w_i(\mu )\,\log (f(x_i))\right. \right. \nonumber \\&\quad \left. +\sum _{i\in \text {LO } W\text {+jets}}w_i(\mu )\,\log (1-f(x_i))\right) \nonumber \\&\quad \left. +\, \lambda \,\sum _{i\in \text {LO single top}} w_i(\mu )\,(g(f(x_i),y_i)-\mu )^2\right] , \end{aligned}$$

where f is the classifier, g is the adversary, w are weights, and \(\mu \) is the relative factorization scale. For each event, we can vary the factorization scale through per-event weights \(w_i\) and we use values \(\mu \in \{0.5,0.6,\ldots ,1.9,2\}\) for each event. The adversarially trained classifier is therefore required to reduce the difference in its performance between samples coming from this entire range of scale variations. All hyperparameters are the same as for the fragmentation modeling example shown in the previous section. The performance of the adversarially trained classifier is shown in the bottom plot of Fig. 6. The overall performance is reduced by about a factor of two and the sensitivity to factorization scale variations is also reduced by a factor of two or more. While the narrower uncertainty bands may give the impression that the uncertainty has been reduced, in truth the difference between the LO and NLO curves is about the same or bigger than in the nominal case. This means that the ‘true’ uncertainty would be significantly underestimated using the adversarially trained approach.
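The role of the per-event weights can be illustrated with a short sketch that builds the grid of relative scales and evaluates a weighted selection efficiency for each scale. Taking the band as the minimum and maximum over the grid is an assumption made for this example, as are the function names and the toy weights; the analysis code may define the band differently.

```python
def scale_grid():
    """Relative factorization scales 0.5, 0.6, ..., 2.0 (16 values)."""
    return [round(0.5 + 0.1 * k, 1) for k in range(16)]


def weighted_efficiency(scores, weights, threshold):
    """Fraction of the total event weight with classifier output above threshold."""
    total = sum(weights)
    passed = sum(w for s, w in zip(scores, weights) if s >= threshold)
    return passed / total


def efficiency_envelope(scores, weights_by_scale, threshold):
    """Minimum and maximum weighted efficiency over all scale variations."""
    efficiencies = [weighted_efficiency(scores, weights, threshold)
                    for weights in weights_by_scale.values()]
    return min(efficiencies), max(efficiencies)


# Toy example with two events and three scale choices (illustrative weights).
toy_scores = [0.9, 0.2]
toy_weights = {0.5: [1.0, 1.0], 1.0: [1.0, 3.0], 2.0: [3.0, 1.0]}
print(scale_grid())
print(efficiency_envelope(toy_scores, toy_weights, threshold=0.5))
```

Because the same simulated events are reused for every scale and only the weights change, the envelope isolates the effect of the scale choice from statistical fluctuations between independent samples.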

A curious reader may again wonder why the NLO curve does not lie within the uncertainty band coming from scale variations even for the nominal classifier. As in the previous study, the uncertainty bands do not reflect the maximum possible uncertainty but should rather be interpreted as a probabilistic estimate. A study of whether these bands correctly estimate the frequency with which higher-order computations lie within them is left for future work. In addition, for this particular example the focus was only on factorization scale variations. This study reveals how decorrelation reduces only the estimate of the uncertainty from scale variations and this does not necessarily translate to actually reducing the difference to NLO.

Fig. 6

Top: W+jets rejection (inverse W+jets efficiency) as a function of t-channel single top efficiency for a nominal classifier. The blue band represents the uncertainty estimated by varying the factorization scale by \(\frac{1}{2}\) and 2 at LO. Bottom: the same as the top, but for the adversarially trained classifier. Adversarial training only reduces the difference in performance to factorization scale variations, not the difference to NLO, indicating that adversarial training provides a reduced estimate of the true uncertainty, which does not translate to a reduction in the true uncertainty

4 Conclusions and outlook

Decorrelation is a powerful tool for ensuring that machine learning classifiers can be used in practice to enhance analysis sensitivity. However, this tool must be used with caution. We have shown that decorrelation methods may result in significantly underestimated theory uncertainties when using standard approaches to theory uncertainty estimation. In the cases we explored, the estimated uncertainty uses two samples while the ‘true’ uncertainty relies on a third sample that is not part of the training. One could potentially incorporate the third sample into the decorrelation procedure, but there will always be another variation that is not part of the training as long as the full theory uncertainty decomposition is not known. Until we know the complete set of theory nuisance parameters, it seems prudent to not decorrelate away these uncertainties.

While this paper explicitly studied the case for decorrelation, this cautionary tale remains relevant for other uncertainty or inference aware machine learning approaches [9, 10, 20, 73,74,75,76,77,78,79,80,81,82] if they are being considered for such theory uncertainties.

5 Software and data

The software and samples for this paper can be found at https://github.com/hep-lbdl/TheoryUncertDecorrelation.