Abstract
A variety of techniques have been proposed to train machine learning classifiers that are independent of a given feature. While this can be an essential technique for enabling background estimation, it may also be useful for reducing uncertainties. We carefully examine theory uncertainties, which typically do not have a statistical origin. We provide explicit examples of two-point (fragmentation modeling) and continuous (higher-order corrections) uncertainties where decorrelating significantly reduces the apparent uncertainty while the true uncertainty is much larger. These results suggest that caution should be taken when using decorrelation for these types of uncertainties as long as we do not have a complete decomposition into statistically meaningful components.
1 Introduction
Modern machine learning classifiers hold great promise for increasing the sensitivity of high energy physics data analyses [1,2,3,4,5,6,7,8]. Typically, a classifier is trained using simulated data and then the number of events passing a fixed threshold on the classifier output in data and in simulation is counted. A comparison between these counts is then used to estimate model parameters such as masses, couplings, and new physics cross sections. Theoretical and experimental uncertainties on the final result are accounted for by varying an aspect of the simulation and recomputing the predicted count using the nominal classifier. The uncertainties from the simulation model used for training affect the optimality of the classifier itself [9], but typically do not cause a bias and can be accounted for [10] by using parameterized classifiers [11, 12].
A variety of techniques have been proposed to render a classifier independent of a given feature [13,14,15,16,17,18,19,20,21,22,23,24]. This has become an essential tool for resonance searches, where thresholds on the classifier output must not sculpt bumps in a given spectrum so that the Standard Model background can be estimated using sideband fits. The same methodology has also been proposed to reduce the impact of systematic uncertainties on classifier-based inference [25,26,27,28]. If such a classifier does not depend on a particular nuisance parameter, then the count computed when the parameter is varied will be the same as the nominal value. This means that the uncertainty on the parameter(s) of interest will appear to be reduced.
In the case that the systematic uncertainty is decomposed into its most fundamental components, each with a clear statistical interpretation, the above would be the end of the story. The systematic uncertainty can be reduced through decorrelation and this would be useful if the classification performance does not rely strongly on the value of the nuisance parameters (otherwise, it may be better to profile instead [10]). However, theory uncertainties almost never satisfy these conditions. These uncertainties are the result of approximations when performing calculations and are also due to parameter freedom in phenomenological models that are needed when first-principles calculations are not possible. The canonical examples for these two types of uncertainties are perturbative uncertainties from series truncation and fragmentation modeling. For the former, calculations are truncated at a fixed order in perturbation theory and the result depends on unphysical scales. These scales are typically varied by factors of two in order to determine the uncertainty. Since the scales can be varied continuously, we refer to these as ‘continuous uncertainties’. Fragmentation modeling uncertainties are often evaluated by comparing two different models, such as the string model [29, 30] in the Pythia [31, 32] parton shower Monte Carlo (PSMC) and the cluster model [33, 34] in the Herwig [35, 36] PSMC. These variations are then interpreted as a one standard deviation uncertainty and combined with other sources of uncertainty in a final statistical analysis. Since only two variations of fragmentation modeling are usually available, we refer to these as ‘two-point uncertainties’.
Continuous and two-point variations are ad hoc techniques commonly used in the particle physics community to have some handle over these difficult-to-estimate uncertainties. Generating multiple simulations from different fragmentation models allows us to probe two points in an underexplored theory space of fragmentation models. The difference between these two models provides only a rough estimate of how different nature may be from either of them. Varying unphysical scales would not change the observed physics if the full calculations could be performed. The sensitivity of our simulations to these scale variations therefore provides a rough estimate of the uncertainty associated with truncating the calculations at lower order. While the numerical value of uncertainties coming from statistically interpretable origins is well trusted, the kind of theoretical uncertainties discussed above only provide a rough estimate. This is in contrast to experimental nuisance parameters (that give rise to experimental uncertainties), including the jet energy scale. Such nuisance parameters are constrained using calibration datasets. The statistical uncertainty of the control region becomes a systematic uncertainty for the experimental nuisance parameters. This justifies treating the corresponding nuisance parameters as (approximate) Gaussian random variables. A detailed discussion of the origin and validity of theory uncertainties is outside the scope of this paper.
We examine the interplay of decorrelation with theory uncertainties. In particular, we will show that constructing a classifier that is independent of a given theory nuisance parameter does not mean that the theory uncertainty is zero. Instead, it means that the only handle to determine the theory uncertainty is eliminated. Figure 1 illustrates the intuition behind why this might be the case. As concrete examples, we study fragmentation modeling in the context of distinguishing Lorentz-boosted W boson jets from QCD jets and factorization scale variations in the context of distinguishing t-channel single top quark events from W+jets events.
This paper is organized as follows. Section 2 briefly introduces existing decorrelation techniques. Numerical examples of both two-point and continuous uncertainties are provided in Sect. 3. The paper ends with conclusions and outlook in Sect. 4.
2 Decorrelation techniques
Let \(x\in {\mathbb {R}}^n\) be the features used for classification. Suppose that there is a feature^{Footnote 1} \(m\in {\mathbb {R}}\) that we want to be decorrelated from a classifier \(f(x):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\). One can achieve this decorrelation by minimizing the following loss functional L:
\[ L[f] = \sum _{i\in S} L_{\text {class}}(f(x_i),1) + \sum _{i\in B} w(m_i)\, L_{\text {class}}(f(x_i),0) + \lambda \sum _{i\in B} L_{\text {decor}}(f(x_i),m_i) \tag{2.1} \]
where S and B represent signal and background events, respectively. The loss \(L_{\text {class}}\) is the classifier loss and is often the binary cross entropy loss \(L_{\text {class}}(f(x),y) = -y\log (f(x))-(1-y)\log (1-f(x))\). The function w(m) represents a weighting function and \(\lambda \) represents a hyperparameter that controls the strength of the decorrelation. Finally, \(L_{\text {decor}}\) is a term that penalizes any dependence between f and m. This last term in Eq. (2.1) is schematic as the decorrelation penalty often acts at the level of batches of events and not individual examples. Standard classification corresponds to \(w(m)=1\) and \(\lambda =0\). Decorrelation approaches include:

Planing [37, 38]: \(\lambda =0\) and \(w(m)\approx p_S(m)/p_B(m)\) so that the marginal distribution of m is non-discriminatory after the reweighting.

Adversaries [16, 25, 26, 28]: \(w(m)=1\), \(\lambda <0\), and \(L_{\text {decor}}\) is the loss of a second neural network (adversary) that takes f(x) as input and tries to learn some properties of m.

Distance Correlation (DisCo) [19, 23]: \(w(m)=1, \lambda >0\), and the last term in Eq. (2.1) is the distance correlation [39,40,41,42] between f(x) and m for the background.

Flatness [21]: \(w(m)=1\), \(\lambda >0\), and \(L_{\text {decor}}=\sum _m b_m \int |F_m(s)-F(s)|^2\, \mathrm{d}s\) where the sum runs over mass bins, \(b_m\) is the fraction of candidates in bin m, F is the cumulative distribution function, and \(s=f(x)\) is the classifier output. This is generalized to Moment Decomposition (MoDe) in Ref. [24] to allow for a given dependence of f on m.
In the examples below, we focus on the adversarial case as it is the most explored in the literature. However, the same ideas apply to all decorrelation methods.
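To make the DisCo penalty concrete, the following is a minimal NumPy sketch of the sample distance correlation between a classifier output and a feature m. It is an illustration under assumptions rather than the implementation used in Refs. [19, 23]: the function name `distance_correlation` and the toy arrays are hypothetical, the unweighted estimator is used, and in training the squared quantity would typically be evaluated per batch inside an automatic differentiation framework.

```python
import numpy as np


def distance_correlation(a, b):
    """Sample distance correlation between two equal-length 1D samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    def centered_distances(v):
        # Pairwise absolute differences, then double centering:
        # subtract row and column means and add back the grand mean.
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    A = centered_distances(a)
    B = centered_distances(b)
    dcov2 = (A * B).mean()  # squared distance covariance
    norm = np.sqrt((A * A).mean() * (B * B).mean())  # product of distance std. devs.
    if norm == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / norm))


# Toy inputs (hypothetical): a stand-in for f(x) on background events,
# one feature fully determined by it and one unrelated feature.
rng = np.random.default_rng(0)
scores = rng.uniform(size=300)
mass_dep = scores ** 2
mass_indep = rng.uniform(size=300)
print(distance_correlation(scores, mass_dep))
print(distance_correlation(scores, mass_indep))
```

Unlike the linear correlation coefficient, this quantity vanishes in the large-sample limit only when the two variables are independent, which is why it is suited as a decorrelation penalty.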
3 Numerical examples
All neural networks are implemented using Keras [43] with the Tensorflow backend [44] and optimized with Adam [45].
3.1 Two-point uncertainty: fragmentation modeling
General purpose event generators use perturbation theory when they can and phenomenological models to describe non-perturbative effects such as hadronization. The standard procedure for estimating the uncertainty due to the model choice is to compare the predictions from two different models. This uncertainty is typically largest when the analysis strategy exploits subtle correlations in the high-dimensional radiation pattern. For example, tagging the origin of high \(p_T\) jets is a widely studied scenario [46,47,48] for machine learning whereby the detailed jet substructure can be used for classification. In this section, we study Lorentz-boosted W boson tagging, where the signal is hadronically decaying, high \(p_T\) W bosons and the background is generic quark and gluon jets. A single large-radius jet is often sufficient to capture most of the W boson decay products and its two-prong substructure is distinct from typical quark and gluon jets.
Samples were generated with MadGraph5_aMC@NLO 2.7.3 [49] for modeling pp collisions at \(\sqrt{s}\) = 13 TeV. The NNPDF23_nlo_as_0118 [50] parton distribution function is used. The hard-scattering events are passed to Pythia 8.303 [32] to simulate the parton shower and hadronization, using the default settings. Herwig 7.2.2 [35] with angularly ordered showers and Sherpa 2.2.2 [51, 52] with default settings are also used to model the parton shower and hadronization.^{Footnote 2} Jets are clustered with Pyjet [53, 54] using the anti-\(k_t\) algorithm [55] with radius parameter \(R = 1.2\).
A set of high-level jet substructure features is used to distinguish W jets from QCD jets. These features are illustrated in Fig. 2 and briefly described in the following. The kinematics are probed with the jet mass and transverse momentum. Jet substructure observables include the N-subjettiness ratio \(\tau _{21}=\tau _2/\tau _1\) [56, 57], and the energy correlation function ratios \(D_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^3\) [58] and \(C_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^2\) [59], where \(e_i\) is the normalized sum over doublets (\(i=2\)) or triplets (\(i=3\)) of constituents inside jets, weighted by the product of the constituent transverse momenta and pairwise angular distances. For this analysis, we consider both \(\beta =1\) and \(\beta =2\).
As expected, the mass peaks near the W boson mass of 80 GeV [60] for the signal and has a broad distribution for the background. The signal peak is slightly higher than the W boson mass due to underlying event and other event contamination. This could be mitigated with grooming [61,62,63,64,65]. The jet \(p_T\) is not very discriminating by construction. The two-prong nature of the signal jets is quantified by low values of \(\tau _{21}, D_2\), and \(C_2\).
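The energy correlation function ratios can be illustrated with a short NumPy sketch. It assumes, as choices not stated in the text, that the momentum fractions are normalized to the scalar sum of constituent \(p_T\) and that the angular distance is the usual \(\Delta R\) in the \((\eta ,\phi )\) plane. The function name `ecf_ratios` is hypothetical and the explicit loops over pairs and triplets are for readability, not speed.

```python
import numpy as np
from itertools import combinations


def ecf_ratios(pt, eta, phi, beta=1.0):
    """Return e2, e3, C2 and D2 for one jet given its constituents."""
    pt, eta, phi = (np.asarray(v, dtype=float) for v in (pt, eta, phi))
    z = pt / pt.sum()  # assumed normalization: scalar sum of constituent pT
    # Pairwise angular distances, with the azimuthal difference wrapped to [-pi, pi).
    dphi = np.mod(phi[:, None] - phi[None, :] + np.pi, 2.0 * np.pi) - np.pi
    dr = np.hypot(eta[:, None] - eta[None, :], dphi) ** beta
    idx = range(len(pt))
    # e2: sum over doublets; e3: sum over triplets.
    e2 = sum(z[i] * z[j] * dr[i, j] for i, j in combinations(idx, 2))
    e3 = sum(
        z[i] * z[j] * z[k] * dr[i, j] * dr[i, k] * dr[j, k]
        for i, j, k in combinations(idx, 3)
    )
    if e2 == 0.0:
        # Fewer than two separated constituents: the ratios are undefined.
        return {"e2": 0.0, "e3": 0.0, "C2": float("nan"), "D2": float("nan")}
    return {"e2": e2, "e3": e3, "C2": e3 / e2**2, "D2": e3 / e2**3}


# Toy three-constituent jet (hypothetical values).
print(ecf_ratios([1.0, 1.0, 1.0], [0.0, 0.3, 0.0], [0.0, 0.0, 0.4]))
```

A jet with exactly two constituents has \(e_3=0\) and therefore \(D_2=C_2=0\), which is the idealized limit of the two-prong signal.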
A classifier is trained using the seven features presented in Fig. 2 to distinguish W jets from QCD jets. The nominal classifier is trained using the Pythia simulation and is parameterized as a neural network with two hidden layers of 50 nodes each. Rectified Linear Unit (ReLU) activations are used for the intermediate layers and the final output is passed through a sigmoid function. The binary cross entropy loss is used for training with a batch size of 100 and for 20 epochs. About 1 million events are used for each generator, with 50% for training and 50% for testing. None of these parameters were optimized, although minor variations were found to have little impact on performance. The performance of this nominal classifier evaluated on Pythia, Herwig, and Sherpa is shown in Fig. 3. We focus on the region near 10–15% signal efficiency, which is a typical working point for LHC analyses. In this range, the background rejection (inverse QCD efficiency) is between a few hundred and a few thousand.
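The figure of merit quoted here, background rejection at a fixed signal efficiency, can be computed from classifier scores as in the following NumPy sketch. The function names and the definition of the relative difference between two generators are assumptions made for illustration; the text does not specify how the quoted percentage differences are normalized.

```python
import numpy as np


def background_rejection(sig_scores, bkg_scores, sig_eff=0.10):
    """Inverse background efficiency at the threshold giving `sig_eff` for signal."""
    # Threshold that keeps the top `sig_eff` fraction of signal events.
    threshold = np.quantile(np.asarray(sig_scores, dtype=float), 1.0 - sig_eff)
    bkg_eff = np.mean(np.asarray(bkg_scores, dtype=float) > threshold)
    return np.inf if bkg_eff == 0.0 else 1.0 / bkg_eff


def relative_difference(rej_nominal, rej_alternative):
    """Assumed definition: |nominal - alternative| / nominal."""
    return abs(rej_nominal - rej_alternative) / rej_nominal
```

Evaluating `background_rejection` on the scores from each generator and passing two of the results to `relative_difference` gives one number per pair of generators, which is the kind of two-point comparison discussed in this section.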
A second network is trained as part of an adversarial approach. This second network uses both Pythia and Herwig events and minimizes the following loss:
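A schematic form of this loss, consistent with the definitions that follow, can be written as below. It is a sketch under assumptions: both terms are taken to be binary cross entropies, the sums run over the combined Pythia and Herwig sample, and \(z_i\) is a hypothetical generator label (\(z_i=0\) for Pythia and \(z_i=1\) for Herwig) introduced only for this illustration:

\[ L[f,g] = \sum _{i} L_{\text {class}}(f(x_i),y_i) - \lambda \sum _{i} L_{\text {class}}(g(f(x_i),y_i),z_i) \]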
where f is the classifier, g is the adversary, \(y_i=0\) for W jets and \(y_i=1\) for QCD jets. Furthermore, \(\lambda =10\). Note that unlike Eq. (2.1), Eq. (3.1) has the labels as part of the function for the adversary. This means that the labels for the classifier are given as an input feature to the adversary, which allows the adversary to potentially learn separate decision functions for W jets and QCD jets. The classifier network f has the same composition as the nominal classifier described above: two hidden layers with 50 nodes each. The adversary has five hidden layers with 50 nodes each. As W jets are more different from QCD jets than Pythia jets are from Herwig jets, the adversary has a more difficult task, which is why g has a more complex architecture. It was found that adding the label \(y_i\) to g as well as multiplying the gradient for the adversary by 10 improved performance and stability. The minimax nature of the optimization in Eq. (3.1) is implemented by connecting the adversary to the classifier via a gradient reversal layer [66] that multiplies the gradient by a fixed negative constant during backpropagation. The classifier network is then extracted after training for 20 epochs. When \(\lambda =0\), the performance was found to be the same as for the nominal case.^{Footnote 3}
Figure 3 shows that the performance of the adversarially trained classifier is worse than the nominal case. This drop in performance is the cost for building a classifier that is insensitive to fragmentation model variations. The difference between Pythia and Herwig for the nominal classifier is about 40% at 10% W efficiency while it is only about 20% for the adversarially trained network.^{Footnote 4} The reduced difference may give the impression that the adversarially trained classifier has successfully learnt to be less sensitive to fragmentation model variations. However, the difference between Sherpa and Pythia is nearly the same for the nominal and the adversarially trained classifier. This means that the ‘true’ uncertainty would be significantly underestimated if only Pythia and Herwig were available. It is often the case in an LHC analysis that only two fragmentation models are available. While the choice of Sherpa as the independent third generator is arbitrary, it is simply used in this study as a third point in the underexamined theory space of fragmentation modeling (Fig. 1), in order to demonstrate that the difference in performance of the classifier on an independent third point (whether another generator or nature) may not be well decorrelated. The result demonstrates the danger of training decorrelation methods on the same two generators that are then used to also estimate the theory uncertainty.
A curious reader may wonder why Sherpa does not lie within the range spanned between Pythia and Herwig even for the nominal classifier (also alluded to in the illustration in Fig. 1). Of course, the variation between these two generators does not restrict all other generators to lie within it; it is treated as a one standard deviation uncertainty^{Footnote 5}, not as the maximum possible deviation. This study, however, reveals that applying decorrelation techniques would dramatically reduce the estimate of the uncertainty without necessarily reducing the differences to other generators or to nature.
3.2 Continuous uncertainty: higher-order corrections
The uncertainty from truncating the order of a perturbative calculation is typically estimated by varying the unphysical scales. Usually, there are renormalization scale and factorization scale uncertainties. For simplicity, we focus here on the factorization scale, which dictates the separation between long- and short-distance physics. The standard procedure is to set the factorization scale to the typical momentum transfer in the problem.
To study the impact of factorization scale variations, we consider measurements of t-channel single top quark production. One of the main backgrounds for this process is W+jets production and machine learning is already used by ATLAS [67] and CMS [68] to enhance the signal. The semileptonic channel is studied as it has a much smaller background than the all-hadronic channel. The final state is characterized by an isolated lepton, missing transverse momentum, and jets.
Events are simulated using MadGraph5_aMC@NLO (MG5_aMC) 3.1.1 [49] interfaced with Pythia 8.244 [32] for the parton shower and Delphes 3.4.2 [69,70,71] for detector simulations with the default CMS card. Particle flow candidates are used as inputs to jet clustering, implemented using FastJet 3.2.1 [54, 72] and the anti-\(k_t\) algorithm [55] with radius parameter \(R=0.5\). For simplicity, W bosons are forced to decay into muons and events are required to have at least one isolated and identified muon using the default reconstruction algorithm in Delphes. Usually, one uses the highest precision method possible and then scale variations give the uncertainty from the finite truncation of the perturbative series. In order to compare with the ‘true’ uncertainty, we artificially truncate the series early and then use the higher-order calculation as the reference uncertainty. In particular, the nominal simulation is performed at leading order (LO) in the strong coupling constant and then an additional sample for the t-channel process is simulated at next-to-leading order (NLO).
For the machine learning, events are represented by 12 numbers: the three-momentum of the muon, the four-momenta of the two leading jets, and the scalar sum of the transverse momenta of all jets (\(H_T\)). Momenta are specified by \(p_T\), \(\eta \), and \(\phi \). Histograms for each of the observables for single top t-channel and W+jets are shown in Fig. 4. The jet \(p_T\) spectra are harder for single top compared with W+jets and the muons (jets) tend to be more central (forward) for single top compared with W+jets.
The impact of factorization scale variations is shown in Fig. 5. All variations are normalized to unity, as the impact on the total cross section is not relevant for per-event classification performance. As expected, the variation for all \(\phi \) observables is negligible and the biggest variation occurs for the transverse momenta.
The default performance for a classifier trained to distinguish single top events from W+jets events is shown in the top plot of Fig. 6. The W+jets rejection at a single top efficiency of 10% is about 75, with about 15% lower rejection when the single top is simulated at NLO. Similarly to the fragmentation modeling, an adversarial network is also trained to reduce the sensitivity to factorization scale variations. Since the scale variation is now continuous, the adversary is trained using the mean squared error:
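A schematic form of this loss, consistent with the definitions that follow, can be written as below. It is a sketch under assumptions: the classifier term is taken to be the weighted binary cross entropy, the sum over \(\mu \) runs over the sampled scale values, the sign convention for \(\lambda \) follows the fragmentation example, and passing the label \(y_i\) to the adversary is assumed by analogy with that example rather than stated in the text:

\[ L[f,g] = \sum _{\mu }\sum _{i} w_i(\mu )\, L_{\text {class}}(f(x_i),y_i) - \lambda \sum _{\mu }\sum _{i} w_i(\mu )\, \big (g(f(x_i),y_i)-\mu \big )^2 \]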
where f is the classifier, g is the adversary, w are weights, and \(\mu \) is the relative factorization scale. For each event, we can vary the factorization scale through per-event weights \(w_i\) and we use values \(\mu \in \{0.5, 0.6, \ldots , 1.9, 2\}\) for each event. The adversarially trained classifier is therefore required to reduce the difference in its performance between samples coming from this entire range of scale variations. All hyperparameters are the same as for the fragmentation modeling example shown in the previous section. The performance of the adversarially trained classifier is shown in the bottom plot of Fig. 6. The overall performance is reduced by about a factor of two and the sensitivity to factorization scale variations is also reduced by a factor of two or more. While the narrower uncertainty bands may give the impression that the uncertainty has been reduced, in truth the difference between the LO and NLO curves is about the same as or bigger than in the nominal case. This means that the ‘true’ uncertainty would be significantly underestimated using the adversarially trained approach.
A curious reader may again wonder why the NLO curve does not lie within the uncertainty band coming from scale variations even for the nominal classifier. As in the previous study, the uncertainty bands do not reflect the maximum possible uncertainty but should rather be interpreted as a probabilistic estimate. A study of whether these bands correctly estimate the frequency with which higher-order computations lie within them is left for future work. In addition, for this particular example the focus was only on factorization scale variations. This study reveals how decorrelation reduces only the estimate of the uncertainty from scale variations and this does not necessarily translate to actually reducing the difference to NLO.
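The way per-event weights turn into an uncertainty band can be sketched in NumPy as follows. This is an illustration under assumptions, not the analysis code: the function names, the dictionary layout keyed by \(\mu \), and the toy weights are hypothetical, and the band is taken to be the envelope of the selection efficiency over the sampled scale values, with each variation normalized to unity.

```python
import numpy as np


def weighted_efficiency(scores, weights, threshold):
    """Fraction of the weighted sample passing the threshold (normalized to unity)."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights[scores > threshold].sum() / weights.sum()


def scale_variation_band(scores, weights_by_mu, threshold):
    """Efficiency for every scale value and the (min, max) envelope over them."""
    effs = {
        mu: weighted_efficiency(scores, w, threshold)
        for mu, w in weights_by_mu.items()
    }
    return effs, (min(effs.values()), max(effs.values()))


# Scale grid from the text: 0.5, 0.6, ..., 2.0 (16 values).
mu_grid = np.round(np.linspace(0.5, 2.0, 16), 1)

# Toy example (hypothetical weights for three of the scale values).
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_weights = {
    0.5: [3.0, 3.0, 1.0, 1.0],
    1.0: [1.0, 1.0, 1.0, 1.0],
    2.0: [1.0, 1.0, 3.0, 3.0],
}
effs, band = scale_variation_band(toy_scores, toy_weights, threshold=0.5)
print(effs, band)
```

A classifier whose output is insensitive to \(\mu \) makes this envelope narrow by construction, which is exactly why a narrow band alone does not establish that the difference to the NLO prediction is small.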
4 Conclusions and outlook
Decorrelation is a powerful tool for ensuring that machine learning classifiers can be used in practice to enhance analysis sensitivity. However, this tool must be used with caution. We have shown that decorrelation methods may result in significantly underestimated theory uncertainties when using standard approaches to theory uncertainty estimation. In the cases we explored, the estimated uncertainty uses two samples while the ‘true’ uncertainty relies on a third sample that is not part of the training. One could potentially incorporate the third sample into the decorrelation procedure, but there will always be another variation that is not part of the training as long as the full theory uncertainty decomposition is not known. Until we know the complete set of theory nuisance parameters, it seems prudent to not decorrelate away these uncertainties.
While this paper explicitly studied the case for decorrelation, this cautionary tale remains relevant for other uncertainty- or inference-aware machine learning approaches [9, 10, 20, 73,74,75,76,77,78,79,80,81,82] if they are being considered for such theory uncertainties.
5 Software and data
The software and samples for this paper can be found at https://github.com/heplbdl/TheoryUncertDecorrelation.
Data Availability Statement
This manuscript has associated data in a data repository. [Authors’ comment: These data can be found in the Github repository linked above.]
Notes
This also applies to cases where m is multidimensional, but we restrict to the one-dimensional setting here for simplicity and because it is widely used.
While Herwig and Sherpa both use a cluster model for fragmentation, the actual Sherpa implementation is based on [34] and differs from Herwig in several respects.
Note that when \(\lambda =0\), the adversarial setup is slightly different than the nominal configuration because both Pythia and Herwig are used for training. This has little impact on the results – see Appendix A.
It is possible this could be reduced with further hyperparameter tuning. We found some parameters that made this smaller, but with significant variation across trainings. The configuration reported here was found to be robust to retraining.
To test the reliability of this uncertainty, one would need a large number of generators that span all possible ways of describing fragmentation modeling, and check how often they lie within the uncertainty bands; this is not possible in reality.
References
A.J. Larkoski, I. Moult, B. Nachman, Jet substructure at the large hadron collider: a review of recent advances in theory and machine learning. Phys. Rep. 841, 1–63 (2020). https://doi.org/10.1016/j.physrep.2019.11.001. arXiv:1709.04464
D. Guest, K. Cranmer, D. Whiteson, Deep learning and its application to LHC physics. Annu. Rev. Nucl. Part. Sci. 68, 161–181 (2018). https://doi.org/10.1146/annurevnucl101917021019. arXiv:1806.11484
K. Albertsson et al., Machine learning in high energy physics community white paper. arXiv:1807.02876
A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel et al., Machine learning at the energy and intensity frontiers of particle physics. Nature 560, 41–48 (2018). https://doi.org/10.1038/s4158601803612
G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby et al., Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002 (2019). https://doi.org/10.1103/RevModPhys.91.045002. arXiv:1903.10563
D. Bourilkov, Machine and deep learning applications in particle physics. Int. J. Mod. Phys. A 34, 1930019 (2020). https://doi.org/10.1142/S0217751X19300199. arXiv:1912.08245
M.D. Schwartz, Modern machine learning and particle physics. arXiv:2103.12226
M. Feickert, B. Nachman, A living review of machine learning for particle physics. arXiv:2102.02770
B. Nachman, A guide for deploying Deep Learning in LHC searches: how to achieve optimality and account for uncertainty. SciPost Phys. 8, 090 (2020). https://doi.org/10.21468/SciPostPhys.8.6.090. arXiv:1909.03081
A. Ghosh, B. Nachman, D. Whiteson, Uncertainty aware learning for high energy physics. Phys. Rev. D 104, 056026 (2021)
K. Cranmer, J. Pavez, G. Louppe, Approximating likelihood ratios with calibrated discriminative classifiers. arXiv:1506.02169
P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, D. Whiteson, Parameterized neural networks for high-energy physics. Eur. Phys. J. C 76, 235 (2016). https://doi.org/10.1140/epjc/s1005201640994. arXiv:1601.07913
J. Dolen, P. Harris, S. Marzani, S. Rappoccio, N. Tran, Thinking outside the ROCs: designing decorrelated taggers (DDT) for jet substructure. JHEP 05, 156 (2016). https://doi.org/10.1007/JHEP05(2016)156. arXiv:1603.00027
I. Moult, B. Nachman, D. Neill, Convolved substructure: analytically decorrelating jet substructure observables. JHEP 05, 002 (2018). https://doi.org/10.1007/JHEP05(2018)002. arXiv:1710.06859
J. Stevens, M. Williams, uBoost: a boosting method for producing uniform selection efficiencies from multivariate classifiers. JINST 8, P12013 (2013). https://doi.org/10.1088/17480221/8/12/P12013. arXiv:1305.7248
C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul et al., Decorrelated jet substructure tagging using adversarial neural networks. Phys. Rev. D 96, 074034 (2017). https://doi.org/10.1103/PhysRevD.96.074034. arXiv:1703.03507
L. Bradshaw, R.K. Mishra, A. Mitridate, B. Ostdiek, Mass agnostic jet taggers. SciPost Phys. 8, 011 (2020). https://doi.org/10.21468/SciPostPhys.8.1.011. arXiv:1908.08959
ATLAS Collaboration, Performance of mass-decorrelated jet substructure observables for hadronic two-body decay tagging in ATLAS. Technical Report. ATL-PHYS-PUB-2018-014, CERN, Geneva (2018)
G. Kasieczka, D. Shih, Robust jet classifiers through distance correlation. Phys. Rev. Lett. 125, 122001 (2020). https://doi.org/10.1103/PhysRevLett.125.122001. arXiv:2001.05310
L.G. Xia, QBDT, a new boosting decision tree method with systematical uncertainties into training for High Energy Physics. Nucl. Instrum. Methods A 930, 15–26 (2019). https://doi.org/10.1016/j.nima.2019.03.088. arXiv:1810.08387
A. Rogozhnikov, A. Bukva, V. Gligorov, A. Ustyuzhanin, M. Williams, New approaches for boosting to uniformity. JINST 10, T03002 (2015). https://doi.org/10.1088/17480221/10/03/T03002. arXiv:1410.4140
CMS Collaboration, A deep neural network to search for new long-lived particles decaying to jets. Mach. Learn. Sci. Technol. (2020). https://doi.org/10.1088/26322153/ab9023. arXiv:1912.12238
G. Kasieczka, B. Nachman, M.D. Schwartz, D. Shih, Automating the ABCD method with machine learning. Phys. Rev. D 103, 035021 (2021). https://doi.org/10.1103/PhysRevD.103.035021. arXiv:2007.14400
O. Kitouni, B. Nachman, C. Weisser, M. Williams, Enhancing searches for resonances with machine learning and moment decomposition. JHEP 04, 070 (2021). https://doi.org/10.1007/JHEP04(2021)070. arXiv:2010.09745
G. Louppe, M. Kagan, K. Cranmer, Learning to pivot with adversarial networks. Adv. Neural Inf. Process. Syst. 30, 981 (2017). arXiv:1611.01046
C. Englert, P. Galler, P. Harris, M. Spannowsky, Machine learning uncertainties with adversarial neural networks. Eur. Phys. J. C 79, 4 (2019). https://doi.org/10.1140/epjc/s1005201865118. arXiv:1807.08763
S. Wunsch, S. Jörger, R. Wolf, G. Quast, Reducing the dependence of the neural network function to systematic uncertainties in the input space. Comput. Softw. Big Sci. 4, 5 (2020). https://doi.org/10.1007/s41781020000379. arXiv:1907.11674
J.M. Clavijo, P. Glaysher, J.M. Katzy, Adversarial domain adaptation to reduce sample bias of a high energy physics classifier. Mach. Learn. Sci. Tech. 3(1), 015014 (2022)
B. Andersson, G. Gustafson, G. Ingelman, T. Sjostrand, Parton fragmentation and string dynamics. Phys. Rep. 97, 31–145 (1983). https://doi.org/10.1016/03701573(83)900807
T. Sjostrand, Jet fragmentation of nearby partons. Nucl. Phys. B 248, 469–502 (1984). https://doi.org/10.1016/05503213(84)906072
T. Sjostrand, S. Mrenna, P.Z. Skands, PYTHIA 6.4 physics and manual. JHEP 05, 026 (2006). https://doi.org/10.1088/11266708/2006/05/026. arXiv:hepph/0603175
T. Sjostrand, S. Mrenna, P.Z. Skands, A brief introduction to PYTHIA 8.1. Comput. Phys. Commun. 178, 852–867 (2008). https://doi.org/10.1016/j.cpc.2008.01.036. arXiv:0710.3820
B.R. Webber, A QCD model for jet fragmentation including soft gluon interference. Nucl. Phys. B 238, 492–528 (1984). https://doi.org/10.1016/05503213(84)90333X
J.C. Winter, F. Krauss, G. Soff, A modified cluster hadronization model. Eur. Phys. J. C 36, 381–395 (2004). https://doi.org/10.1140/epjc/s2004019608. arXiv:hepph/0311085
J. Bellm et al., Herwig 7.0/Herwig++ 3.0 release note. Eur. Phys. J. C 76, 196 (2016). https://doi.org/10.1140/epjc/s1005201640188. arXiv:1512.01178
M. Bahr et al., Herwig++ physics and manual. Eur. Phys. J. C 58, 639–707 (2008). https://doi.org/10.1140/epjc/s1005200807989. arXiv:0803.0883
S. Chang, T. Cohen, B. Ostdiek, What is the machine learning? Phys. Rev. D 97, 056009 (2018). https://doi.org/10.1103/PhysRevD.97.056009. arXiv:1709.10106
L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, A. Schwartzman, Jetimages—deep learning edition. JHEP 07, 069 (2016). https://doi.org/10.1007/JHEP07(2016)069. arXiv:1511.05190
G.J. Székely, M.L. Rizzo, N.K. Bakirov, Measuring and testing dependence by correlation of distances. Ann. Stat. 35, 2769–2794 (2007). https://doi.org/10.1214/009053607000000505
G.J. Székely, M.L. Rizzo, Brownian distance covariance. Ann. Appl. Stat. 3, 1236–1265 (2009). https://doi.org/10.1214/09AOAS312
G.J. Székely, M.L. Rizzo, The distance correlation t-test of independence in high dimension. J. Multivar. Anal. 117, 193–213 (2013). https://doi.org/10.1016/j.jmva.2013.02.012
G.J. Székely, M.L. Rizzo, Partial distance correlation with methods for dissimilarities. Ann. Stat. 42, 2382–2412 (2014). https://doi.org/10.1214/14AOS1255
F. Chollet, Keras (2017). https://github.com/fchollet/keras
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean et al., Tensorflow: a system for large-scale machine learning, in OSDI, vol. 16 (2016), pp. 265–283
D. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv:1412.6980
ATLAS Collaboration, M. Aaboud et al., Performance of top-quark and \(W\)-boson tagging with ATLAS in Run 2 of the LHC. Eur. Phys. J. C 79, 375 (2019). https://doi.org/10.1140/epjc/s1005201968478. arXiv:1808.07858
CMS Collaboration, A.M. Sirunyan et al., Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques. JINST 15, P06005 (2020). https://doi.org/10.1088/17480221/15/06/P06005. arXiv:2004.08262
A. Butter et al., The machine learning landscape of top taggers. SciPost Phys. 7, 014 (2019). https://doi.org/10.21468/SciPostPhys.7.1.014. arXiv:1902.09914
J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer et al., The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP 07, 079 (2014). https://doi.org/10.1007/JHEP07(2014)079. arXiv:1405.0301
R.D. Ball et al., Parton distributions with LHC data. Nucl. Phys. B 867, 244–289 (2013). https://doi.org/10.1016/j.nuclphysb.2012.10.003. arXiv:1207.1303
T. Gleisberg, S. Hoeche, F. Krauss, M. Schonherr, S. Schumann, F. Siegert et al., Event generation with SHERPA 1.1. JHEP 02, 007 (2009). https://doi.org/10.1088/11266708/2009/02/007. arXiv:0811.4622
Sherpa Collaboration, E. Bothmann et al., Event generation with Sherpa 2.2. SciPost Phys. 7, 034 (2019). https://doi.org/10.21468/SciPostPhys.7.3.034. arXiv:1905.09127
N. Dawe, E. Rodrigues, H. Schreiner, B. Ostdiek, D. Kalinkin, M.R. et al., scikit-hep/pyjet: version 1.8.2. Jan. (2021). https://doi.org/10.5281/zenodo.4446849
M. Cacciari, G.P. Salam, G. Soyez, FastJet user manual. Eur. Phys. J. C 72, 1896 (2012). https://doi.org/10.1140/epjc/s10052-012-1896-2. arXiv:1111.6097
M. Cacciari, G.P. Salam, G. Soyez, The anti-\(k_t\) jet clustering algorithm. JHEP 04, 063 (2008). https://doi.org/10.1088/1126-6708/2008/04/063. arXiv:0802.1189
J. Thaler, K. Van Tilburg, Identifying boosted objects with N-subjettiness. JHEP 03, 015 (2011). https://doi.org/10.1007/JHEP03(2011)015. arXiv:1011.2268
J. Thaler, K. Van Tilburg, Maximizing boosted top identification by minimizing N-subjettiness. JHEP 02, 093 (2012). https://doi.org/10.1007/JHEP02(2012)093. arXiv:1108.2701
A.J. Larkoski, I. Moult, D. Neill, Power counting to better jet observables. JHEP 12, 009 (2014). https://doi.org/10.1007/JHEP12(2014)009. arXiv:1409.6298
A.J. Larkoski, G.P. Salam, J. Thaler, Energy correlation functions for jet substructure. JHEP 06, 108 (2013). https://doi.org/10.1007/JHEP06(2013)108. arXiv:1305.0007
Particle Data Group, Review of particle physics. Prog. Theor. Exp. Phys. 2020, 08 (2020). https://doi.org/10.1093/ptep/ptaa104
J.M. Butterworth, A.R. Davison, M. Rubin, G.P. Salam, Jet substructure as a new Higgs search channel at the LHC. Phys. Rev. Lett. 100, 242001 (2008). https://doi.org/10.1103/PhysRevLett.100.242001. arXiv:0802.2470
S.D. Ellis, C.K. Vermilion, J.R. Walsh, Recombination algorithms and jet substructure: pruning as a tool for heavy particle searches. Phys. Rev. D 81, 094023 (2010). https://doi.org/10.1103/PhysRevD.81.094023. arXiv:0912.0033
D. Krohn, J. Thaler, L.T. Wang, Jet trimming. JHEP 02, 084 (2010). https://doi.org/10.1007/JHEP02(2010)084. arXiv:0912.1342
M. Dasgupta, A. Fregoso, S. Marzani, G.P. Salam, Towards an understanding of jet substructure. JHEP 09, 029 (2013). https://doi.org/10.1007/JHEP09(2013)029. arXiv:1307.0007
A.J. Larkoski, S. Marzani, G. Soyez, J. Thaler, Soft drop. JHEP 05, 146 (2014). https://doi.org/10.1007/JHEP05(2014)146. arXiv:1402.2657
Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation. Proc. Mach. Learn. Res. 37, 1180–1189 (2015)
ATLAS Collaboration, M. Aaboud et al., Measurement of the inclusive cross-sections of single top-quark and top-antiquark \(t\)-channel production in \(pp\) collisions at \(\sqrt{s}\) = 13 TeV with the ATLAS detector. JHEP 04, 086 (2017). https://doi.org/10.1007/JHEP04(2017)086. arXiv:1609.03920
CMS Collaboration, A.M. Sirunyan et al., Measurement of differential cross sections and charge ratios for t-channel single top quark production in proton–proton collisions at \(\sqrt{s}=13\,\text{TeV}\). Eur. Phys. J. C 80, 370 (2020). https://doi.org/10.1140/epjc/s10052-020-7858-1. arXiv:1907.08330
DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaitre, A. Mertens et al., DELPHES 3, a modular framework for fast simulation of a generic collider experiment. JHEP 02, 057 (2014). https://doi.org/10.1007/JHEP02(2014)057. arXiv:1307.6346
A. Mertens, New features in Delphes 3. J. Phys. Conf. Ser. 608, 012045 (2015). https://doi.org/10.1088/1742-6596/608/1/012045
M. Selvaggi, DELPHES 3: a modular framework for fast-simulation of generic collider experiments. J. Phys. Conf. Ser. 523, 012033 (2014). https://doi.org/10.1088/1742-6596/523/1/012033
M. Cacciari, G.P. Salam, Dispelling the \(N^{3}\) myth for the \(k_t\) jet-finder. Phys. Lett. B 641, 57 (2006). https://doi.org/10.1016/j.physletb.2006.08.037. arXiv:hep-ph/0512210
S. Wunsch, S. Jörger, R. Wolf, G. Quast, Optimal statistical inference in the presence of systematic uncertainties using neural network optimization based on binned Poisson likelihoods with nuisance parameters. Comput. Softw. Big Sci. 5, 4 (2021). https://doi.org/10.1007/s41781-020-00049-5. arXiv:2003.07186
A. Elwood, D. Krücker, M. Shchedrolosiev, Direct optimization of the discovery significance in machine learning for new physics searches in particle colliders. J. Phys. Conf. Ser. 1525, 012110 (2020). https://doi.org/10.1088/1742-6596/1525/1/012110
P. De Castro, T. Dorigo, INFERNO: inference-aware neural optimisation. Comput. Phys. Commun. 244, 170–179 (2019). https://doi.org/10.1016/j.cpc.2019.06.007. arXiv:1806.04743
T. Charnock, G. Lavaux, B.D. Wandelt, Automatic physical inference with information maximizing neural networks. Phys. Rev. D 97 (2018). https://doi.org/10.1103/physrevd.97.083004
J. Alsing, B. Wandelt, Nuisance hardened data compression for fast likelihood-free inference. Mon. Not. R. Astron. Soc. 488, 5093–5103 (2019). https://doi.org/10.1093/mnras/stz1900. arXiv:1903.01473
L. Heinrich, N. Simpson, pyhf/neos: initial zenodo release (2020). https://doi.org/10.5281/zenodo.3697981
J. Brehmer, F. Kling, I. Espejo, K. Cranmer, MadMiner: machine learning-based inference for particle physics. Comput. Softw. Big Sci. 4, 3 (2020). https://doi.org/10.1007/s41781-020-0035-2. arXiv:1907.10621
J. Brehmer, G. Louppe, J. Pavez, K. Cranmer, Mining gold from implicit models to improve likelihood-free inference. Proc. Natl. Acad. Sci. 201915980 (2020). https://doi.org/10.1073/pnas.1915980117. arXiv:1805.12244
J. Brehmer, K. Cranmer, G. Louppe, J. Pavez, Constraining effective field theories with machine learning. Phys. Rev. Lett. 121, 111801 (2018). arXiv:1805.00013
J. Brehmer, K. Cranmer, G. Louppe, J. Pavez, A guide to constraining effective field theories with machine learning. Phys. Rev. D 98, 052004 (2018). https://doi.org/10.1103/PhysRevD.98.052004. arXiv:1805.00020
Acknowledgements
We are grateful to Yi-Lun Chung for producing the fragmentation variation samples used in Sect. 3.1. We thank Kingman Cheung, Shih-Chieh Hsu, Tilman Plehn, David Rousseau, David Shih, Michael Spannowsky, and Daniel Whiteson for useful discussions and comments on the manuscript. BN was supported by the Department of Energy, Office of Science under contract number DE-AC02-05CH11231. AG was supported by the U.S. Department of Energy (DOE), Office of Science under Grant no. DE-SC0009920.
Appendix A: Training with \(\lambda =0\)
Figures 7 and 8 show the impact of using the adversarial setup with \(\lambda =0\), i.e. with the adversary turned off. The only difference with respect to the nominal configuration is the training sample: for the two-point uncertainty example, both Pythia and Herwig are used instead of only Pythia; for the continuous uncertainty example, the factorization scale variations are used instead of only \(\mu =1\).
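To illustrate what turning the adversary off means, the following minimal sketch assumes a commonly used form of the adversarial objective, \(L = L_\text{clf} - \lambda L_\text{adv}\), with binary cross-entropy for both terms. This form, the function names, and the toy numbers are assumptions made for illustration and are not taken from the training code used for the results here. With \(\lambda =0\) the combined objective reduces to the classifier loss alone.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))


def combined_loss(y_true, clf_out, z_true, adv_out, lam):
    """Assumed objective: classifier loss minus lam times adversary loss."""
    loss_clf = binary_cross_entropy(y_true, clf_out)
    loss_adv = binary_cross_entropy(z_true, adv_out)
    return loss_clf - lam * loss_adv


# Toy, made-up values (hypothetical, for illustration only).
y_true = np.array([1.0, 0.0, 1.0, 0.0])    # signal vs. background label
clf_out = np.array([0.9, 0.2, 0.7, 0.4])   # classifier output
z_true = np.array([0.0, 1.0, 1.0, 0.0])    # nuisance label, e.g. which generator
adv_out = np.array([0.5, 0.6, 0.4, 0.5])   # adversary's guess of the nuisance label

loss_clf_only = binary_cross_entropy(y_true, clf_out)
loss_lambda0 = combined_loss(y_true, clf_out, z_true, adv_out, lam=0.0)
loss_lambda1 = combined_loss(y_true, clf_out, z_true, adv_out, lam=1.0)

print(loss_clf_only, loss_lambda0, loss_lambda1)
```

In this sketch the adversary term only influences the objective when `lam` is non-zero, so a \(\lambda =0\) training differs from a plain classifier training only through the samples it is trained on.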
About this article
Cite this article
Ghosh, A., Nachman, B. A cautionary tale of decorrelating theory uncertainties. Eur. Phys. J. C 82, 46 (2022). https://doi.org/10.1140/epjc/s10052-022-10012-w