1 Introduction

Deep learning is becoming widely used for various classification tasks in collider physics (see e.g., Refs. [1,2,3,4,5,6]). One of the core benefits of deep learning over traditional analysis techniques is its ability to identify patterns in very high-dimensional feature spaces. At the Large Hadron Collider (LHC), such low-level inputs are dominated by hadronic activity. Most machine learning approaches are trained using Parton Shower Monte Carlo (PSMC) simulations that produce exclusive final states with the same complexity as real data [7]. However, there are significant variations between PSMCs due to the large number of perturbative and non-perturbative modeling assumptions.

These variations lead to potential biases and suboptimal sensitivity in data analyses [8]. A bias occurs when the simulation model used for inference (given an analysis strategy) is not the same as nature. There is a large and growing literature on methods to reduce biases from PSMC model variations through decorrelation [9,10,11,12] and other approaches [13,14,15]. A key challenge with modeling uncertainties, in contrast to experimental uncertainties, is that they are often estimated by comparing two simulations. This difference does not have a statistical origin and may not be the full uncertainty, so caution is required when reducing the uncertainty through automated approaches [16]. A general solution to estimating (and then reducing) systematic uncertainties from PSMC variations is still an active area of research and development.

In principle, the same challenge exists when quantifying suboptimal performance due to PSMC variations. Suboptimal performance occurs when the simulation model used for training a machine learning model is different from nature. While not directly a source of systematic uncertainty, this suboptimality has important consequences for the physics program of the LHC. To quantify the suboptimality, one could compare different PSMC models, as is done to determine the systematic uncertainty. This has the same unsatisfying properties described above.

However, there have been a number of hints in the literature that the suboptimality due to PSMC variations may actually be small. For example, Ref. [19] observed that training a quark versus gluon jet classifier with the Herwig [20] PSMC and then applying it to jets simulated with the Pythia [21] PSMC has nearly the same performance as training with Pythia and also testing on Pythia (with a statistically identical, but independent dataset). This small difference in performance is contrasted to the large difference in performance when testing on jets from Herwig. A similar result was observed in the context of signal jets in Ref. [22]. From these observations, we conjecture that the deep learning models are learning universal properties of quantum chromodynamics (QCD). We hypothesize that the performance gaps present when the test sets differ simply reflect variations in the amount of QCD radiation, but not the type of information that is useful for discrimination.

To build intuition for this conjecture, consider the case of quark versus gluon jet tagging. At leading logarithmic (LL) order and considering only infrared and collinear safe observables, the optimal classifier is simply iterated-soft-drop multiplicity inside the jet [23]. This statement is true independent of the strong coupling constant, \(\alpha _s\). However, common metrics of performance such as the Area Under the Curve (AUC) depend on \(\alpha _s\); when there are more emissions (higher \(\alpha _s\)), the quark and gluon perturbative multiplicity distributions are more separable. In particular, at LL, perturbative multiplicity is a Poisson random variable with a mean that is proportional to a color factor multiplied by \(\alpha _s\). As \(\alpha _s\) grows, the gluon distribution grows significantly faster than the quark one:

$$\begin{aligned} \frac{\mu _g-\mu _q}{\sqrt{\sigma _g^2+\sigma _q^2}}&\sim \frac{\alpha _s(C_A-C_F)}{\sqrt{\alpha _s C_F+\alpha _s C_A}}\propto \sqrt{\alpha _s}\,, \end{aligned}$$
(1.1)

where \(C_F=4/3\) (\(C_A=3\)) is the quark (gluon) color factor. Imagine two PSMCs with the same physics approximations but different values of \(\alpha _s\). They would learn the same classifier, and thus the performance would be the same whenever the test set is the same.
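The scaling in Eq. (1.1) can be checked numerically. Below is a minimal sketch (not from the paper's code) in which quark and gluon multiplicities are Poisson with means proportional to the color factor times \(\alpha _s\); the normalization `c` is an arbitrary assumption:

```python
import math

# LL toy model: perturbative multiplicity is Poisson with a mean
# proportional to (color factor) x (alpha_s). The constant `c` is an
# arbitrary normalization, not a fitted value.
CF, CA = 4.0 / 3.0, 3.0

def separation(alpha_s, c=10.0):
    """(mu_g - mu_q) / sqrt(sigma_g^2 + sigma_q^2) for Poisson counts."""
    mu_q, mu_g = c * CF * alpha_s, c * CA * alpha_s
    return (mu_g - mu_q) / math.sqrt(mu_g + mu_q)  # Poisson: sigma^2 = mu

# Doubling alpha_s scales the separation by sqrt(2), independent of c:
r = separation(0.2) / separation(0.1)
```

The ratio `r` is exactly \(\sqrt{2}\), illustrating the \(\propto \sqrt{\alpha _s}\) behavior: the optimal classifier (multiplicity) is unchanged, but the achievable separation depends on \(\alpha _s\).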

Our goal is to test the universality hypothesis in detail using the important benchmark problem of Lorentz boosted Higgs boson jet versus QCD jet tagging. In this context, universality means that the learned classifiers are the same up to a monotonic re-scaling, which means that they result in the same decision boundaries. We consider both shallow and deep learning models as well as a variety of PSMC models.
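The notion of equivalence up to a monotonic re-scaling can be made concrete: two scores define the same decision boundaries exactly when they rank jets identically, which a rank correlation detects. A minimal sketch with synthetic scores (the function and the illustrative re-scaling are ours, not the paper's stated procedure):

```python
import numpy as np

# Two classifiers are equivalent up to a monotonic re-scaling iff their
# scores induce identical rankings, i.e. a Spearman-style rank
# correlation of exactly 1. The scores below are synthetic.
def rank_corr(a, b):
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
f = rng.random(1000)               # scores of a classifier trained on PSMC A
g = 1.0 / (1.0 + np.exp(-5 * f))   # a monotonic re-scaling of f
```

Since `g` is a strictly increasing function of `f`, the two rankings coincide and the rank correlation is 1: both scores yield identical decision boundaries at matched efficiencies.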

This paper is organized as follows. A concrete example is introduced in Sect. 2. The architectures of the classifiers are described in Sect. 3. The results are provided in Sect. 4. The paper ends with conclusions and outlook in Sect. 5.

2 Numerical examples

Lorentz-boosted Higgs tagging, focusing on the \(b\bar{b}\) final state, is the example used in this study. High-level features and low-level inputs are used to train shallow and deep-learning classifiers.

2.1 Monte Carlo samples

This study considers Lorentz-boosted Higgs tagging, focusing on the \(b\bar{b}\) final state. The signal is high \(p_T\) Higgs bosons and the background is generic quark and gluon jets. The hard-scatter reactions are common to all parton shower models and are generated with MadGraph5_aMC@NLO 2.7.3 [24] for modeling pp collisions at \(\sqrt{s}\) = 14 TeV. The PDF4LHC15_nnlo_mc [25] parton distribution function and the NNPDF30_nlo_as_0118 [26] parton distribution function are used for signal and background, respectively.

The hard-scattering events are passed to Pythia 8.303 [21] to simulate the parton shower, using three different complete parton-shower frameworks. The first is the default setting, in which the evolution variable is the virtuality of the off-shell propagator. The second is the Virtual Numerical Collider with Interleaved Antennae (Vincia) shower [27,28,29], in which the evolution variable is the transverse momentum for QCD + EW/QED showers based on the antenna formalism. The last is the Dipole resummation (Dire) shower, which is a transverse-momentum ordered dipole shower. The Pythia family uses the string model [30, 31] for hadronization, in which string fragmentation functions govern how strings break to form hadrons. Herwig 7.2.2 [20] with angularly-ordered showers is also used to model the parton shower. Herwig 7.2.2 implements the cluster hadronization model [32, 33], which is based on preconfinement: gluons are forcibly decayed into quark-antiquark pairs, which are then combined into colour-neutral clusters. Pyjet [34, 35] and the anti-\(k_t\) [36] algorithm with radius parameter R = 1.0 are used to define the jets.
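The jet-definition step can be illustrated with a self-contained toy implementation of the anti-\(k_t\) algorithm. This is only an educational sketch with a simplified \(p_T\)-weighted recombination scheme (and no \(\phi \) wrap-around handling in the average); the actual analysis uses Pyjet, which wraps FastJet:

```python
import numpy as np

# Educational sketch of anti-kt clustering with R = 1.0. NOT the
# pyjet/FastJet code used in the paper; recombination here is a simple
# pt-weighted average of (eta, phi), ignoring phi wrap-around.
def antikt(pt, eta, phi, R=1.0):
    """Cluster (pt, eta, phi) particles; return jets as (pt, eta, phi) tuples."""
    parts = [[p, e, f] for p, e, f in zip(pt, eta, phi)]
    jets = []
    while parts:
        best, pair = None, None
        for i, (pi, ei, fi) in enumerate(parts):
            diB = pi ** -2                       # beam distance d_iB = 1/pt^2
            if best is None or diB < best:
                best, pair = diB, (i, None)
            for j in range(i + 1, len(parts)):
                pj, ej, fj = parts[j]
                dphi = np.arctan2(np.sin(fi - fj), np.cos(fi - fj))
                dR2 = (ei - ej) ** 2 + dphi ** 2
                dij = min(pi, pj) ** -2 * dR2 / R ** 2
                if dij < best:
                    best, pair = dij, (i, j)
        i, j = pair
        if j is None:                            # closest to beam: final jet
            jets.append(tuple(parts.pop(i)))
        else:                                    # recombine the closest pair
            pi, ei, fi = parts[i]
            pj, ej, fj = parts.pop(j)            # j > i, so pop j first
            p = pi + pj
            parts[i] = [p, (pi * ei + pj * ej) / p, (pi * fi + pj * fj) / p]
    return sorted(jets, reverse=True)            # hardest jet first
```

Because the distance measure weights by the inverse squared \(p_T\), hard particles cluster first and the resulting jets are approximately circular around hard cores, which is why anti-\(k_t\) is the standard choice for defining Higgs-like large-R jets.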

An event preselection similar to Ref. [37] is used to reject most background events. The Higgs-like jet is required to satisfy 300 GeV < \(p^J_T\) < 500 GeV, 110 GeV < invariant mass of the jet (\(M_J\)) < 160 GeV, and to be double b-tagged. Jets are declared double b-tagged if they have two or more ghost-associated [38, 39] B hadrons. After the preselection, the high-level jet features and low-level features are used to probe the universality of discriminating boosted Higgs jets from QCD jets.
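The preselection can be sketched as array masks; the field names and example values below are illustrative, not taken from the analysis code:

```python
import numpy as np

# Illustrative jet records: (pT [GeV], jet mass [GeV], ghost-associated
# B hadrons). Field names are our assumption, not the paper's code.
jets = np.array([(350.0, 125.0, 2), (480.0, 90.0, 2), (320.0, 130.0, 1)],
                dtype=[("pt", "f8"), ("mass", "f8"), ("n_bhadrons", "i4")])

sel = ((jets["pt"] > 300.0) & (jets["pt"] < 500.0)        # 300 < pT^J < 500 GeV
       & (jets["mass"] > 110.0) & (jets["mass"] < 160.0)  # 110 < M_J < 160 GeV
       & (jets["n_bhadrons"] >= 2))                       # double b-tag
passing = jets[sel]
```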

Since the goal of this paper is to investigate the universality of hadronic jet classification, several simplifying assumptions are made. The background consists only of generic quark and gluon jets; the smaller \(t\bar{t}\) background is ignored. For each PSMC setup, the default parameters are used.

2.2 High-level features

In order to distinguish Higgs jets via a boosted decision tree (BDT) trained with gradient boosting and a fully connected (dense) neural network, the following six commonly-used high-level features are considered:

  1. \(M_J\): invariant mass of the leading jet;

  2. \(\tau _{21}=\tau _2/\tau _1\): n-subjettiness ratio [40, 41];

  3. \(D_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^3\) with \(\beta =1,2\): energy correlation function ratios [42];

  4. \(C_2^{(\beta )}=e_3^{(\beta )}/(e_2^{(\beta )})^2\) with \(\beta =1,2\): energy correlation function ratios [43];

where \(e_i\) is the normalized sum over doublets (\(i=2\)) or triplets (\(i=3\)) of constituents inside jets, weighted by the product of the constituent transverse momenta and pairwise angular distances. For this analysis, \(\beta \) is considered to be 1 and 2.
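The energy correlation functions above can be computed directly from the jet constituents. A minimal sketch, with constituents given as illustrative (\(p_T\), \(\eta \), \(\phi \)) triplets:

```python
import itertools
import math

# Sketch of the normalized energy correlation functions e_n^(beta) and the
# ratios C2 and D2, following the definitions in the text. Constituents are
# illustrative (pT, eta, phi) triplets, not real jet data.
def ecf(constituents, n, beta):
    """Normalized n-point energy correlation function e_n^(beta)."""
    pt_jet = sum(c[0] for c in constituents)
    total = 0.0
    for combo in itertools.combinations(constituents, n):
        z = 1.0
        for pt, _, _ in combo:
            z *= pt / pt_jet                      # momentum fractions
        ang = 1.0
        for (_, e1, p1), (_, e2, p2) in itertools.combinations(combo, 2):
            dphi = math.atan2(math.sin(p1 - p2), math.cos(p1 - p2))
            ang *= ((e1 - e2) ** 2 + dphi ** 2) ** (beta / 2.0)  # DeltaR^beta
        total += z * ang
    return total

def c2(constituents, beta):
    return ecf(constituents, 3, beta) / ecf(constituents, 2, beta) ** 2

def d2(constituents, beta):
    return ecf(constituents, 3, beta) / ecf(constituents, 2, beta) ** 3
```

By construction \(D_2^{(\beta )}=C_2^{(\beta )}/e_2^{(\beta )}\); both ratios are small for two-prong signal jets, where the three-point correlation is suppressed relative to the two-point one.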

Fig. 1

The six high-level features used to distinguish boosted Higgs boson jets from QCD jets

The distributions of these six variables are shown in Fig. 1, in which the capability of each observable to discriminate between signal and background is demonstrated. The salient features of these histograms are described below.

The jet invariant mass distribution peaks near the Higgs boson mass of 125 GeV [44] for the signal and is broad for the background. In the setup of this study, Herwig 7.2.2 with angularly-ordered showers leads to a slightly higher and broader signal peak than Pythia 8.303, due to a different underlying-event structure. Similarly, the distributions of \(\tau _{21}\), \(D_2^\beta \), and \(C_2^\beta \) have similar peak positions and shapes among the Pythia PSMCs, but differ somewhat for the angularly-ordered Herwig. The two-prong structure of the signal jets, arising from the decay of a massive object into two hard QCD partons, results in low \(\tau _{21}\), \(D_2\), and \(C_2\).

2.3 Low-level features

The low-level inputs to the CNN are images of the Higgs-like jets [45, 46]. The resolution is 40\(\times \)40 pixels in a 1R\(\times \)1R range, where R is the jet radius. The images consist of three channels, analogous to the Red-Green-Blue (RGB) channels of a color image [19]. The pixel intensities of the three channels correspond to the sum of the charged-particle \(p_T\), the sum of the neutral-particle \(p_T\), and the number of charged particles in a given region of the image. The Higgs-like jet images are rotated to align with the two-subjet axis: the leading subjet is at the origin and the subleading subjet is directly below it. If there is a third-leading subjet, the image is reflected. All images are normalized so that the intensities sum to unity. After normalization, the pixel intensities are standardized so that their distribution has zero mean and unit variance. Figure 2 shows the average Higgs-like jet images in the charged \(p_T\) channel. The patterns in the charged \(p_T\) channel are similar to those in the other two channels.
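The pre-processing chain (pixelation, normalization to unit intensity, then per-pixel standardization) can be sketched as follows; the exact image window and axis conventions are our assumptions, not specifications from the paper:

```python
import numpy as np

# Sketch of jet-image pre-processing: pixelate constituents into a 40x40
# grid, normalize each image to unit total intensity, then standardize
# each pixel across the dataset to zero mean and unit variance.
def pixelate(q1, q2, weights, npix=40, R=1.0):
    """Histogram constituents into an npix x npix image; the [-R, R]
    window in the rotated (Q1, Q2) coordinates is our assumption."""
    img, _, _ = np.histogram2d(q1, q2, bins=npix,
                               range=[[-R, R], [-R, R]], weights=weights)
    return img

def preprocess(images):
    images = np.asarray(images, dtype=float)
    images /= images.sum(axis=(1, 2), keepdims=True)  # unit total intensity
    mean = images.mean(axis=0)
    std = images.std(axis=0) + 1e-8                   # avoid divide-by-zero
    return (images - mean) / std                      # zero mean, unit variance
```

One channel (e.g. charged-particle \(p_T\)) is built by passing the corresponding per-constituent weights to `pixelate`; stacking three such images yields the RGB-like input described above.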

Figure 3 shows the difference between the four PSMC algorithms with respect to Pythia 8.303 default showering, referred to as the nominal simulation. The jet substructure differs among the other three PSMC simulations with respect to the nominal sample due to different approximations made for final-state radiation and other QCD effects. This diversity of PSMC approaches may affect the performance of jet classifiers trained on low-level features. Therefore, we train a convolutional neural network-based jet classifier to explore this generator dependence of the classification performance.

Fig. 2

Low-level features. The average of 40,000 Higgs-like jet images in the charged \(p_T\) channel (left and middle columns). \(Q_1\) and \(Q_2\) denote the new axes after the jet axis is centralized and rotated. The intensity in each pixel is the sum of the charged-particle \(p_T\). The total intensity in each image is normalized to unity. Images in the right column show the average difference between Higgs jet and QCD jet images

Fig. 3

The average difference between the other generators and the Pythia default showering for a Higgs-like jet images and b QCD jet images. \(Q_1\) and \(Q_2\) denote the new axes after the jet axis is centralized and rotated

Fig. 4

The QCD rejection (inverse QCD efficiency) as a function of the Higgs jet efficiency for classifiers trained on the four PSMC algorithms and applied to Herwig angular jets. The bottom panel shows the relative uncertainties

3 Classifier architectures

The BDT has a fixed number of estimators (1400) with maximum depth 5. The minimum fraction of samples required to split an internal node is fixed at 5%, and the minimum fraction required at a leaf node at 1%. This BDT model is trained on the high-level features of the jet using the scikit-learn library [48]. KerasTuner [49] is used to find the best configuration of hyperparameters.
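A sketch of this BDT configuration with scikit-learn's GradientBoostingClassifier; the toy dataset stands in for the six high-level features, and the number of estimators is reduced from the paper's 1400 so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the six high-level features (signal vs background).
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

bdt = GradientBoostingClassifier(
    n_estimators=100,         # 1400 in the paper; reduced for a quick run
    max_depth=5,              # maximum tree depth
    min_samples_split=0.05,   # 5% of samples to split an internal node
    min_samples_leaf=0.01,    # 1% of samples required at a leaf node
    random_state=0,
)
bdt.fit(X, y)
scores = bdt.predict_proba(X)[:, 1]   # per-jet signal probability
```

Passing fractions (floats in (0, 1)) to `min_samples_split` and `min_samples_leaf` makes scikit-learn interpret them as percentages of the training set, matching the 5%/1% description above.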

The dense neural network has four fully connected layers with 224, 928, 288 and 1024 neurons, respectively. Rectified linear unit (ReLU) activation functions are used for all layers of this neural network. Before the output layer, Dropout [50] regularization is added to reduce overfitting, with a dropout rate of 0.01. For this two-class problem, the activation function of the output layer is a sigmoid function. The binary cross-entropy loss function is optimized during training. The Adam optimizer [51] with a learning rate of 6.5428\(\times 10^{-5}\) is used to select the network weights. KerasTuner [49] is used to find the best configuration of hyperparameters. The Keras-2.4.0 library is used to train the dense neural network models with the TensorFlow-2.4.1 [52] backend, on an NVIDIA A100 SXM 80GB Graphics Processing Unit (GPU).
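Assuming the six high-level features as input and a single sigmoid output neuron (our reading of the text, not an explicit statement in it), the layer sizes above imply the following parameter count:

```python
# Parameter bookkeeping for the dense network described above.
# Assumed layout: 6 inputs -> 224 -> 928 -> 288 -> 1024 -> 1 (sigmoid).
layers = [6, 224, 928, 288, 1024, 1]

# Each fully connected layer contributes (n_in * n_out) weights + n_out biases.
params = sum(n_in * n_out + n_out
             for n_in, n_out in zip(layers, layers[1:]))
```

Under these assumptions the network has 774,881 trainable parameters, small enough that training on the six-feature inputs is cheap even with 50-fold cross-validation.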

Details of the CNN are as follows. The convolution filter is 5\(\times \)5, the maximum pooling layers are 2\(\times \)2, and the stride length is 1. ReLU activation functions are used for all intermediate layers of the neural network. The first convolution layer has 96 filters and the second convolution layer in each stream has 32 filters. A flatten layer is used after the second maximum pooling layer. Two dense layers are connected to the flatten layer with 350 and 400 neurons, respectively. Before the output layer, Dropout regularization is added with a dropout rate = 0.01. As for the dense network, the last activation is a sigmoid function and binary cross entropy is optimized during training. The AdaDelta optimizer [53] with learning rate 6.0216\(\times 10^{-3}\) is used to select the network weights. The KerasTuner [49] is used to get the best configuration of hyperparameters. The same setup as for the dense network is used to run the CNN.
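Assuming no padding ("valid" convolutions, our assumption since the text does not specify it), the feature-map sizes implied by the description above can be checked arithmetically:

```python
# Feature-map bookkeeping for the CNN described above, assuming
# 'valid' (no-padding) 5x5 convolutions with stride 1 and 2x2 max pooling.
def conv_out(size, kernel=5, stride=1):
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    return size // window

size = 40                        # 40x40 input images
size = pool_out(conv_out(size))  # conv 5x5 (96 filters) -> 36, pool -> 18
size = pool_out(conv_out(size))  # conv 5x5 (32 filters) -> 14, pool -> 7
flat = size * size * 32          # flatten layer feeding the dense layers
```

Under these assumptions the flatten layer has 1568 units, which then feed the 350- and 400-neuron dense layers.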

4 Results

In this study, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the maximum significance improvement characteristic (SIC), and the rejection (inverse background efficiency) at 50% signal efficiency are used as metrics to quantify the universality. The AUC is between 0.5 (poor classification performance) and 1 (maximum classification performance). The SIC is the signal efficiency divided by the square root of the background efficiency and represents by how much (as a multiplicative factor) the significance would improve with a given threshold on the classifier score. The maximum SIC is simply the largest SIC attained across all thresholds. In order to quantify the variation from the classifier training itself, the performance is evaluated with the k-fold cross-validation technique with \(k=50\). In this procedure, the datasets are randomly partitioned into 50 parts and, for each one, the other 49 sets are used to construct the classifier. The mean and spread over the folds are used to quantify the model performance.
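These metrics can all be computed from a single ROC curve. A sketch with synthetic scores (the toy distributions are ours; scikit-learn is assumed for the ROC utilities):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic scores: signal drawn slightly above background, so the toy
# classifier is better than random but far from perfect.
rng = np.random.default_rng(0)
y = np.concatenate([np.ones(5000), np.zeros(5000)])
s = np.concatenate([rng.normal(0.6, 0.2, 5000), rng.normal(0.4, 0.2, 5000)])

auc = roc_auc_score(y, s)                        # area under the ROC curve
fpr, tpr, _ = roc_curve(y, s)                    # background / signal efficiency
ok = fpr > 0                                     # avoid dividing by zero
max_sic = np.max(tpr[ok] / np.sqrt(fpr[ok]))     # max significance improvement
rej_50 = 1.0 / fpr[np.searchsorted(tpr, 0.5)]    # rejection at 50% signal eff.
```

Here `tpr` plays the role of the signal efficiency and `fpr` the background efficiency, so `tpr / sqrt(fpr)` is the SIC at each threshold and `1 / fpr` is the rejection.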

Figure 4 shows four classifiers trained on various simulations and then tested on the same Herwig dataset. Overall, the CNN has the best performance, and the DNN is marginally better than the BDT. The DNN and BDT are trained on the same features and, given the relatively low dimensionality of the problem, it is unsurprising that the two models have similar performance. Overall, the performance is nearly identical for all training sets. This is even true for the CNN, which has access to low-level substructure information inside the jets. The insensitivity to the choice of training set is in stark contrast to the sensitivity to the choice of test set, as summarized in detail below. Additional results can be found in Appendix A.

The performance of Fig. 4 for all combinations of train and test sets for the three machine learning models is summarized in Figs. 5, 6 and 7. Starting with Fig. 5, we observe that there is a significant spread in performance across test sets (rows). The separability of Higgs jets from QCD jets is almost 10% smaller for Herwig than for Pythia. However, the spread in performance for a given test set is about 1%. Similar trends are present for the rejection at a fixed efficiency (Fig. 6) and the maximum SIC (Fig. 7), albeit with larger sensitivities to the machine learning training.

Fig. 5

The performance of classifiers as quantified by the AUC when training on a given PSMC (color) and testing on the PSMC specified on the vertical axis. The symbols represent the type of model (BDT, DNN, CNN). The error bars represent the standard deviation over the k folds

Fig. 6

The performance of classifiers as quantified by the rejection at a fixed signal efficiency of 50% when training on a given PSMC (color) and testing on the PSMC specified on the vertical axis. The symbols represent the type of model (BDT, DNN, CNN). The error bars represent the standard deviation over the k folds

Fig. 7

The performance of classifiers as quantified by the maximum significance improvement when training on a given PSMC (color) and testing on the PSMC specified on the vertical axis. The symbols represent the type of model (BDT, DNN, CNN). The error bars represent the standard deviation over the k folds

5 Conclusions and outlook

We have explored the universality of classifiers trained for hadronic jet tagging. In particular, we have studied the sensitivity of the learned classifier to the Parton Shower Monte Carlo program used during training. While the modeling of the hadronic structure differs significantly among PSMCs, we find that the actual function learned is nearly independent of the training set. This gives us confidence that a classifier trained on one PSMC and tested on another (or on data) will likely still be optimal. Although it is not directly a source of uncertainty for physics analyses, this observation has important implications for making the best use of our data. Classifier universality does not mean that the systematic uncertainty from hadronic modeling is small, as bias and optimality are separate concepts (see e.g., Ref. [8]).

The universality not only has important experimental implications, but also motivates further theoretical studies. As in the quark versus gluon jet example referenced in Sect. 1, the universality of the classifiers suggests that a theoretical explanation of the classification performance may be attainable as it should be insensitive to the detailed modeling assumptions of a particular PSMC program. We look forward to studies in this direction.

Uncertainty quantification is a critical component of any analysis at the LHC and this task is particularly challenging for analysis strategies like machine learning that are sensitive to low-level hadronic modeling. While determining systematic uncertainties on the potential bias of a result from hadronic modeling is still an active area of research and development, we have shown that at least the optimality of machine learning classifiers is relatively insensitive to hadronic modeling. While we have observed this disconnect between bias and optimality for Higgs jet tagging, we conjecture that this is a generic feature of QCD and it may also be present in other systems at the LHC and beyond.