Interpretable deep learning for two-prong jet classification with jet spectra

Classification of jets with deep learning has gained significant attention in recent times. However, the performance of deep neural networks is often achieved at the cost of interpretability. Here we propose an interpretable network trained on the jet spectrum S2(R) which is a two-point correlation function of the jet constituents. The spectrum can be derived from a functional Taylor series of an arbitrary jet classifier function of energy flows. An interpretable network can be obtained by truncating the series. The intermediate feature of the network is an infrared and collinear safe C-correlator which allows us to estimate the importance of an S2(R) deposit at an angular scale R in the classification. The performance of the architecture is comparable to that of a convolutional neural network (CNN) trained on jet images, although the number of inputs and complexity of the architecture is significantly simpler than the CNN classifier. We consider two examples: one is the classification of two-prong jets which differ in color charge of the mother particle, and the other is a comparison between Pythia 8 and Herwig 7 generated jets.


Introduction
Deep learning is gaining significant interest recently in the field of collider data analysis.One of the primary motivations is to extract the maximum information from the complex collision events.The deep learning in collider physics takes advantage of a large influx of data from experiments, more precise theoretical predictions, significant improvement in computing power, and ongoing progress in the field of machine learning itself.Such techniques offer advances in areas ranging from event selection to particle identification.
The output of a neural network is, in general, a highly non-linear function of the inputs.A neural network classifier often acts like a "black box".One may consider architectures with posthoc interpretability [57] which allows us to extract information, other than its prediction, from the learned model after training.A simple strategy is using a predefined functional form to restrict the representation power of the neural network [31,58].Then the network is interpreted in terms of the functional form.The aim of this paper is also to construct an interpretable neural network architecture that allows us not only to interpret the predictions of the network but also to visualize it in terms of trained weights connected to physical variables.
In [28], a multilayer perceptron (MLP) trained on two-point correlation functions S 2 and S 2,trim of angular scale R was introduced.The S 2 (R) and S 2,trim (R) spectra are constructed from the constituents of a jet before and after the trimming [59] respectively.The angular scale R is an important parameter for describing kinematics of a decaying particle and parton shower (PS); hence, these spectra efficiently encode the radiation pattern inside a jet.The MLP trained on these inputs learns relevant features for the classification among the Higgs boson jet (Higgs jet) and QCD jet.
In this paper, we connect the spectra to energy flow functionals P T ( R) [60], i.e., we consider transverse energy of a jet constituent as particle-specific information at R in the η − φ plane [61].The spectra are basis vectors of infrared and collinear (IRC) safe variables called bilinear C-correlators [60] whose angular weighting function depends only on the relative distance between two constituents.Those correlators naturally appear in the functional Taylor series of a classifier of P T ( R), and the MLP can be considered as a subseries of the Taylor series.We show that the performance of the MLP and neural networks trained on jet images [18,19,21,62] are comparable.This strongly suggests that S 2 and S 2,trim contain sufficient information for jet classification.Encouraged by this feature, we construct an interpretable architecture by truncating the series.Namely, dR S 2 (R)w(R; x kin ) can be implemented in a classifier after proper discretization in R, where x kin is a set of kinematic variables of the jet and w is a smooth function.By reading the weights w(R; x kin ), we could quantify important features for the given classification problem.
Jet substructure studies often suffer from systematic uncertainties of soft activities.The soft radiations generated by a Monte Carlo program are strongly model dependent.While this mismodeling could be corrected by using real data, it is certainly useful to use input variables with less systematic uncertainties.When hard substructures are important for solving the problem, we may use jet grooming techniques [1,10,59,63,64] to remove the soft activity.Instead of throwing this soft activity away, we encode it in S 2,soft (R) which is S 2 (R) − S 2,trim (R).Then, the inputs S 2,trim and S 2,soft include hard and soft substructure information respectively.The interpretable architecture trained on S 2,trim and S 2,soft is able to quantify these features.We study two classification problems: one is classificaiton of two-prong jets to understand their hard substructures and color coherence, and the other is comparison of Pythia 8 [65] and Herwig 7 [66,67] events to quantify the differences.
The paper is organized as follows.In section 2, we review S 2 and S 2,trim and show its relation to energy flow and C-correlators.We also show S 2 and S 2,trim distributions of typical Higgs jet and QCD jet.A hypothetical color octet scalar particle, sgluon, decaying to b b is considered to study the color connection in two-prong jets.In section 3, we first discuss the capability of S 2 and S 2,trim for the classification of two-prong jets and show the result of an MLP trained on those inputs.The results are then compared with that of a CNN trained on jet images.In section 4, we introduce a two-level architecture consists of a softmax classifier and an MLP trained on S 2 and S 2,trim .The intermediate feature of this architecture is the bilinear C-correlator whose basis vectors are S 2,trim and S 2,soft and components are generated by the MLP.We visualize and interpret the weights of the given classification problem.Finally, the summary and outlook are given in section 5.
2 Two-Point Correlation Spectrum and Two-Prong Jets

Jet Spectra
In [28], we introduced a two-point correlation spectral function S 2 (R) which maps a jet to a function of angular scale R, where J is a set of jet constituents, R i = (η i , φ i ) is the position of the i-th jet constituent in the pseudorapidity-azimuth plane, is the angular distance between the two jet constituents i and j, and P T ( R) is an energy flow functional [60] of J.For practical purpose, S 2 (R) is discretized as below, where The spectral function S 2 (R; ∆R) is, therefore, the sum of the product of p T 's of the two jet constituents with an angular distance R ij lying between R and R + ∆R.We obatin IRC safe quantities by multiplying smooth functions2 w( R) and P T ( R) (or S 2 (R)), and integrating over R. To understand the IRC safety of P T ( R), let us consider a splitting of a given constituent i 0 in J into two constituents, i 0 → i 1 i 2 .The inner product of w( R) and the difference of the energy flows before and after the splitting, δP T ( R), is given as follows, where δp T,i0 = p T,i1 + p T,i2 − p T,i0 , and δ R i1(i2) = R i1(i2) − R i0 .The soft limit, where i 2 carries a small momentum, corresponds to δp T,i0 , δ R i1 , p T,i2 → 0, while δp T,i0 , δ R i1 , δ R i2 → 0 in the collinear limit.The integral vanishes in these limits, namely the energy flow after parton splitting converges weakly [60] to the one before splitting.The spectrum S 2 (R) inherits the same property.The inner product of the smooth function w(R) and the difference of the spectrum, δS 2 (R), before and after the splitting i 0 → i 1 i 2 is given as follows, Again, this integral vanishes in the IRC limits.Note that the binned spectrum S 2 (R; ∆R) is not completely IRC safe because of the discontinuity of the indicator function at the bin boundaries.
Nevertheless, when the domain is discretized into small sections [R i , R i + ∆R i ), the IRC unsafe terms cancel in the sum, i S 2 (R i ; ∆R i ) w(R i ), and it is approximately IRC safe up to binning errors.
The resulting IRC safe observables belong to C-correlators [60] which are multilinear forms of the energy flow.An n-linear C-correlator is expressed as follows, where w is a continuous function of R 1 , • • • , R n .For example, an inner product of P T ( R) and w( R) is a linear C-correlator, and an inner product of S 2 (R) and w(R) is a bilinear C-correlator with w depending only on the relative distance R 12 , Many well-known jet observables belong to the C-correlator, for example, a jet transverse momentum p T,J is a linear C-correlator with w( R 1 ) ≈ 1, a jet mass m J is a bilinear C-correlator with w( R 1 , R 2 ) ≈ R 2 12 /2.The S 2 (R) spectra use all the jet constituents, but it is useful to separate the correlations of constituents of the hard subjets; we consider jet trimming for this purpose.We recluster the constituents of a jet of a radius parameter R J to subjets with a smaller radius parameter R trim .A subjet J a is discarded if p T,Ja < f trim p T,J , where p T,J and p T,Ja are the transverse momenta of the jet and a-th subjet respectively.The trimmed jet J trim is defined as a union of the remaining subjets, The jet trimming is beneficial because it does not introduce additional angular scale parameters other than R trim .The trimmed spectrum is then calculated using the constituents of the trimmed jet.We denote it as S 2,trim (R) and its binned version S 2,trim (R; ∆R) which are defined as follows: where P T,trim ( R) is the energy flow of J trim .
In the limit of the constituents of each subjet J a are localized, the energy flow and the jet spectrum can be approximated in terms of the subjet momenta.The energy flow of such a jet is decomposed into a sum of energy flows of all the subjets, (2.13) The energy flow of each subjet converges weakly to p T,Ja δ( R − R Ja ).The S 2 (R) spectrum can be approximated by the momenta of the subjets, i.e., (2.14) The jet trimming also introduces a p T scale hierarchy among the subjets, and so their pairwise contributions to S 2 (R; ∆R) can be classified by the scale.We define a quantity S 2,soft (R; ∆R) where, S 2,soft (R; ∆R) = S 2 (R; ∆R) − S 2,trim (R; ∆R). (2.15) In the r.h.s. of the above equation, the correlations among the constituents of the hard subjets are canceled, and we have, ) (2.17) The dominant contributions to S 2,soft (R; ∆R) (i.e., the O [f trim ] terms) come from the correlations between a constituent in J trim and a constituent in J−J trim .The subleading O f 2 trim terms denote the correlations among the constituents in J − J trim .

Derivation of Classifiers based on Energy Flows and Jet Spectra
We discuss the relation between S 2 (R) and neural network classifiers trained on the energy flow P T ( R).A general softmax classifier that solves K-class jet classification problem can be expressed as a functional Ψi which maps the energy flow to real numbers h i , i.e., where w are the weights and biases of the output layer, and ŷ is the prediction of the classifier.Here the ϕ softmax is the softmax function whose k-th component is expressed as follows, Many jet classifiers can be expressed in the form of eq.(2.18).For example, in the cut-based analysis, Ψi are the jet substructure variables, such as the ratios of n-subjettiness [12], ratios of energy correlation functions [16,17] etc.The deep neural network classifiers, such as artificial neural network tagger [18], convolutional neural network using pixelated jet images [19], energy flow network [31] etc., are also described by eq.(2.18).The neural networks that are introduced in section 3 and section 4 also belong to this category.
The jet spectra S 2 and S 2,trim can be derived from eq. (2.18) using a functional Taylor expansion.The energy flow is decomposed by trimming as follows, One can express Ψi [P T,a ] as a functional series at a reference point P T,a ( R) = 0, where w i,a can be chosen as a constant if we are not interested in features depending on reference vectors, for example, jet axes, beam directions, etc.The linear term in P T ( R) of eq.(2.22) is related to the jet momentum p T,J and trimmed jet momentum p T,J,trim as follows, The second order coefficient w i,ab , the first non-trivial term of the series expansion, is a function of the relative distance of R 1 and R 2 .The basis vectors of w (2) i,ab are two-point correlation functions S 2,ab (R), The spectra S 2 and S 2,trim are expressed in terms of S 2,ab as follows, Instead of the energy flows, we consider a classifier of S 2,A (A = 1,2), where x kin is a set of additional inputs to the classifier based on the kinematics of the jet, and the index A is equal to 1 for "trim" and 2 for "soft".Similar to eq. ( 2.24), we expand eq. ( 2.27) around S 2,A (R) = 0 as where w are the corresponding weight functions.One may further truncate the series to get a linear form, i,A (R; x kin ). (2.29) The above-mentioned linear setup has an advantage on the interpretability and visualization of the network predictions; we discuss more on this network in section 4.

Spectra of Two-Prong Jets
The S 2 (R; ∆R) spectrum is useful to identify the substructures of the jet and also to characterize the jet.Typically, S 2 (R; ∆R) of QCD jets have a peak at R = 0 with a long tail towards large R.
The peak originates from the autocorrelation term i p 2 T,i in eq.(2.3).On the other hand, if a jet originates from a Higgs boson decaying into b b, the b-partons create two isolated cores inside the jet.The spectra have a peak at the angular scale equal to the angle between the two clusters.In addition, S 2 (R; ∆R) encodes the fragmentation pattern of b-partons.
At the LHC, boosted heavy objects such as top quark, gauge bosons and Higgs boson decaying into quarks can be studied by identifying jet substructures.Usually, these substructures are characterized by parameters such as D 2 defined as, where β is the angular exponent.If a jet has a two-prong substructure then D 2 is much less than one.The jet spectrum S 2 (R; ∆R) contains more information than D 2 , and therefore, the analysis with S 2 (R; ∆R) goes beyond the one using D 2 .It was shown that a neural network trained on S 2 (R; ∆R) distinguishes Higgs jet from QCD jet better than the one trained on D 2 [28].
To study the fragmentation pattern of the b-partons and their color connection to the mother particle, we introduce a color-octet scalar, sgluon (σ).We assume that the Higgs boson (h) and σ decay into b b though the interaction, The Higgs boson is a color singlet particle and the decay h → b b is isolated in color flows.Therefore, S 2 (R; ∆R) beyond the angle between the b-partons, i.e., R b b, is suppressed due to the color coherence.No such constraint on the angular scale exists for sgluon and QCD jets.Meanwhile, the Higgs jet and sgluon jet have the same two-prong substructure, unlike the QCD jet, as both are originating from a particle decaying into b b final states.
To study the spectra of two-prong jets, we simulate events as follows.We use Madgraph5 2.6.1 [68] to generate the events of pp → Zh, pp → Zσ, and pp → Zj processes with the Z boson decaying to a pair of neutrinos.These events are then passed to Pythia 8.226 [65] for the parton shower and hadronizations.To study the impact of the parton shower and hadronization schemes, those parton level events are also passed to Herwig 7.1.3[66,67].A color octet scalar UFO model [69,70] generated by Feynrule 2.0 [71,72] is used to simulate pp → Zσ process.The masses and widths of Higgs boson and sgluon are m h = m σ = 125 GeV and Γ σ = Γ h = 6.4 MeV.The detector response is simulated by Delphes 3.4.1 [73] with the default ATLAS detector configuration.We use FastJet 3.3.0[74,75] to reconstruct jets from the calorimeter towers using anti-k T algorithm [76] with the radius parameter R J = 1.0.For jet trimming, we use R trim = 0.2 and f trim = 0.05.We select the events with the leading jet transverse momentum p T,J ∈ [300, 400] GeV and its mass m J ∈ [100, 150] GeV.For Higgs jet and sgluon jet, we additionally require that the two b-partons originating from their decay are located within R J from the jet axis.More details on our simulations are described in appendix A.
QCD jet MG5+PY8+Delphes In the jet image, a red triangle is a position of a parton level quark in the jet, and a green "+" shows a leading gluon emitted from the quark.
Higgs jet MG5+PY8+Delphes The blue triangles in the jet image are positions of the parton level bottom quarks from the higgs decay, and a green "+" shows the position of a leading gluon emitted from a bottom quark.
sgluon jet MG5+PY8+Delphes In figure 1, we show the pixelated jet image (left panel) and S 2 and S 2,trim spectra (right panel) of a QCD jet.There are high energy deposits in the jet image near the jet center along with a wide spray of soft activity.It has also a moderate amount of radiation at (−0.4, 0.0).As a result, S 2 (R; ∆R) spectra has a long tail starting from R = 0.4.The jet trimming eliminates a significant amount of soft particles and, therefore, the tail does not appear in S 2,trim (R; ∆R).The remaining cross-correlations contributing to S 2,trim (R; ∆R) are the ones between the high and moderate energy deposits.Most of the energy deposits are concentrated at the center and the peak intensity at R = 0.4 is much lower than the intensity from autocorrelations at R = 0.
In figure 2, we show S 2 (R; ∆R) and S 2,trim (R; ∆R) distributions of a Higgs jet.For this particular event, S 2 (R; ∆R) distribution is similar to S 2,trim (R; ∆R) distribution and their difference S 2,soft (R; ∆R) is hard to be seen.No significant activity has been observed beyond the peak at R ∼ 0.8, mostly because the Higgs jet is very compact compared to the QCD jet.Correspondingly, there are two prominent subjets in the jet image, while most of the cells have no jets.
Finally, we show the S 2 (R; ∆R) and S 2,trim (R; ∆R) distributions of a sgluon jet in figure 3. The S 2 (R; ∆R) distribution has a large peak at R = 0.6 which is as significant as the one at R = 0.This spectrum is qualitatively similar to the Higgs jet in figure 2. However, the S 2 (R; ∆R) spectrum has a long tail beyond R J as compared with that of a Higgs jet.The tail disappears after jet trimming, like the QCD jet in figure 1, that makes the S 2,trim (R; ∆R) distribution more compact.From figure 1-3, we observe that S 2,trim and S 2,soft include useful information that are complementary.
In [28], it was shown that a neural network classfier trained on S 2 (R; ∆R) and S 2,trim (R; ∆R) spectra performs better than that of without S 2,trim (R; ∆R).The reason is that the hard and soft correlation terms in S 2 (R), i.e., O [1] terms in eq.(2.16) and trim ] in eq.(2.17) respectively, can be resolved by the jet trimming.Therefore, we use the orthogonal combinations, namely S 2,trim and S 2,soft , throughout this paper.
The S 2,trim and S 2,soft spectra encode the important features of the parton shower and fragmentation, and, thus, may be regarded as a well-motivated prototype.The hard partons evolve by the parton splittings i → i 1 i 2 , which are parameterized by the angle R i1i2 and momentum fraction z with p T,i1 = zp T,i and p T,i2 = (1 − z)p T,i .The splitting generates two-point correlation z(1 − z)p 2 T,i at R i1i2 .Therefore, S 2 spectra encode the parton splitting at any angular scale.

Classifying Higgs jet, sgluon jet, and QCD jet
In this section, we introduce a neural network trained on the jet spectra for classifying Higgs jet, sgluon jet, and QCD jet.We first discuss the basic kinematic features of these jets and then outline their dependence on the parton shower simulators.Afterward, we show the details of the neural network and then present our results in terms of the receiver operating characteristic (ROC) curves.

Basic Kinematics
In figure 4, we show p T,J and m J distributions for the Higgs boson, sgluon and QCD jets.The solid and dashed lines correspond to Pythia 8 (PY8) and Herwig 7 (HW7) generated jets, respectively.The mild differences in the p T distribution are due to the difference of their matrix elements.The Higgs jet is produced via s-channel process, while the sgluon and QCD jet are produced via tchannel and u-channel processes; hence, p T,J scalings are different.Not much difference is observed between the p T,J distributions of PY8 and HW7 samples.This is because p T,J is mostly determined by the matrix level p T of the leading parton and the jet algorithm with large radius parameter clusters most of the radiations from this parton into a single jet.However, the difference between m J distributions is large.The peak at m σ of sgluon jet is significantly broader than that of Higgs jet because radiations of the b-partons from the Higgs boson decay are mostly confined due to the color coherence but those of the sgluon are not.As a consequence, PY8 and HW7 generate different m J distributions.
We assume that both Higgs boson and sgluon have narrow-widths although sgluon width can be large.An increase in the width will broaden R b b distribution of σ → b b that has a peak at the characteristic angular scale, For example, the variation of R b b is only about 0.07 for p T,J = 300 GeV for Γ σ = 10 GeV and 0.05 for p T,J = 400 GeV.It is close to the calorimeter angular resolution ∼ 0.1 and does not affect the calorimeter level analysis.We first make a quantitative estimate of the radiation pattern inside the jet.To do so, we define two quantities comparing S 2 (R) spectra around Rb b, with a = 0.75, a = 1.25 and C = 40.The ratio R sym compares energy deposits around Rb b and in its surrounding angular scales [28].The ratio is sensitive to the correlation between the two hard substructures of the Higgs jet.On the other hand, The R rad is sensitive to the color of mother particle as it compares energy deposits in the large angular scales.We show the R sym distributions in the left panel of figure 5.The distributions of the Higgs jet and sgluon jet are similar because both of the S 2 (R) spectra have a peak at R b b.Meanwhile, the two-point correlations for the QCD jet are not localized around the R b b scale, so the R sym is smaller than that of a Higgs jet and a sgluon jet.In the right panel of figure 5, we show the R rad distributions.The R rad of the sgluon jet and QCD jet are large in average, while R rad is smaller for Higgs jet because large angle radiations are suppressed.The difference in R sym and R rad distributions between PY8 and HW7 samples is small; however, there is an appreciable difference in the restricted phase space.In figure 6, we plot R rad distributions after the selection, R sym > 0.85, so that the jets always contain two hard subjets with similar transverse momenta.The PY8 (solid line) and HW7 (dashed line) samples have significantly different R rad distributions for the Higgs jet.Such a difference is not observed for the QCD/sgluon jets.The observed deviation for the Higgs jets could be originating from the difference of the parton shower scheme.The angular-ordered shower is adopted in HW7.On the other hand, the p T -ordered shower is the default shower algorithm for PY8 where angular ordering is enforced by hand.An artificial veto in p T -ordered shower introduces the mismatch to the angular-ordered shower at double-leading log level [77,78].

Multilayer Perceptron of Spectra
We introduce a neural network trained on the kinematic and spectrum (S 2,trim and S 2,soft ) variables to classify the jets.A schematic diagram of the architecture of the classifier is shown in figure 7.
The following set of inputs is used, Here we take the bin width ∆R = 0.1, which is approximately the angular resolution of hadronic calorimeter of the ATLAS detector.Note that the maximum separation between any two constituents of the jet is 2R J .A multilayer perceptron (MLP) with L layers is used to map the inputs to the class prediction.The following first order recurrence relation between the layers describes an MLP, h where w i are the weights and biases of the -th layer.The activation function of -th layer, ϕ ( ) : R → R, is a monotonic and nonlinear function.We use four hidden layers with 1000, 800, 400 and 200 nodes, respectively, with a rectified linear unit (ReLU), ϕ ReLU (x) = max(0, x), as the activation function.This MLP will identify important features of inputs for the classification after training.To make a class prediction, the outputs of the MLP are provided to a softmax classifier in eq.(2.19).The whole network architecture is illustrated in figure 8.The MLP is trained by minimizing a loss function including categorical cross-entropy and L 2 weight regularization [79], where N events is the total number of events in the training data set, λ is a weight decay constant associated to the L 2 regularization and we choose λ = 0.01.The y i (ŷ i ) denotes the components of the truth (predicted) label vector y (ŷ).The L 2 weight regularization reduces the over-fitting on the training data and also allows smooth extrapolation to the phase space that is not covered by the training sample.The minimization is done with ADAM optimizer [80].After the minimization of the loss function, the softmax layer provides scores of the classes of a given event.The truth label vectors are defined as follows, (1, 0, 0) Higgs jet, (0, 1, 0) sgluon jet, (0, 0, 1) QCD jet. (3.9) The unnecessary symmetries in the neural network are broken by using the Glorot uniform initialization method [81].The weights in the hidden layers are initialized by assigning random numbers between [− 6/(N in + N out ), 6/(N in + N out )], where N in and N out are numbers of inputs and outputs of a layer, respectively.The biases are initialized to zero.All the inputs are standardized before training.The architecuture is implemented in Keras [82] with backend TensorFlow [83].In figure 9, we show ternary plots of the predicted label vector ŷ.The three sides of the triangle (starting from the base of the triangle and then counterclockwise) are ŷ1 , ŷ2 , ŷ3 axis; we denote them as ŷh , ŷσ and ŷQCD respectively.The ŷ distributions of the Higgs jet and QCD jet have high density spots that do not overlap with each other.It means that the network has found the exclusive features of those two kinds of jets.The two-prong substructure of a Higgs jet and the one-prong structure of a QCD jet are the exclusive features.However, the two-prong substructure of a sgluon jet is more radiative and less exclusive, and therefore, there are no high density spots in the ŷ distribution of the sgluon jet.
Next, we show ROC curves of binary classifications in figure 10 with the red dotted lines.The following signal-background classifications are considered: Higgs-QCD, sgluon-QCD, and Higgssgluon.We assign the truth label vectors y = (1, 0) for the signal and y = (0, 1) for background.The QCD jet mistag rates are comparable for both Higgs-QCD and sgluon-QCD classifications, however, the separation between the Higgs jet and sgluon jet is weaker.
We now compare these ROC curves with that of CNN trained on jet images. 3The CNN classifier takes 20 × 20 inputs of the jet images, while 2 × 20 inputs of S 2,trim and S 2,soft spectra are used for the MLP.The solid blue lines in figure 10 denote the ROC curves of the CNN.Some improvement in the background mistag rates is observed compared with the MLP classifier.Quantitatively, it is only, 0.2% (= 2.5% − 2.3%) at signal acceptance of 20% for Higgs-QCD classifcation.Note that the MLP trained on S 2,trim and S 2,soft is expressed by eq.(2.28), and the CNN is expressed by eq.(2.22) when we use large kernel sizes for each convolutional layer so that the CNN can be considered as an MLP trained on jet images.The missing terms which are not included in eq.(2.28) are the source of degradation.Nevertheless, the small difference means that S 2,trim and S 2,soft captures relevant features for these classifications.

Event Generator Dependence
The classifier introduced in the previous subsection uses not only the information of hard subjets encoded in S 2,trim but also the soft activities captured in S 2,soft as well.This leads to concerns about the accuracy of the models of soft physics.Specifically, the performance of the classifier could be sensitive to the soft activities in the jet while the simulated soft activities may be significantly different from the truth.
In figure 11, we compare the ROC curves of the MLP trained with PY8 and HW7 samples.As these two event generators are based on different modeling of parton shower and hadronization scheme, the comparison would give us a reasonable estimate of the systematic uncertainty originating from the generator choice.The ROC curves of the MLP trained on S2,trim and S 2,soft .The solid and dashed lines correspond to the classifier trained with PY8 and HW7 respectively.The red and blue lines correspond to the classifier tested with PY8 and HW7 respectively.We show the results of discriminating Higgs jet vs QCD jet (left), sgluon jet vs QCD jet (center), and Higgs jet vs sgluon jet (right).
In the left panel of figure 11, we compare the ROC curves of the Higgs jet vs QCD jet classification for different generator choices.By doing this exercise, we estimate a systematic uncertainty in the predictions of the classifier by comparing ROC(PY8, PY8) and ROC(HW7, HW7) curves, where the first and second entries in the parenthesis correspond to the generators used to simulate the training and test samples respectively.On the other hand, ROC(PY8, HW7) and ROC(PY8, HW7) show degradation of the performance of classifier trained on the "wrong sample" to analyse "real events".
The performance of the classifier improves as we vary generator combinations in the following order: ROC(HW7, HW7), ROC(PY8, HW7), ROC(PY8, PY8), and ROC(HW7, PY8).We find that the classification performance is significantly better for PY8 test samples than that of HW7 samples.On the other hand, the classification performance for the same test samples hardly depend on the classifiers, namely ROC(HW7, PY8) ∼ ROC(HW7,HW7) and ROC(PY8,HW7) ∼ ROC(PY8, PY8).For the Higgs jet vs QCD jet classification, the classifier mostly concentrates on the core substructures within the jet, and here both PY8 and HW7 provide similar kinematics and radiation spectra.Therefore, we do not observe any significant change in the ROC curves by varying training samples while keeping the test samples the same.
In the middle panel of figure 11, we compare the classifier performance for the sgluon jet vs QCD jet classification.It improves in the following order: ROC(HW7, PY8), ROC(PY8, HW7), ROC(HW7, HW7) and ROC(PY8, PY8).The classifiers are indeed sensitive to the choice of generators.The network trained with PY8 (HW7) samples have failed to capture the features of HW7 (PY8) test samples.The networks have focused on different portions of the distribution of the fragmentation functions.In the right panel of figure 11, the ROC curves for the Higgs jet vs sgluon jet classification show similar behavior.

Interpretable Two-level Architecture
Quantitative understanding of a neural network is not straightforward because the parameters and intermediate outputs of the neural network are less readable.In this section, we propose an architecture constructed from the truncated series in eq. ( 2.29) and try to explain quantitatively how this network classifies events.The discretized architecture is then defined as follows, where w k A are the trainable weights.This setup is essentially a logistic classifier on S k 2,trim and S k 2,soft .The loss function is the categorical cross-entropy as defined in eq.(3.8).In the case of binary classifications with a single h, change in w = −1 without loss of generality.After the training, the magnitude of S k 2,trim w k 1 or S k 2,soft w k 2 is high when the corresponding S k 2,trim or S k 2,soft is useful for the classification.Note that the logistic classifier does not take into account the p T,J dependence of Rb b.Therefore, we introduce a two-level architecture which is a variant of the logistic classifier.The weights w k A are calculated by a kinetic module Φ k A ( x kin ) of an MLP trained on x kin = (p T,J , m J ), A schematic diagram of this setup is shown in figure 12.The inputs x kin are standardized before training, and S k 2,trim and S k 2,soft are divided by their maximum value of the training sample because standardizing the spectra reintroduce the zeroth order term of eq.(2.28).This architecture is similar to the self-explaining neural network [84].The Φ k A is modeled with the MLP of two hidden layers with exponential linear unit (ELU) activations [85], The nodes of the successive layers are configured as 400 ELU, 200 ELU, 2 × 20 linear respectively.We do not use ReLU on modeling Φ k A because dead ReLU nodes with a zero gradient kill x kin dependency of the weights.This is known as the dying ReLU problem.In this case, the architecture reduces to the logistic classifier.
The vanishing gradient problem arises when the momentum range of the training sample is too small, which generates constant weights.The characteristic scale of Rb b is [0.625, 0.833] for the p T,J range [300, 400] GeV.The variation in Rb b is about 0.2 which is not significantly large compared with the calorimeter resolution of 0.1.Therefore, we extend the p T,J range of all the samples by [300, 600] GeV.In addition, we avoid vanishing gradient problem by using He uniform initializer [86].The weights and biases are initialized by uniform random numbers in [− 6/N in , 6/N in ] where N in is the number of inputs to the layer.The advantage of using He initializer over Glorot initializer is that it generates random numbers in a wider range so that the neural network can start up from wider initial weights and gradient.The weight decay parameter λ of the L 2 weight regularizer is set to 0.001 so that the weights do not vanish too early.
After the successful training, the performance of the two-level architecture is close to that of the MLP in section 3. The green dotted lines in figure 10 is the ROC curves of the classifier.The difference is smaller than the systematic uncertainty shown in figure 11.This makes a good reason to believe that the weights in eq. ( 4.3) captures the essential features of the MLP in section 3. The weights w k 1 (left) and w k 2 (right) of Higgs jet vs QCD jet classification.We show the weights for p T,J values in the range [300, 550] GeV while fixing the m J to 115 GeV.The weights at an angular scale smaller than 1.0 are magnified by 10.The value of w2 in the bin [0,0.1),i.e., w 0 2 , is approximately 16 for all the values of p T,J .
In figure 13, we show the weight functions w 1 (left) and w 2 (right) of Higgs jet vs QCD jet classifier trained with MG5+PY8+Delphes samples.Note that the weights are outputs of the neural network for the given x kin , and not the output with the sample at the indicated p T,J and m J .The weights in R > R J and w 2 in R < R trim are large compared to the weights in other angular scales.The S 2,trim and S 2,soft on these angular scales are typically smaller than that in other scales.Therefore, their weights become large to compensate for the energy difference when the corresponding value of the spectrum is useful for jet classification.The dotted lines in figure 13 denote the values of Rb b, defined in eq.(3.1), for different values of p T,J .
The w 1 around Rb b is positive because the correlation at the scale is a characteristic feature of Higgs jet.If a Higgs boson is decaying to a pair of bottom quarks perpendicular to the boosted direction in its rest frame, the relative angular separation of the decay products is Rb b in the lab frame.Due to the phase space of the decay, most of the events are distributed near θ = π/2, where θ is Higgs decay angle relative to the boost direction.As p T,J increases, Rb b decreases and the lower edge of w 1 > 0 moves toward smaller values.The region with w 1 > 0 also shifts towards smaller values of R. The weight w 1 on R > Rb b is positive for capturing a Higgs jet whose θ is smaller than π/2.These events have p T asymmetric subjets, and the cross-correlation terms in S 2,trim (R) are smaller than that of p T symmetric case.These correlations are still useful for the classification because S 2 spectrum of QCD jet reduces much faster than that of Higgs jet.As a result, the w 1 is an increasing function in this region to compensate for the S 2,trim reduction.
For R R J , the weight w 1 is negative.The score ŷh decreases whenever there are any energy deposits at R R J .The crossover point from w 1 > 0 to w 1 < 0 shifts towards smaller values of R with the increase of p T,J because the Higgs decay products become more asymmetric with respect to the boost direction.In such a case, one of the subjets tends to be soft so that the two-point correlations are included in S 2,soft , instead of S 2,trim .These contributions to S 2,soft do not affect w 2 , because S 2,soft of QCD jets are overwhelming at large R.
The S 2,soft on R > R trim always reduces ŷh and there is no prominent structure around Rb b.Moreover, |w 2 | decreases as p T,J increases.The reduction of w 2 is a compensation of the increase of S 2,soft and the prediction is more or less p T,J independent.On R > R J , |w 2 | increases with R because activity in this region is a sign of QCD jet even though corresponding S 2,soft decreases due to suppressed large angle radiations.
The w 2 on R R trim is positive and w 2 has a break at R ∼ R trim .Correlations between the constituents in a soft subjet contributes to the S 2,soft on R < R trim , i.e., S 2,soft (0; R trim ) ∝ f 2 trim .The reason why the weights are positive is follows.Let us assume J a is a single soft subjet, then S 2,soft (0; R trim ) ∼ p 2 T,Ja ∼ (p T,J − p T,J,trim ) 2 .If there are multiple soft subjets, then S 2,soft (0; R trim ) ∼ a p 2 T,Ja < ( a p T,Ja ) 2 ∼ (p T,J − p T,J,trim ) 2 .This triangular inequality suggests that the magnitude of S 2,soft (0; R trim ) is small for a jet with a given p T,J − p T,J,trim when there are multiple soft jets.The positive w 2 on R < R trim means that Higgs jet has less soft subjets than QCD jet.The S 0 2,soft consists of the autocorrelation of soft subjets, which has different energy scaling behavior compared with the other S k 2,soft .The S 2,soft on R < R trim is S 2,22 ∼ O[f 2 trim ] in eq.(2.25).On the other hand, S k 2,soft (k ≥ 1) is dominated by S 2,12 .The S 2,12 on R < R trim does not contribute to S 0 2,soft because it vanishes.Therefore, we may rewrite h as follows, where w 2 and w 2 are continuous functions with w 2 (R trim ) = 0 and w 2 (R) = w 2 (R) + w 2 (R).The second term is essentially the same as dR S 2,22 (R)w (2) 2,22 (R) in eq.(2.24) and the last term is dR S 2,12 (R)w  Higgs jet vs QCD jet classification.However, the |w 2 | is much smaller.This comes from the fact that S 2,soft of sgluon jet is similar to that of QCD jet and it is less important in the classification.
No peak of S 2,soft around R R trim also indicates that the soft substructures of sgluon jet are as radiative as QCD jet.Additionally, there is no color coherence restriction of soft radiations for the sgluon jet.This leads to small |w 2 | for R > R J .In the right panel of figure 14, we show the weights for the Higgs jet vs sgluon jet classification.The peak of w 1 around R = Rb b is small as the hard substructures of Higgs jet and sgluon jet are (almost) the same.However, a sgluon is more radiative than a Higgs boson, and w 2 is negative in the entire region of R < 1.5.
As described above, weights w 1 and w 2 may take large values, but it does not necessarily mean that the corresponding S 2,trim and S 2,soft contribute dominantly in the jet classification.The energy scaling factors on the S 2,trim (S 2,soft ) and its weight w 1 (w 2 ) cancel out in the quantity of our interest h = k (S k 2,trim w k 1 + S k 2,soft w k 2 ).For example, O [1] terms in S 2,trim and O[f trim ] terms in S 2,soft contribute equally to the classifier if w 2 is around f trim w 1 .In the left panel of figure 15, we draw the mean values S k 2,trim w k 1 and S k 2,soft w k 2 of Higgs jet vs QCD jet classification, which are more directly related to the jet classification.The solid and dashed red lines correspond to the distributions of the Higgs jet, while the solid and dashed blue lines are for the QCD jet.The regions where Higgs jet and QCD jet distributions differ significantly are important for the network predictions.We find S k 2,trim w k 1 around R ∼ Rb b and S k 2,soft w k 2 in the region R < 1.2 mostly contribute to the jet classification.
The average distribution may not illustrate all the features of the classifier performance.The energy deposits in each bin fluctuate, and the bins with hits higher than the average value contribute more to the network decisions.For example, soft emissions outside the angle between the two hardest subjets are rare in the Higgs jet.Once there is large angle radiation outside the cone of hard subjets, the network is likely to identify the jet as a QCD jet.In the right panel of figure 15, we plot the S k 2,trim w k 1 and S k 2,soft w k 2 distributions of the Higgs jet (QCD jet) with ŷh (ŷ QCD ) higher than 95th percentile.The distributions indicate that the selected Higgs jets are mostly classified because of the S 2,trim excess at R ∼ R b b, while the QCD jets are classified using S 2,soft excess above R > 0.2.We now use the architecture to compare PY8 and HW7.As we have already shown in section 3, the performance of the classifier depends on the event generators significantly.In figure 16, we show the weights of the classifiers for p T,J = 350 GeV and m J = 115 GeV trained with a) Higgs jet, b) sgluon jet and c) QCD jet.For each plot, the signals are PY8 events and the background are HW7 events.The w 1 in R R J is close to zero everywhere, representing S 2,trim spectra of PY8 and HW7 events are similar.It is not surprising because both PY8 and HW7 events from identical hard partons and these partons create the trimmed subjets inside the jet.On the other hand, correlation involving constituents of the soft activities, S 2,soft , is manifestly different and so w 2 is nonzero.For Higgs jet and sgluon jet, the w 2 distribution of PY8 events is significantly large (and positive) for R ∼ R trim and it decreases as R increases.The weight w 2 is negative for R > R J , which means that HW7 events have more soft activity in the region R Rb b.
For the case of QCD jet, the distribution of w 2 is always positive and flat for R < R J and negative for R > 1.5.It would be interesting to evaluate the weights for the classifiers trained with the experimental data and compare with the simulated results to tune the parameters of the event generators further.insensitive to the choice, however, the weights of the S 2,soft are strongly affected.This behavior is expected as modeling of soft physics is quite different in Pythia 8 and Herwig 7.
The two-point correlation spectra and the architectures introduced in this paper can be applied for solving other interesting problems, thanks to flexibility on designing neural network.For jets with more complex substructures, e.g., top jet, the higher order terms in the energy flow series expansion may be included.It would be worthwhile to study the classifier performances when the network is trained with the experimental data and compare with the predictions of event generators to tune their parameters to reduce the uncertainty in modeling the soft physics.It is also interesting to use this interpretable architecture as a model-agnostic interpreter for black box architectures [87].We leave these possibilities for future works.

B Oversampling and p T,J -bias removal
The neural networks may learn the inherent p T,J difference among Higgs jet, sgluon jet, and QCD jet in figure 4 to classify them instead of learning the difference in their substructures.To penalize the learning from the p T,J distribution, we augment the training and validation samples by oversampling as follows, 1.The samples (of each class) are binned in p T,J with bin-width 1 GeV with b i entries in the i-th bin and b max = max{b i }.
2. For each bin, oversample the events so that the number of the bin contents becomes a certain value n max = c d b max .This oversampling is identical to repeating events in bin i sequentially for times and stop the oversampling when the number of events in the bin reaches n max .It may introduce a small bias due to unequal oversampling among different bins.We choose c d = 2 and ignore the small residual bias.
In the upper (lower) panel of figure 17, we show conditional probability density of the predicted class vector ŷh (ŷ σ ) for a given p T .The probability density has a mild dependence on p T,J which is originating from the interplay of the phase-space selection and the jet radius parameter.We obtain and pre-process the jet image as follows, 41.Recluster the jet constituents by k T algorithm with a jet radius parameter R J = 0.2.

C Jet Image and Convolution Neural Network
2. Set the center of (η, φ) coordinate to the leading (in p T ) subjet.
3. If a second leading subjet is found, rotate the jet constituents on (η, φ) plane about the jet center so that the sub-leading jet is on the positive y-axis.
4. If a third leading subjet is found, flip the image about y-axis when x coordinate of the subjet is negative.where bin k,l is the region of (k, l)-th bin.
The first convolutional layer deals with angular scale up to 0.3 to treat collinear radiations while the second convolutional layer operates up to 0.8.The outputs of Layer 4 are flattened into a onedimensional array and concatenated with a set of kinematic inputs, {p T,J , m J , p T,J,trim , m J,trim }.The flattened output is fed into two successive 1 × 1 convolutional layers [92] with 300, 100 filters respectively and ReLU activation function.The output is fed into a softmax layer to make a prediction.The training setup is the same as in section 3.

Figure 1 .
Figure 1.A jet image (left) and the corresponding S2 (blue) and S2,trim (red) spectra (right) of the leading jet of a pp → Zj event.In the jet image, a red triangle is a position of a parton level quark in the jet, and a green "+" shows a leading gluon emitted from the quark.

Figure 2 .
Figure 2. A jet image (left) and the corresponding S2 (blue) and S2,trim (red) spectra (right) of the leading jet of a pp → Zh event.The blue triangles in the jet image are positions of the parton level bottom quarks from the higgs decay, and a green "+" shows the position of a leading gluon emitted from a bottom quark.

Figure 3 .
Figure 3.A jet image (left) and the corresponding S2 (blue) and S2,trim (red) spectra (right) of the leading jet of a pp → Zσ event.The blue triangles in the jet image are positions of the parton level bottom quarks from the sgluon decay, and green "+" show the position of a leading gluon emitted from a bottom quark.

Figure 4 .
Figure 4.The distribution of p T,J (left) and m J (right) of the leading jet.The red, green, and blue solid (dashed) lines correspond to the Higgs jet, sgluon jet, and QCD jet of PY8 (HW7) samples, respectively.

Figure 6 .
Figure 6.The distribution of R rad for Higgs jet (red), sgluon jet (green), and QCD jet (blue) with an additional selection of Rsym > 0.85.The solid (dashed) lines correspond to the jets of PY8 (HW7) samples.

Figure 9 .
Figure 9. Ternary plots of the predicted label vector ŷ of the MLP for the Higgs jet (left), sgluon jet (center), and QCD jet (right).

Figure 10 .
Figure 10.The ROC curves of the binary classifiers: the MLP trained on S2,trim and S 2,soft (red dashed), the CNN trained on jet images (blue solid), and the two-level architecture (see section 4) trained on S2,trim and S 2,soft (green dotted) with PY8 samples.We show the results of Higgs jet vs QCD jet (left), sgluon jet vs QCD jet (center), and Higgs jet vs sgluon jet (right) classifications.

Figure 11 .
Figure 11.The ROC curves of the MLP trained on S2,trim and S 2,soft .The solid and dashed lines correspond to the classifier trained with PY8 and HW7 respectively.The red and blue lines correspond to the classifier tested with PY8 and HW7 respectively.We show the results of discriminating Higgs jet vs QCD jet (left), sgluon jet vs QCD jet (center), and Higgs jet vs sgluon jet (right).
the classification results because the components of softmax function are monotonic.We set w

Figure 12 .
Figure 12.A schematic diagram of a two-level architecture.An MLP trained on p T,J and m J generates weights w1 and w2 for analysing radiation patterns encoded in S2,trim and S 2,soft spectra.Double bordered boxes represent trainable modules.
Figure 13.The weights w k 1 (left) and w k 2 (right) of Higgs jet vs QCD jet classification.We show the weights for p T,J values in the range [300, 550] GeV while fixing the m J to 115 GeV.The weights at an angular scale smaller than 1.0 are magnified by 10.The value of w2 in the bin [0,0.1),i.e., w 0 2 , is approximately 16 for all the values of p T,J .

Figure 14 . 2 ( 2 (Figure 15 .
Figure 14.The weights w1 and w2 for the sgluon jet vs QCD jet (left), and Higgs jet vs sgluon jet (right) classifications.We show the weights at p T,J = 350 GeV and m J = 115 GeV.

Figure 17 .
Figure 17.Validation of the pT -bias removal on the training samples.Each horizontal line is a conditional probability density on a given p T,J range, i.e., the weight sum of each horizontal line is 1.Upper (lower) panel displays the conditional probability density histograms of the predicted class vector ŷh (ŷσ) for the Higgs jet (left), sgluon jet (center), and QCD jet (right).

Figure 18 .
Figure 18.A schematic diagram of a convolutional neural network trained on jet image.The double bordered boxes represent trainable modules.

5 .
Select jet constituents within [−1.5, 1.5] ⊗ [−1.5, 1.5].6.Finally, pixelate the jet constituents with pixel size 0.1 × 0.1.The (k, l)-th pixel intensity P k,l T is determined by the total transverse energy of the jet constituents present in a given pixel, i.e., P k,l T = i∈J p T,i I bin k,l ( R i ) = bin k,l d R P T ( R) (C.1)