Neural Network-based Top Tagger with Two-Point Energy Correlations and Geometry of Soft Emissions

Deep neural networks trained on jet images have been successful in classifying different kinds of jets. In this paper, we identify the crucial physics features that could reproduce the classification performance of the convolutional neural network in the top jet vs. QCD jet classification. We design a neural network that considers two types of substructural features: two-point energy correlations, and the IRC unsafe counting variables of a morphological analysis of jet images. The new set of IRC unsafe variables can be described by Minkowski functionals from integral geometry. To integrate these features into a single framework, we reintroduce two-point energy correlations in terms of a graph neural network and provide the other features to the network afterward. The network shows a comparable classification performance to the convolutional neural network. Since both networks are using IRC unsafe features at some level, the results based on simulations are often dependent on the event generator choice. We compare the classification results of Pythia 8 and Herwig 7, and a simple reweighting on the distribution of IRC unsafe features reduces the difference between the results from the two simulations.

The pattern of soft radiation is also important for the classification. For example, a color singlet boosted heavy particle has emission isolated in terms of soft activity unlike quark and gluon jets. The related substructure quantity has been incorporated in Higgs taggers [71,72] and top taggers [73]. Such soft particle distribution may also contribute to the jet classification using neural networks in order to improve the performance.
While the improvement using deep learning is impressive, the physics behind it has not been addressed. So far, the classifier based on a convolutional neural network (CNN) trained on the jet image performs well for selecting the top jets. It is numerically shown that the CNN uses IRC safe features mostly [74], but it is not easy to make an estimate of systematic uncertainties from various sources without knowing what kind of features of the jet is used in the model. Bayesian networks are capable of tracking those uncertainties [75,76], but it is also useful to identify the features in order to interpret the network outputs and uncertainties. The aim of this paper is to provide convenient parametrization of the jet feature contributing to the classification using jet images.
In this paper, we address the question in the following steps. In section 2, we first introduce a graph neural network [77][78][79][80][81] with constraints, and the network is more restrictive than CNN. The graph network accesses to only IRC safe two-point energy correlations [17,19,[62][63][64][82][83][84][85]. It was shown that the network has comparable performance to the CNN in the Higgs jet vs. QCD jet classification [19]. We use this network for top jet vs. QCD jet classification, and it is a good starting point toward the network whose top tagging performance is comparable to the CNN.
To integrate the IRC unsafe quantities to this framework, we formulate a sequence of novel morphological measures based on Minkowski functionals, in section 3. The sequence includes the number of pixels with finite energy deposit (active pixels), N (0) , the number of pixels that touch the active pixels, N (1) . These numbers can be considered as a discretized version of Minkowski functionals. They are formulated in a mathematical theory called integral geometry and describes geometric measures to the point distributions. The application of the Minkowski functionals has already been considered in the astrophysical analysis [86][87][88][89][90][91][92][93][94][95][96][97][98][99], and statistical mechanics [100][101][102]. We perform a morphological analysis to the distribution of soft activity in the jet.
When the first few elements of the Minkowski sequence are included in the graph network inputs, the new classifier has the same performance as the jet image CNN classifier, as shown in section 4. This means that the improvement of the CNN classifier comes from the geometric quantities of the pixels, and also it is summarized by just a few numbers of additional variables. Our result suggests that the CNN output is correlated to a few numbers of geometric quantities derived from the jet image.
In the collider study, event simulators are used extensively to estimate the signal and background distributions. The sequence of Minkowski functionals calculated from a jet image is IRC unsafe quantities, and the simulated data need to be calibrated by the experimental data. We propose an event reweighting method based on the IRC unsafe quantities for the calibration in section 5. We conclude in section 6.

IRC Safe Two-Point Energy Correlations and Relation Network
The jet classifier using a deep learning model trained on the jet image has achieved better performance compared with the other statistical methods, but it is not easy to identify the key features that contributed to the improvement. To identify the crucial features used in the classification, we consider flexible and interpretable quantities derived from the jet image and use them as inputs to a jet classifier modeled by a multilayer perceptron (MLP). Additional inputs are considered until the performance of the classifier is equivalent to that of the best classifiers using the jet image.  Figure 1: A schematic diagram of the graph representation of a jet used in this paper. Each vertex corresponds to a jet constituent, and a line between two circles represents the variable calculated from the two vertices. Each dashed rectangle represents a subjet that contains the enclosed jet constituents.
We first introduce two-point energy correlation spectra S 2 [17,19] as a function of the distance between the jet constituents R, where S 2,ab is a shorthand notation of S 2,JaJ b ; is an energy flow of a subjet J a of a jet J; a and b are indices of the subjet. The R ij is the relative angular distance between two constituents, (η i − η j ) 2 + (φ i − φ j ) 2 . The S 2,ab is an IRC safe quantity. For the Higgs jet vs. QCD jet classification, an MLP trained on the transverse momenta, masses, and S 2 's of the jet and trimmed jet performs nearly as good as a CNN trained on jet images. In [19], we relate S 2,ab to the generic jet classifiers through its formal expansion with respect to the energy flow. This shows that S 2,ab is flexible enough to describe many quantities for the classification of jets.
In this section, we first derive S 2,ab in terms of a vertex-labeled fully-connected graph to integrate them into a framework of the graph network and extend it for further ML analysis. A graph is a set of the points and the lines connecting between them, which are called vertices and edges, respectively. In our setup, each vertex of the graph corresponds to a jet constituent, and the inputs to the i-th vertex are the jet constituent momentum p i . The labels of a vertex denote the subjets to which the constituent i belongs. The graph network also has the other inputs u calculated from the given jet, for example, (sub)jet transverse momentum and mass. A schematic diagram of the graph is in figure 1. Each circle represents the jet constituent assigned to the corresponding vertex. The dot-dashed lines are the edges.
We use a kind of graph network called a relation network [79,80] that mainly utilizes correlations between two vertices. The reason for using this network is that the kernel of the parton shower model is 1 → 2 splitting of partons. The classifier can focus on the two-point correlations by using the relation network as a functional model. The classifier output u is the value of a functional model φ u applied to the edge outputsē ab , the vertex outputsp a , and the predefined inputs u. u = φ u (ē ab ,p a , u) .   The edge outputē ab is the aggregated two-point correlation between J a and J b , where φ e ab (p i , p j , u) is a functional model of a two-point correlation assigned on edge linking two jet constituents i and j. The vertex outputp a is the aggregated one-point correlation of J a , where φ v a (p i , u) is a functional model of a one-point correlation assigned to a vertex that corresponds to a jet constituent i. The correlationsp a andē ab are symmetric for the permutation of the jet constituents. We train u to be the logits for the classification. 1 We use energy correlators [60,69] for φ e and φ v to restrict u to be IRC safe. Namely, we consider the following IRC safe C-correlators forp a andē ab , where p T,i is the transverse momentum of the i-th constituent, R i = (η i , φ i ) is the pseudorapidityazimuthal coordinate of the i-th constituent. The functions w a and w ab are the angular weighting functions of one-point and two-point energy correlators, respectively. The last step of the equation comes from the assumption that the classifier does not depend on the absolute angular coordinates of the (sub)jet constituents but uses the relative angular distances. The last expression in eq. (2.8) can be written in terms of an integral [17], We may absorb the angular weighting functions w ab to φ u so that the S 2,ab and p T,Ja can be considered as effective inputs to the network.
This setup is equivalent to the one using S 2,ab as input, discussed in [17]. We now design a top tagger based on eq. (2.10). The structure of the graph is specified by the subjet label a and b of S 2,ab . We consider the following subjet labels for the top jet vs. QCD jet classification.
• the trimmed jet, J trim , denoted by h, • the compliment set of J trim , J \ J trim , denoted by s, • the leading p T subjet, J 1 , denoted by 1, • the compliment set of J 1 , J \ J 1 , denoted by c, Examples of the vertex-labeled graphs are in figure 2. Note that the following relations hold for S 2 and S 2,ab , Because S 2,ss contains only the correlations between soft constituents, which is theoretically unpredictable and less reliable experimentally, we define the following combinations as in [17].
Here, b is a bottom quark from a top quark decay, and q andq are quarks from the subsequent W boson decay. Figure 3(a) is the S 2,trim of the top jet that has those four peaks clearly. This pattern is relatively rare for QCD jets. Figure 3(c) is the S 2,trim of a typical QCD jet.
In the case where the characteristic angular scales of the top quark, R bq , R bq , and R qq , are close to each other, it is not possible to see all peak structures in the S 2,trim (R) distributions. Such an example is shown in figure 3(b), although the relative strength of the peaks in the S 2,trim distribution contains partial information of the three-prong structures. 2 The information of the three-prong substructure is more clearly encoded in S 2,11 , S 2,1c , and  figure 2(a). The intensity of S 2,cc is much smaller than S 2 because the subleading p T jets have small transverse momenta. The magnified distribution of S 2,cc is shown in the green histogram. The dashed lines are the characteristic angular scales at the parton level.
factorizes the identification of a three-prong structure into that of two-prong substructures and its relative position from the J 1 . Those S 2,ab in parton level are as follows, where i k is the k-th leading p T parton. Figure 4 shows that the two peaks are in S 2,1c and the other two peaks are in S 2,cc . Figure 5 is the case where values of R bq and R bq are similar. The S 2,cc distribution has a peak at R ≈ 0.6, and the peak intensity is comparable to that of the peak at R = 0 because the J \ J 1 has a two-prong substructure. In addition, the S 2,1c distribution suggests that the high p T constituents of J \ J 1 are away from J 1 by a distance of 0.5.

Morphological Analysis of Soft Emissions
The number of particles of top jets and QCD jets is significantly different. For the boosted top quark decaying hadronically, i.e., t → bW → bqq , the significant fraction of energy goes to color singlet W boson. The number of particles in a top jet is less than that of a gluon jet with the same jet mass and momentum, and the particles are concentrated near the quark directions. The number of active pixels in the jet image, N pixel , is correlated to the number of particles in the jet, and therefore, it should be a crucial quantity of the jet image in the classification. This quantity is IRC unsafe as E → 0 and depends on the physics at a low energy scale, and its accuracy of the theoretical prediction is limited. 3 Indeed PY8 and HW7 predict significantly different pixel distributions for gluon jets, even though they are tuned to the experimental data.
To generalize the idea of N pixel , we introduce a morphological analysis of soft emissions on jet images. We consider two morphological operations: dilation, and filtering. Let N (i) be a number of pixels in a dilated image, where V (i) is the Minkowski sum of the set of the (η, φ) coordinate vectors R i of the active pixels, V (0) , and a set of discrete coordinate vectors on a square for dilation, i.e., We denote a set of pixels whose centers belong to V (i) as P (i) . Note that P (0) is identical to the set of active pixels, and N (0) is N pixel . The set P (i) is then a cover of the jet image, i.e., it is a union of the squares that attached to each active pixel. The covers obey a recurrence relation that P (i) includes pixels in P (i−1) and those touching one of the edges or corners of P (i−1) . This morphological mapping is illustrated in figure 6. Note that N (i) proportional to the area A (i) of pixels in the cover P (i) because each pixel has the same area (∆R) 2 , i.e., . The most left figure shows the active pixels P (0) , the figure at the center shows the pixels whose centers are B (1) , and P (1) is shown in the right figure. Figure 7: Illustrations of P (i) for (a) an isolated pixel, (b) a line of pixels, (c) a 5 × 5 square of pixels, (d) a ring of six pixels. For each plot, black pixels belongs to P (0) ; dark gray, light gray, blue pixels are the difference P (i) \ P (i−1) for i = 1, 2, 3, respectively.
For the analysis of soft activity, we consider a filtered image whose active pixels have p T larger than E. Let N (i) (E) be the number of active pixels in the filtered image. If we choose sufficiently large threshold E, the number N (i) (E) is relatively stable against the choice of the event simulators. The difference between the values of N (i) and N (i) (E) will provide us geometric information about the soft activity. The sequence of N (i) gives a quantitative description of the spatial distribution of pixels in the jet. Before going into some mathematical background, let us capture the idea using simple examples. Consider the relations between N (0) and N (1) of figure 7(a), figure 7(b), and figure 7(c): 1. Active pixels are separated by two or more pixels.
This corresponds to the limit of sparse and scattered pixels.
2. Active pixels are aligned on a line.
This case is the limit when soft activities come from a very narrow color string between two quarks at each end.
3. Active pixels are clustered in a square.
QCD jet MG5+PY8+Delphes Figure 8: The Minkowski sum P (i) of the top jet image and QCD jet image. For each plot, black pixels belongs to P (0) dark gray, light gray, blue pixels are the difference between P (i) − P (i−1) for i = 1, 2, 3, respectively. This is the limit of a one-prong jet such as quark jet.
The ratio N (1) /N (0) in large N (0) limit is approximately 9, 3, 1, respectively. If pixel clusters appear at small angular scale, N (1) /N (0) is reduced. Therefore, N (1) /N (0) quantifies the level of isolation of the pixels. Figure 8 shows P (i) of top and QCD jet images in figure 2(b) and figure 2(c), respectively. One quick observation is that P (i) has some non-trivial structures for small i (i = 0, 1), but the pixels quickly merge into a single cluster as the index i increases. In the large angular scale (i ≥ 2), the only relevant physics for the top jet vs. QCD jet classification is the color charge of the parent parton, and the N (i) does not carry significant additional information. In the next section, we show that N (0) and N (1) are sufficient to describe the soft structure contributing to the top jet and QCD jet classifier modeled by CNN.
The analysis based on the pixels can be generalized to the particle level analysis with a continuous parameter R as follows. Let P(R) be a cover of particles on (η, φ) plane.
where B i (R) is a disk with radius R and whose center is the direction vector R i of a particle i. The area A(R) of the cover P(R) is a quantity related to N (i) , i.e., A (i) can be considered as a discrete analog of A(R). The change of A(R) with respect to R also quantifies the spacial distribution of particles. As far as all the disks are isolated, A(R)/(πR 2 ) is the number of particles. The ratio A(R)/R 2 decreases at the scale where the disks start overlapping. Therefore, the profile of A(R)/R 2 along R encodes all the distances between particles.
A more general description of these morphological measures can be obtained from the integral geometry. According to Hadwiger's theorem [103], any geometric measure that has a notion of the size of a polyconvex set in Euclidean space R d can be described by a set of functions called "Minkowski functionals." The polyconvex set is a finite union of closed and bounded convex bodies. More precisely, the geometric measure v should satisfy the following properties, • Valuation: v has a notion of size of a set. The value of v of the empty set φ is zero, i.e, v(φ) = 0, and v satisfies the following inclusion-exclusion property, where B 1 and B 2 are polyconvex sets.
• Invariance: v is invariant under any rotation and translation.
• Continuity: For any sequence of polyconvex sets B n that converges 4 to B, its valuation v(B n ) also converges to v(B).
Such geometric measures can be represented as a linear combination of d + 1 Minkowski functionals In d = 2, we have three Minkowski functionals: area, perimeter, and Euler characteristic. Since P(R) on (η, φ) plane is a finite union of closed and bounded convex bodies B i (R), its geometry can be described by the Minkowski functionals. We already discussed the area A(R) of P(R), and its perimeter L(R) and Euler characteristic χ(R) are also useful quantities. The discrete analog of L(R) and χ(R) can be used for analyzing jet images. The Minkowski functionals show that N (1) carries independent information to N (0) . We denote the perimeter and Euler characteristic of P (i) as L (i) and χ (i) . If all the active pixels of a jet image are isolated enough, we may represent 5 The relation between N (0) and N (1) is then as follows, Note that this relation only holds when the squares attached to active pixels do not overlap each other. Once some squares start to overlap, the relation begins to deviate, and the persistence of this relation can be used as a morphological indicator for the topological change of P (i) . Therefore, N (0) and N (1) are effective variables for analyzing the geometry of soft particles of the jet image. Figure 7(d) is an example that the sequence of P (i) shows a non-trivial topological change. The sequence starts with six isolated pixels, P (1) and P (2) are a ring, and P (3) is a single large cluster. The Euler characteristic χ (i) and the perimeter L (i) of P (i) are as follows.
The non-monotonic behavior of the sequence of Minkowski functionals for analyzing the topology of point distributions is often discussed in other literature [86,87]. Utilizing this topological information for jet classification problems or global event topology analysis might be interesting, but the full analysis of the sequence of the Minkowski functionals is outside the scope of this paper. Morphological analysis have been applied in physics to quantify the distribution of the objects. In [86,87], Minkowski functionals are used to identify the structure of the astrophysical objects. In more recent papers, persistence topology turns out to be useful tool for charcterizing seemingly random distribution of the points and applied in analysis of cosmic microwave background [104] and string landscape [105]. It is tempting to consider other roles of morphological analysis with Minkowski functionals in jet classifications.

Top Tagger based on Relation Network and Jet Morphology
In this section, we describe our setup of classifiers trained on the inputs discussed in the previous sections, S 2,ab and N (i) . These inputs are derivable from jet images, so the CNN performs better than those RNs in principle. We show that the deep learning on the small number of derived inputs reproduces the performance of the CNN. Therefore, those inputs are associated with the relevant physics for solving the classification problem. Moreover, the network using the derived inputs typically has less overfitting than that using the raw inputs.

Training Data and Model Implementation
We simulate top jet and QCD jet samples by Madgraph5 [106], followed by Pythia 8 (PY8) [107] or Herwig 7 (HW7) [108,109]. The detector response of generated events is simulated by Delphes [110]. Jets are reconstructed by the anti-k T algorithm with radius parameter R J = 1.0. The jet constituents are calorimeter towers with angular resolution approximately ∆R = 0.1. The details of the sample preparation are explained in appendix A.
We categorize the inputs to the RNs into the four sets: x trim , x J1 , x kin , and x geometry .
• x trim is a set of discretized S 2,trim and S 2,soft up to angular scale R = 1.5, 2,ab is the binned spectrum of S 2,ab , with bin size ∆R in order to keep the same angular resolution to the jet image, We may consider the angular scale up to the diameter 2R J = 2.0, but S (i) 2,trim and S (i) 2,soft at such large R are less useful [19].
• x J1 is a set of discretized S 2,11 , S 2,1c and S 2,cc as follows, Again, we consider spectra only up to the relevant angular scales. For S (i) 2,11 and S (i) 2,cc , the scale is the diameter of the corresponding subjet but too large angular scale is ignored. For S (i) 2,1c , the scale is the jet radius because it is the correlation between the core part J 1 and its surroundings.
• x kin is a set of global inputs, In addition to the transverse momenta, we include the masses as the inputs because 2m Ja /p T,Ja is a characteristic angular scale of J a .
• x geometry is a set of the numbers of pixels of the jet images P (0) and P (1) , We modularize the implementations of the model outputs u = φ u (x) to avoid the curse of dimensionality. When inputs are too many, there is a potential danger of overfitting due to sparsely distributed samples. In our previous work, we use ∼ 40 inputs for the classification of Higgs jets and QCD jets [17,19]. The inputs for the classification of top jets and QCD jets are increased to ∼ 70, and training of a simple MLP classifier on these inputs may have difficulties. Therefore, we compress x trim and x J1 to a smaller number of hidden variables h trim and h J1 by a neural network and get u from them. The following is the closed-form expression of RN S2 that uses only the IRC safe inputs: x trim , x J1 , and x kin .
where MLP a is a multilayer perceptron (MLP) and θ a are its trainable parameters. We provide x kin to each network to tell the characteristic angular scales directly. We use the exponential linear unit (ELU) [111] as the activation function of each MLP. The dimensions of h trim and h J1 are 5.
The output u is a dimension two vector and will be transformed into the softmax outputs for the binary classification purpose.ŷ When the geometric information x geometry is included in the inputs, we use them as arguments of MLP logit , We consider three additional relation networks that uses the geometric information: RN S2,N (0) , RN S2,N (0) ,N (0) (4 GeV) , and RN S2,N (0) ,N (1) . Their inputs are listed in table 1. The detailed implementations of these RNs are in appendix C.1. The softmax output is trained by minimizing the cross-entropy loss function. In addition, we marginalize the p T,J distribution in the classification because the top jet samples and QCD jet samples have different p T,J distributions. To do this, we train networks in a way that interpolates binary classifiers for the jets at given p T,J . The corresponding cross-entropy loss function L CE is as follows. where Y is a category label, y top = (1, 0), and y QCD = (0, 1). The function fx |p T ,J (x; Y ) is the conditional probability density ofx given p T,J , andx is x without p T,J . The integral can be approximated by a Monte-Carlo integration, where f p T ,J (p T,J ; Y ) is the probability density function of p T,J given Y , and the variables with superscript [i Y ] is the value at the i Y -th sample in the training dataset of Y . The probability density function f p T ,J (p T,J ; Y ) is modeled by kernel density estimation (KDE) described in appendix B. The resulting loss function is essentially a cross-entropy with samples whose p T,J distribution is reweighted to be uniform. In addition to this cross-entropy loss, L 2 regularizer L reg [112][113][114] with the weight decay constant λ = 0.001 is added to regularize MLP a .
where W a are the weights of hidden layers in MLP a . The training setup is as follows. We minimize the loss function L(θ) = L CE (θ) + L reg (θ) by ADAM optimizer [115] with learning rate 0.001, the first moment exponential moving average coefficient β 1 = 0.9, the second moment exponential moving average coefficient β 2 = 0.999, and stabilization constant = 10 −7 . Batched samples are used in order to reduce overfitting. The weights of the MLP are initialized by the He initializer [116], and the biases are initialized to be zero. We will use early stopping for the termination criterion, but there is a chance that the network is mildly overfitted to the validation dataset during learning the features of the rare events. If the gradients from the rare events distort the trained results for the dominant events, the network parameters have to be corrected again, and the training becomes noisy. The random overfitting to the validation sample occurs during this noisy learning on the rare events. To avoid this artifact, we use the exponential moving averagesθ (t) of the trainable parameters θ (t) at the epoch t for the validation and testing. The details of the moving average can be found in appendix D. We monitor the loss L tot (θ) of the validation samples during the training and terminate the training if the loss function does not improve during 50 latter epochs. The networks and training setup is implemented in Keras [117] with TensorFlow [118] backend. Optimization on the batch number is performed by the grid search. We iterate the training for batch numbers, 20, 50, and 100 and two different random number seeds.
The results of the RN-based classifier will be compared to that of a CNN-based classifier. The CNN model is similar to that of the previous paper [19] but with more nodes and layers. The closed-form expression of the CNN is as follows,  . Surprisingly, when we consider all the geometric inputs x geometry , the ROC curve of RN S2,N (0) ,N (1) is almost equal to that of the CNN. Therefore, the inputs x trim , x J1 , x kin , and x geometry can be considered as useful middle-level variables for modeling the top jet classifier. The reason for a big gap between the ROC curves of RN S2 and CNN is the difference in N (0) distributions between top jet samples and QCD jet samples. The QCD jets in this paper are leading p T jets of pp → jj so that they are mostly gluon jets, which have a large N (0) than a jet from a color triplet parton. In addition, PY8 predicts significantly higher N (0) of gluon jets than HW7, as in figure 10. Similar situations have been pointed out for the counting variables such as the charged track multiplicity [65], and the soft drop multiplicity [70].

Classification results
The situation may be compared with the classification of Higgs jets and QCD jets, studied in [17]. In this case, the difference between the ROC curves of RN S2 and CNN is tiny. QCD jet samples in the study are leading p T jets of pp → Zj with invisibly decaying Z boson, and most of the samples are the quark jets. The difference in the N (0) distribution of the Higgs jets and QCD jets is small, and therefore, N (0) does not play an important role there.
The remaining gap between the ROC curves of RN S2,N (0) and CNN is almost filled by including N (1) in the analysis. As discussed in the previous section, the ratio N (1) /N (0) is a morphological measure that quantifies the level of clustering of the pixels. Therefore, N (1) is useful for distinguishing compact top jets from QCD jets whose number of pixels is the same. The similarity of two ROC curves indicates that the information summarized in the Minkowski functionals is used in the jet image analysis. Not only RN S2,N (0) ,N (1) gives a comparable result to CNN, but it is also significantly stable. We compare the softmax outputŷ 0 of a network N and the output of the same network trained with a different random seed, and we call the alternative outputŷ 0 . The change of the seed affects the shuffling of the events between batches and alters the initialization of the network. Since the training of the neural network is not a convex optimization in general, the network output difference figure 11, we show the histogram of two outputs (ŷ 0 ,ŷ 0 ) for RN S2,N (0) ,N (1) and CNN. The distribution for RN S2,N (0) ,N (1) is narrower than that for CNN. This shows that training of RN S2,N (0) ,N (1) is more stable.
The better training stability of RN is due to the difference in the inputs of the functional model. CNN uses the jet image itself, while RN uses the derived inputs from the jet image. CNN could approximate a wider variety of functions of jet constituents than RN. In other words, the training of CNN requires more effort in order to scan over larger space of functions. The training of a simpler model is much stable than that of a complex model because of less number of inputs and trainable parameters. A simpler model has a potential danger of underfitting, but it is less severe in RN because S 2,ab and N (i) are reasonable set for describing functional space of energy correlation and geometry of the jet constituents, respectively.
We The standard deviation of ∆ŷ 0 [CNN, RN] is larger than the error σ(∆ŷ 0 (CNN)) 2 + σ(∆ŷ(RN)) 2 , which is 0.091 for the top jet samples and 0.095 for the QCD jet samples. This indicates that the outputs of RN and CNN are highly correlated, but there are still some differences. We repeat the same analysis on non-typical events, which satisfies 0.15 <ŷ 0 < 0.85 for one of RN or CNN. The results are similar, but the standard deviations are larger by a factor 1.5 because we removed samples easy to classify.
In order to understand the cases on which RN S2,N (0) ,N (1) and CNN gives us extremely different This type of event is certainly not typical. The probability that the angle R bq or R bq is less than 0.2 is 5.6% without considering spin-correlation. For the right jet image, CNN judges that the jet is a QCD jet, but RN does not. It is a two-prong jet with many soft radiations and a small p T subjet from a quark due to longitudinal decay of W boson. In the longitudinal decay, one of the quark goes backward to the boost direction, but these jets suppressed in the phase space. Because both of the top jets are not typical, it is not surprising the two different models give very different results for those events.   We have checked if more aggressive training on these rare events improves the performance, for example, relaxing the regularizer setup. The AUCs of RN and CNN with the weight decay constant λ = 10 −4 are 0.9461 and 0.9465, respectively. There are tiny improvements in the classification performance, but it comes together with overfitting. The validation loss L(θ) and training loss L(θ) are 0.3044, and 0.2969 for RN; 0.3201, and 0.2971 for CNN, respectively. The L(θ) and L(θ) in the original setup are 0.3076, and 0.3049 for RN S2,N (0) ,N (1) ; 0.3338, and 0.3371 for the CNN. The difference between the training and validation loss is much bigger in λ = 10 −4 setup, which is a sign of overfitting.

Reweighting Distributions of IRC Unsafe Morphological Features
In section 4, we perform the analysis using sample generated by PY8, but in this section, we compare the result with the analysis using another event generator and discuss the systematic uncertainties associated with simulations. Because event generation involves the modeling of soft radiation, the generated events are model-dependent, and the simulator has to be tuned to experimental The question is how precisely these simulated events should agree with the data. For the analysis mainly using high p T objects, the effects of soft physics are small. On the other hand, a neural network based jet classifier trained on jet images are capable of utilizing the pattern of soft radiation. If the agreement between observed and simulated data are "sufficiently good", we could rely on the simulated data.
In reality, there are yet significant deviations between the MC predictions and experimental data, and the calibrations are necessary. Because we know that the less controlled IRC unsafe quantities, such as N (0) and N (1) , play an important role in the classification, we focus on calibrating the difference between the experimental and simulated data of those quantities. To see the systematical error coming from the mismodeling of the parton shower and hadronization, we perform the same classification analysis with different event generators and compare the results. We choose HW7 and PY8 for the comparison. The two event generators are quite different in modeling of the soft and collinear radiations. HW7 uses the angular-ordered shower [119] and the cluster hadronization model. [109,120]. 7 PY8 uses p T ordered-shower [121] and the string model of hadronization [122,123]. The comparison of the radiation pattern of QCD jets is available in various literature [66,124,125]. The prediction of the gluon jet distributions differs significantly in each simulator while it more or less agrees with each other for the quark jets. It is pointed out that prediction is sensitive to the color reconnection modeling.
In figure 14, we show the (N (0) , N (0) (4 GeV)) distributions of the QCD jet samples. The N (0) distribution simulated by PY8 is broader than that simulated by HW7. The tail of the N (0) distribution exceeds 60 for PY8, but it vanishes at there for HW7. On the other hand, the N (0) (4 GeV) distributions of PY8 and HW7 are similar, as shown in figure 10. The active pixels with p T > 4 GeV correspond to the particles from high p T partons in the shower. Predictions on those partons in the two generators tend to agree, and the predicted N (0) (4 GeV) distributions are also similar. The N (1) /N (0) distributions of PY8 and HW7 are also similar, as shown in figure 15. Therefore, N (0) should play an important role in the classification. The separation of the top jets and QCD jets is worse for HW7 compared with PY8 discussed in previous sections. The AUC of the top jet vs. QCD jet classification predicted by HW7 is smaller than that predicted by PY8. In figure 16, we show the ROC curves of each classifier trained on HW7 events. The performance of the RN S2 is similar to that trained on PY8 events. Once N (0) is additionally considered in the classification, the performance is improved. However, the improvement from adding N (0) is significantly small in HW7, because the N (0) distributions of top jets and QCD jets are close, as shown in figure 10.
In the previous analysis, we have shown that inputs N (0) , N (0) (4 GeV), N (1) , and N (1) (4 GeV) in addition to S 2,ab is good enough for building a neural network that fits the CNN output. At the same time, this indicates that tuning of the event generator focusing on these counting variables can be an efficient way to obtain the simulated data that gives consistent results with the experimental data.
If the difference between the simulated and experimental data is not too large, reweighting simulated events is useful for reducing the difference. We consider reweighting based on the marginal distribution of interested variables x. Let ρ true (x) and ρ MC (x) be the x distributions with true and simulated events, respectively. The new weight w new of the event i Y is given as follows, old is the weight before reweighting. Let us perform an exercise to correct (N (0) , N (0) (4 GeV)) distribution, assuming that either one of the distributions ρ PY8 and ρ HW7 simulated by PY8 and HW7 is ρ true while the other is ρ MC . We consider the reweighting of these two variables in order to consider a non-trivial case that some of the variables are correlated. The reweighting factor ρ true (N (0) , N (0) (4 GeV))/ρ MC (N (0) , N (0) (4 GeV)) is calculated using the normalized histogram of (N (0) , N (0) (4 GeV)), as stated in appendix E. The N (1) distribution in figure 17 still disagree after the reweighing, but the deviation is minor. Because the sample size is limited, we do not attempt to correct all those distribution in this paper. Figure 18 shows theŷ 0 distributions for QCD jet samples, of the models trained on PY8 samples. The orange dashed, black solid, and green dot-dashed histograms are theŷ 0 distributions with HW7, PY8, and reweighted HW7 samples, respectively. Figure 18(a) shows theŷ 0 distributions of RN S2 . This classifier does not use N (0) and N (0) (4 GeV) explicitly, but the distribution of reweighted HW7 samples comes quite close to that of PY8 samples. The score difference comes from the difference of S 2 distribution. The S 2,soft distribution of HW7 samples is 10% smaller than that of PY8 samples. The reweighting reduces the difference because N (0) and S 2,soft are correlated. The Pearson correlation coefficient between N (0) and S for both PY8 and HW7. 8 The bin-by-bin ratio of the average S (i) 2,soft between HW7 and PY8 samples is about 0.9. The average S (i) 2,soft of HW7 samples increases after the reweighting and the S 2,soft distributions get closer to each other. On the other hand, the reweighting increases the disagreement of p T,J distribution. The sum of the weights of the reweighted HW7 samples with p T,J ∼ 500 GeV is about 20% larger than that of PY8 samples. We marginalized p T,J during the training so that the impact on theŷ 0 distribution is minimal. Therefore, the agreement seen in figure 18(a) is mainly due to the correction of S 2,ab , and it is encouraging. Figure 18(b) shows theŷ 0 distributions of RN S2,N (0) ,N (0) (4 GeV) . Forŷ 0 ∼ 1, the ratio of theŷ 0 distributions of HW7 and PY8 exceeds 4, and the distribution of HW7 samples even peaks nearŷ 0 ∼ 1. This means that the model trained on PY8 samples focuses on a particular region in order to get high purity top samples, but the HW7 samples are still populated in the region. In the situation that the HW7 distribution is "true" while PY8 samples are used to build the top jet vs. QCD jet classifier, we overestimate the top quark event rate by dijet contamination; adding the variables whose "true distributions" are not well understood could cause the problem of this kind.
The ratio between weighted HW7 and PY8 distributions is constant. This is nice in order to avoid the systematics along with tightening the cut to reject QCD events. On the other hand, the ratio of the reweighted HW7 samples is much larger than that in figure 18(a). The deviation should come from the mismodeling of the correlation between N (0) and other parameters. The difference is even larger if one includes N (1) in the inputs, as shown in figure 18(c). The ratio between the weighted HW7 and PY8 sample is now nearly a factor of two larger at y ∼ 1 and even increasing. This disagreement is not surprising given the very poor sample of HW7 in the high N (0) region. Finally, figure 18(d) is theŷ 0 distribution of the CNN model. The distributions looks quite similar to those in figure 18(c) before reweighting, but the ratio of the distributions of reweighted HW7 events and PY8 events is larger than that of RN S2,N (0) ,N (1) . Figure 19 shows theŷ 0 distributions of the model RN S2,N (0) ,N (0) (4 GeV) trained on HW7 events. Recall that the QCD jets in PY8 samples cover the phase space of the QCD jets simulated by HW7, and the reweighting is then effective for transforming the PY8 samples to HW7 samples. The opposite is not true because there are QCD jets which are not in HW7 generated samples. The reweighting is not exact because we have only a small number of events in some phase space region, and we see some deviation inŷ distribution, as shown in figure 18(b). If one wishes to describe real data by assigning an appropriate weight for each simulated events, it is better to use a generator setup that covers wider phase space so that we can correct the event distribution by using experimental data afterwords.  Figure 19: The score distribution of PY8 and HW7 test sample for the model trained by the HW7 events.
The disagreement between PY8 and HW7 samples remains after the reweighting in this exercise. We do not proceed to reweight distributions other than the (N (0) , N (0) (4 GeV)) distribution in this paper because of the statistical limitation. Neural network-based reweighing [126] can be helpful for adjusting full phase-space, but it is beyond the scope of this paper. The difference between the two generators is too large to achieve perfect agreement simply by reweighting. Because N (0) and N (1) are important quantities for describing the neural network-based classifier, those generators may be tuned carefully to reproduce the distribution of soft activities in jet images.

Discussions
In this paper, we have identified essential quantities that the CNN on a jet image is using for the top jet vs. QCD jet classification. The discovered quantities consist of both IRC safe and IRC unsafe observables. The former includes the IRC safe two-point energy correlation, jet spectrum, as a function of the distance between two jet constituents. The latter is an IRC unsafe Minkowski sequence inspired from the Minkowski functionals that describes morphological information on the set of jet constituents. It gives a quantitative measure of the area that is occupied by the particles inside jets. The first element of the Minkowski sequence is the number of active pixels in the jet image, N (0) , and the second element N (1) is the sum of the N (0) and the number of the pixels adjacent to the active pixels. These quantities are derivable from a jet image, and the relation network (RN) trained on these quantities (along with kinematic observables) has equivalent performance to the CNN.
The IRC safe quantities are theoretically more controlled, especially different event simulators predict consistent distributions. On the other hand, the IRC unsafe quantities are described by phenomenological models tuned by the experimental data. The classification performance of RN agrees with that of CNN only when we include IRC unsafe Minkowski sequences among the inputs. The similarity of the performance indicates that the top jet classifier based on CNN uses the geometric information of soft radiation, and we have succeeded in reproducing the CNN predictions using fewer degrees of freedom.
We also point out that the training of the RN is more stable than the CNN. The stability comes from the fact that the RN classifiers use a restricted set of derived inputs from the jet images, and the loss function of the RN is less complicated than that of CNN. We measure the variation of the training results by randomly swapping the event orders in the batch training and using a different initial parameter in the networks. The variation of the RN output is about factor 3 smaller than that of the CNN output, as we have seen in table 2.
As the IRC safe inputs, we choose the jet spectrum [17,19], which is aggregated two-point energy correlation as the function of the ∆R. We introduce the various improvements on the jet spectrum from the previous paper. In this paper, it is derived from a constrained graph network. A vertex of the graph network corresponds to a jet constituent, while each vertex carries information of the constituent momentum and the subjet ID to which the constituent belongs. The edges links between two vertices and represent the two-point correlation between the two constituents. For the classification of top jets and QCD jets, we find that the correlation among the trimmed jet and the correlation between the leading subjet and the other constituents, and their geometry are especially useful in the classification. We systematically include the three-point structure of the top quark in the two-point energy correlations after removing the leading subjet. The modularized networks process the two-point correlations separately with global kinematical inputs so that the combined network accepts significantly more inputs without inflating the parameters in the hidden layers.
The classifiers using the IRC unsafe quantities, such as soft pixels of jet image or the Minkowski sequence, could suffer from systematic uncertainties of the simulation. After the identification of the key morphological quantities, we can minimize efforts on calibration by focusing on the N (0) and N (1) distributions. The distributions may be corrected relatively easily by reweighting events to calibrate the distributions to the observed data. We demonstrate that the reweighting of the simulated events to reproduce the true N (0) distributions greatly reduce the systematic error of the classifiers. Such tuning of the data reduces the systematic uncertainties in the ML classifications that depend on the simulated events.
In summary, we propose an approach to replace a complex neural network using the low level inputs into a simple network using the processed inputs motivated from a physics point of view. To this end, we show surprising evidence that the CNN output depends on the geometrical measures expressed by discritized version of the Minkowski functionals. These morphological quantities improve jet classification significantly. The study of jet morphology from the data, and comparison to the prediction from event simulation might be an exciting direction to persuade. We think the variables may be further extended not only for jet physics but also for the analysis of event geometry or anomaly searches.

A Setup for Monte-Carlo Event Simulation
We generate pp → tt and pp → jj events for top jet and QCD jet samples, respectively. The symbol j represents gluon or (anti-)quark other than the top quark. The parton level events are generated by Madgraph5 2.6.6 [106]. The center of mass energy is 13 TeV. Produced top quarks are forced to decay into bW and the subsequent W boson decays into two quarks including b-quarks. Since we are only interested in boosted top quarks, we generate events with outgoing partons whose p T is larger than 450 GeV. Numbers of the generated pp → tt and pp → jj events with this preselection are 5 million and 10 million, respectively. The renormalization and factorization scales are set to be H T /2, where H T is the sum of the transverse energy of each parton, and the parton distribution function is NNPDF23 lo as 0130 qed [127][128][129][130]. Two parton shower and hadronization simulations are considered in this paper: Pythia 8.226 [107] with Monash tune [131] and Herwig 7.1.3 [108,109] with default tune [132,133]. The pile-ups are not included but the underlying events and multi-parton interactions are considered.
We use Delphes 3.4.1 [110] for detector simulation with its default ATLAS detector configuration. Jets are reconstructed from calorimeter towers whose (η, φ) resolutions at electromagnetic and hadronic calorimeters in |η| < 2.5 are assumed to be (0.1, 10 • ) and (0.0174,1 • ), respectively. Anti-k T jet clustering algorithm [134] with radius parameter R J = 1.0 implemented in fastjet 3.3.0 [135,136] is used to cluster these calorimeter towers into jets. The leading p T jets with its transverse momentum p T,J ∈ [500, 600] GeV and mass m J ∈ [150, 200] GeV are selected for the analysis. In addition, a top jet sample is required to have quarks from the originating top quark within R J from the jet axis. After this selection, we have about 950,000 top jets and 350,000 QCD jets. Half of them are used for the training and th others are used for testing. For jet trimming, we use k T algorithm [137,138] with radius 0.2 and keep subjets whose energy fraction is larger than 0.05. The leading p T subjet J 1 is the highest p T anti-k T subjet [134] with radius 0.2.
Note that we have not used matched sample, so that the modeled p T,J distribution is not precise beyond the leading order accuracy. Nevertheless, the changes due to recoiling from extra radiation are not a main interest in this paper, so we use this samples by presuming that the top jets and QCD jets are factorizable.

B Kernel Density Estimation of p T,J Distribution
We use the kernel density estimation (KDE) on a finite interval [p min T,J , p max T,J ] to model the event-byevent weight f p T ,J (p T,J ; Y ) in eq. (4.13). First, we transform p T jet into a logit t(p T,J ) in order to make the domain unbounded.
The KDE of the sampled logits, t(p T,J ) is used to estimate the probability density function f P T ,J (p T,J ; Y ).
T,J )), t (p T,J ) = where K h is a scaled kernel whose bandwith parameter is h. In particular, a gaussian kernel with bandwith h = 0.25 is used for the KDE.
However, t (p T,J ) is singular at p min T,J and p max T,J , and the estimation of the probability density near the boundary is less precise. Instead of using samples after the selection p T,J ∈ [500, 600] GeV, we use a selection with broader p T,J range, [450, 650] GeV for KDE only in order to avoid the effects from the singularities. The KDE is then normalized for p T,J ∈ [500, 600] GeV afterward. We show the normalized histogram of p T,J and the KDE in figure 20.

C.1 Relation Network
The relation networks used in this paper are implemented as follows. The module for analyzing the energy correlation with jet trimming, h trim = MLP trim (x trim , x kin ), consists of two hidden layers, where z i is the standardized inputs of x i , and FC is a fully-connected layer with a given output size and activation function. Note that we do not apply L 2 regularization for the FCs with linear activation. The module for analyzing the energy correlation of J 1 and J \ J 1 is as follows. The logits u for the binary classification is implemented as follows. For the relation networks with inputs x geometry , we change eq. (C.7) as follows.

C.2 Convolutional Neural Network
The convolutional neural network for the jet image analysis consists of six convolutional layers with a filter size 3 × 3. The regularized jet images are obtained as in [19], and the energy deposit of each pixel is standardized. The standardized p T map z image of x image is fed into a chain of convolutional layers as follows. where CONV is a two-dimensional convolutional layer with a given filter size and activation function, and POOL is a max-pooling layer with a given pool size. Again, we simply put h CNN to MLP logit by replacing eq. (C.7) as follows.

D Updating Trainable Parameters with Moving Averages
The moving average of a network parameter in section 4.1 is evaluated as follows. An updated parameter θ (t) at an epoch t is accumulated into a moving averageθ (t) , where α = 0.9. We accumulate only the updated parameters at the epochs after t 0 = 50. The solution to the recurrence relation is as follows, As a side effect of the epoch selection, the sum of the weights in the average is not 1. As θ (u) approaches its optimum θ 0 ,θ (t) approaches to (1 − α t−t0+1 )θ 0 . The factor 1 − α t−t0+1 should be corrected to make the moving average also converging to θ 0 . We use the following unbiased moving averageθ (t) of the sequence of θ (t) for the validation and testing, 3)

E Evaluation of the Reweighting Factor
In section 5, we reweight the HW7 generated events to PY8 generated events by using (N (0) , N (0) (4 GeV)) distribution. Since the two numbers are correlated as shown in figure 14, we transform the data first and calculate the reweighting factor using normalized histograms in order to ensure the efficiency of the reweighting. The transformation of (N (0) , N (0) (4 GeV)) is defined as follows. where c 1 = 3/2, c 2 = 2, and c 3 = −1/60. For each event, the reweighting factor in eq. (5.1) is calculated by the ratio of the corresponding bin values, ρ PY8 /ρ HW7 , where ρ A is the bin value of (x , y ) histogram with events generated by A. The reweighting factor for PY8 generated events to obtain distributions of HW7 generated events can be obtained by a similar procedure.