Infrared Safety of a Neural-Net Top Tagging Algorithm

Neural network-based algorithms provide a promising approach to jet classification problems, such as boosted top jet tagging. To date, NN-based top taggers have demonstrated excellent performance in Monte Carlo studies. In this paper, we construct a top-jet tagger based on a Convolutional Neural Network (CNN), and apply it to parton-level boosted top samples, with and without an additional gluon in the final state. We show that the jet observable defined by the CNN obeys the canonical definition of infrared safety: it is unaffected by the presence of the extra gluon, as long as the gluon is soft or collinear with one of the quarks. Our results indicate that the CNN tagger is robust with respect to possible mis-modeling of soft and collinear final-state radiation by Monte Carlo generators.


I. INTRODUCTION
Events at the Large Hadron Collider (LHC) contain large numbers of jets. The jets can be classified into four types, according to their origin: (i) Light-quark jets, initiated by u, d, s or c quarks; (ii) Gluon jets; (iii) b-quark initiated jets; and (iv) jets created by a hadronic decay of a highly boosted massive object, such as a W/Z boson, Higgs, or top quark. In the latter case, hadronic showers created by each of the partons overlap, and standard jet reconstruction algorithms recognize them as a single merged jet. A jet classification algorithm, or "tagger", attempts to reconstruct the origin of each individual jet, based on the information accessible to the experiment, i.e. detector-level data. Recently, there has been strong interest in applying modern machine-learning techniques, such as Neural Networks (NNs), to the jet classification problem. This is motivated as follows. The pattern of energy deposits in individual hadron calorimeter (HCAL) cells can be thought of as a two-dimensional image of the jet. Jets of each type have a characteristic shower history, resulting in differences in spatial distribution of energy inside the jet, often called "jet substructure". The jet classification problem is thus mapped onto a 2D image recognition problem [1]. Application of NNs to image recognition is a well-developed field of computer science. Advanced NN-based image recognition techniques have been applied to jet classification problems in Monte Carlo (MC) studies, with highly promising results [2-19] (for a review, see [20]). For example, NN-based top taggers have been shown to outperform traditional top-tagging algorithms currently in use by the LHC experiments.
Can NN-based taggers trained on MC samples be used in real data analysis? The answer hinges on whether the features of jet substructure that are identified by the NN as important for classification are in fact accurately modeled by the MC generator. This is a non-trivial issue. Parton showering cannot be described by fixed-order perturbation theory, since soft and collinear parton splittings suffer from infrared (IR) singularities. As a result, MC predictions of energy distribution within jets, in particular on small angular scales, suffer from significant (and poorly quantified) theoretical uncertainties. At the same time, unlike traditional taggers, the highly non-linear, multi-variable nature of the NN tagger output makes it very difficult to identify the specific features in the jet substructure that the NN focuses on, let alone assess their robustness in the simulation. To date, this issue has been addressed by cross-comparisons of NN taggers trained on samples produced by different MC generators, which employ different algorithms to model parton showers (see, e.g., Refs. [2, 5]). While the results seem to indicate that the NN output is robust, a deeper understanding of this issue is clearly desirable to put this approach to jet tagging on a firm foundation [21].
Traditionally, observables in jet physics are thought to be robust with respect to uncertainties in parton shower modeling if they satisfy the requirement of Infrared (IR) Safety. The notion of IR safety applies to parton-level events. An observable $O$ is IR safe if a soft or collinear splitting of one of the partons leaves $O$ unchanged:
$$O_n(p_1, \ldots, p_i, p_{i+1}, \ldots, p_n) \to O_{n-1}(p_1, \ldots, p_i + p_{i+1}, \ldots, p_n) \qquad (1)$$
whenever $p_{i+1}$ becomes soft or collinear with $p_i$. For example, consider the two events shown in Fig. 1. In the limit when the gluon in Fig. 1 (b) becomes either soft ($p_{T,g} \to 0$) or collinear with one of the quarks ($p_g \cdot p_i \to 0$), the value of $O$ evaluated on the final state (b) should approach its value evaluated on the final state (a).
arXiv:1806.01263v1 [hep-ph] 4 Jun 2018
An NN tagger is an observable that maps the matrix of energy deposits in individual HCAL cells onto a number between 0 and 1, the "topness" of the jet. The goal of this paper is to check whether this observable is IR safe. We perform this test in the particular context of a Convolutional Neural Network (CNN) top tagger. The CNN is first trained on particle-level (showered and hadronized) MC samples of boosted top jets and "QCD" (light quark/gluon) jets. We then apply this CNN to parton-level hadronic top events. This defines a parton-level observable, to which the above canonical definition of IR safety can be applied. We study the behavior of this observable as a function of the gluon momentum and collinearity in the $(t \to 3q) + g$ sample shown in Fig. 1 (b). Our numerical results strongly support the hypothesis that the CNN output is IR-safe. The training process appears to result in a network that largely disregards small-scale angular features in the energy distribution inside the jet, making the CNN tagger robust with respect to the modeling of such small-scale features in MC generators. Such robustness is a necessary pre-condition for practical applicability of MC-trained NN taggers, and it is highly reassuring that it is satisfied.
All image recognition algorithms have to confront issues of image quality and distortions from various sources such as image blur and noise. To be useful in the real world, an algorithm must be robust with respect to such distortions. The issue of IR safety is similar. The CNN tagger essentially ignores small energy deposits and very small-angle features of the energy distribution, at least for the top sample. Interestingly, this feature of the tagger did not need to be engineered, but rather emerged automatically from the standard training process.
The rest of the paper is organized as follows. In Section II, we discuss the architecture of the CNN tagger, its training and performance on particle-level MC samples. We also describe the parton-level "merged" and "unmerged" samples used for numerical tests of IR safety of the tagger. The main results of the analysis are presented in Section III, which contains the evidence to support our claim that the CNN observable is IR-safe according to the canonical definition, Eq. (1). Discussion of the results and conclusions are contained in Section IV.

II. NEURAL NET TAGGER AND EVENT SAMPLES
The top tagger used in this study consists of a Convolutional Neural Network (CNN), which has proven to be one of the best performers in problems of pixelated image recognition. The CNN architecture is known to produce robust identification of translationally invariant features of a priori unknown size. In our case, the features of interest are subjets, which clearly play an important role in top tagging. The network architecture is schematically shown in Fig. 2. We used the mxnet software package for implementation of the CNN [22] on an NVIDIA GeForce GTX 1080 GPU. The input layer of the CNN is the Hadronic Calorimeter (HCAL), modeled as a set of 30 × 30 square pixels of size $(\Delta\phi, \Delta\eta) = (0.1, 0.1)$. The pixels are populated by the normalized energy deposited in each bin by a jet, preprocessed according to the procedure used in Ref. [2]. Preprocessing places the center of the jet at the center of the HCAL image, and rotates the jet so that the principal axis always has the same orientation. In this way, overall translational and rotational symmetries of the jet are factored out and do not need to be learned in the training process [23]. The next layer consists of thirty "filters", which are convolved with the input image according to
$$o(x, y) = \sum_{x', y'} w(x', y')\, I(x + x', y + y'),$$
where $I$ is the input image, $w$ are the filters, and $o$ is the output of the convolution operation. The individual weights of the filters, $w(x, y)$, are determined during training of the CNN, using back-propagation methods.
Each filter is meant to learn some distinguishing features that separate signal from background. The outputs of these layers are subsampled and convolved further with a different set of filters, as shown in Fig. 2. Ultimately, the final fully connected layer produces a single output, the "topness" of the jet $Y \in [0, 1]$, with $Y = 1$ corresponding to a boosted top jet and $Y = 0$ corresponding to a QCD jet.
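The convolution step described above can be illustrated with a minimal pure-Python sketch. It assumes the sliding-window (cross-correlation) convention common in deep-learning frameworks and no padding; the 3 × 3 filter size and the toy image are hypothetical choices made only for illustration, since the text does not specify them, and this is not the authors' mxnet implementation.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) without padding.

    Implements o(x, y) = sum_{i, j} kernel[i][j] * image[x + i][y + j].
    """
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for x in range(n_rows - k_rows + 1):
        row = []
        for y in range(n_cols - k_cols + 1):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[x + i][y + j]
            row.append(total)
        output.append(row)
    return output


# Toy 30 x 30 "calorimeter image" with a single unit energy deposit.
image = [[0.0] * 30 for _ in range(30)]
image[15][15] = 1.0

# Hypothetical 3 x 3 filter with all weights equal to one.
kernel = [[1.0] * 3 for _ in range(3)]

feature_map = convolve2d_valid(image, kernel)
```

With these inputs the feature map has 28 × 28 entries, and the single deposit is spread over the 3 × 3 block of output pixels whose window covers it. In a trained network the filter weights would be learned rather than fixed by hand.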
To train the network, we use the MC samples of particle-level top and QCD jets that were previously used in Ref. [2]. In particular, we use a sample of jets reconstructed using the anti-$k_T$ algorithm with a large jet cone of $R = 1.0$, with $p_T$ in the 800 − 900 GeV range. We further require that the jet mass ($m_J$) be in the range of 130 − 210 GeV. A majority of top jets would fall in this mass range, while most QCD jets would be rejected by the $m_J$ requirement. The top and QCD jets passing these basic cuts are preprocessed as described in [2], and provided as inputs to the CNN. For training, weights were initialized using Xavier initialization, and the Adam optimizer was used with a learning rate of 0.05 and a dropout regularization rate of 0.0001. Training ran for 100 epochs over the training samples, using minibatches of 1000 events. After training, the CNN performance was evaluated using the test samples, generated and preprocessed in the same way as the training samples. The results are shown in Fig. 3. The results are not very sensitive to the choice of hyperparameters during training. The CNN output provides a clear separation between the two types of jets. For interesting top tagging efficiencies, the mistag rate is reduced by a factor of almost 2 compared to traditional observables, providing further improvement upon the two-hidden-layer multilayer perceptron type deep neural network used in [2].
As explained in the Introduction, the notion of IR safety applies to observables defined on parton-level events. The observable we want to study is the output of the CNN, the "jet topness" $Y \in [0, 1]$. The CNN maps a set of energy deposits in HCAL cells into this observable: $I(i, j) \to Y$. A parton-level event of the type shown in Fig. 1 is trivially mapped into $I(i, j)$ by identifying each parton's location in the $(\eta, \phi)$ space, and assigning the value of that parton's energy to the corresponding HCAL cell. This defines the action of the CNN on parton-level events, which can be thought of as a map
$$O: \{p_i\}_{i=1}^{N_p} \to Y \in [0, 1],$$
where $p_i$ are the parton 4-momenta, and $N_p$ is the number of partons in the event. We would like to study whether the IR safety criterion, Eq. (1), applies to this map. While $O$ is a completely well-defined function, it is horrendously complicated and highly non-linear, making an analytic study of its limits impractical. Instead, we will check the IR safety criterion numerically. To this end, we used MadGraph [24] to generate a parton-level sample of hadronically decaying top quarks, with an additional gluon in the final state, as in Fig. 1 (b). (To avoid unnecessary complexity, we simulate a process with no other colored particles in the final state.) In this simulation, cuts on the gluon momentum and its separation from each quark must be imposed to avoid infinities associated with soft and collinear singularities. Since we are primarily interested in precisely the gluons in the soft and collinear regions, the cuts we impose are very low: $p_T \geq 5$ GeV, $\Delta R_{qg} \geq 0.05$. One may question whether a fixed-order simulation correctly approximates the cross section for such low values of $p_T$ and $\Delta R_{qg}$. For our purposes, however, this question is irrelevant. We want to study how the CNN response is affected by the presence of a soft or collinear gluon, and the purpose of the simulation is simply to provide a sample of such events; we do not use any information about their overall cross section or phase-space distribution.
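The mapping from a parton-level event to a calorimeter image described above can be sketched as follows. This is a minimal illustration, not the authors' code: the grid size and pixel size are taken from the text, while the function name, the choice of image center, and normalizing by the total energy are assumptions. The preprocessing of Ref. [2] (centering on the jet and rotating to a fixed principal axis) is deliberately left out.

```python
import math

N_PIX = 30       # pixels per side, as quoted in the text
PIX_SIZE = 0.1   # pixel size in both eta and phi, as quoted in the text


def partons_to_image(partons, eta_center=0.0, phi_center=0.0):
    """Fill a N_PIX x N_PIX image from a list of (energy, eta, phi) partons.

    Each parton deposits all of its energy in the single pixel containing
    its (eta, phi) position. Dividing by the total energy is an assumed
    normalization convention.
    """
    image = [[0.0] * N_PIX for _ in range(N_PIX)]
    half_width = 0.5 * N_PIX * PIX_SIZE
    for energy, eta, phi in partons:
        d_eta = eta - eta_center
        # Wrap the azimuthal difference into [-pi, pi).
        d_phi = (phi - phi_center + math.pi) % (2.0 * math.pi) - math.pi
        i = math.floor((d_eta + half_width) / PIX_SIZE)
        j = math.floor((d_phi + half_width) / PIX_SIZE)
        if 0 <= i < N_PIX and 0 <= j < N_PIX:
            image[i][j] += energy
    total = sum(sum(row) for row in image)
    if total > 0.0:
        image = [[cell / total for cell in row] for row in image]
    return image


# Two made-up partons, used only to exercise the function.
toy_partons = [(300.0, 0.05, 0.05), (100.0, -0.32, 0.47)]
toy_image = partons_to_image(toy_partons)
```

Two partons that fall into the same pixel simply add their energies, which is why a gluon landing in the same cell as its nearest quark cannot change the image.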
To ensure that the CNN is applied in the same regime where it was trained, we compute the "jet $p_T$" (the sum of the four parton $p_T$'s) and the "jet invariant mass" (the total invariant mass of the four partons) for each event, and apply the same cuts as in the training sample, $p_T \in [800, 900]$ GeV, $m_J \in [130, 210]$ GeV. The sample constructed in this way is referred to as the unmerged sample. We construct the merged sample by taking each event in the unmerged sample, identifying the quark closest to the gluon (in terms of $\Delta R_{ij}$ separation), and replacing that quark and the gluon with a single parton with 4-momentum equal to the sum of the two. Applying the CNN map to the unmerged and merged samples corresponds to evaluating the left-hand side and the right-hand side of Eq. (1), respectively. Checking the IR safety criterion then amounts to comparing the CNN outputs on these two samples. Training the CNN on particle-level top and QCD samples and applying it to the parton-level top sample produces the output distribution shown in Fig. 4. Clearly, the network predominantly still perceives such events as top-like, indicating that details of the parton shower pattern are not crucially important for recognizing an event as top-like. This already provides some evidence that the observable defined by the CNN is likely IR-safe. In the rest of this section, we will attempt to establish the IR safety more directly, by comparing CNN outputs on merged and unmerged samples as explained above.
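The construction of a merged event from an unmerged one, as described above, can be sketched in a few lines. The helper names and the toy kinematics below are hypothetical and serve only as an illustration; 4-momenta are stored as (E, px, py, pz) tuples, and partons are treated as massless when building the example.

```python
import math


def from_pt_eta_phi(pt, eta, phi):
    """Massless 4-momentum (E, px, py, pz) from pt, pseudorapidity, azimuth."""
    return (pt * math.cosh(eta), pt * math.cos(phi),
            pt * math.sin(phi), pt * math.sinh(eta))


def eta_phi(p):
    """Pseudorapidity and azimuth of a 4-momentum (requires pt > 0)."""
    _, px, py, pz = p
    return math.asinh(pz / math.hypot(px, py)), math.atan2(py, px)


def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation, with the azimuthal difference wrapped to [-pi, pi)."""
    d_phi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, d_phi)


def merge_gluon(quarks, gluon):
    """Replace the quark nearest to the gluon by the sum of the two 4-momenta."""
    g_eta, g_phi = eta_phi(gluon)
    nearest = min(range(len(quarks)),
                  key=lambda k: delta_r(*eta_phi(quarks[k]), g_eta, g_phi))
    merged = list(quarks)
    merged[nearest] = tuple(a + b for a, b in zip(quarks[nearest], gluon))
    return merged, nearest


# Made-up three-quark event with one extra, nearly collinear gluon.
quarks = [from_pt_eta_phi(400.0, 0.0, 0.0),
          from_pt_eta_phi(300.0, 0.3, 0.4),
          from_pt_eta_phi(150.0, -0.2, -0.3)]
gluon = from_pt_eta_phi(10.0, 0.28, 0.45)

merged, nearest = merge_gluon(quarks, gluon)
```

Summing 4-momenta conserves the total momentum of the event, so the merged and unmerged events have the same jet $p_T$ and invariant mass and pass the same kinematic cuts.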
To gauge the impact of soft/collinear gluon radiation, we compute the difference $\Delta_{NN}$ between the CNN output from an event in the unmerged sample and the corresponding event in the merged sample. A convenient measure of soft/collinear kinematics of the gluon is provided by its "relative $p_T$", defined by
$$p_T^g = \frac{|\vec{p}_g \times \vec{p}_q|}{|\vec{p}_q|},$$
where $\vec{p}_g$ is the 3-momentum of the gluon and $\vec{p}_q$ is the 3-momentum of the quark nearest (in terms of $\Delta R_{qg}$ separation) to the gluon. Physically, $p_T^g$ is the component of the gluon 3-momentum transverse to the nearest quark, and it vanishes in both soft and collinear limits. If the CNN observable is IR safe, we expect $\Delta_{NN}$ to go to zero in the limit of vanishing relative $p_T$. The distribution of $|\Delta_{NN}|$ and $p_T^g$ values in our event sample is shown in the left panel of Fig. 5, where each blue dot corresponds to an individual event. For most events, $|\Delta_{NN}|$ is small, which is reassuring: adding a soft gluon does not lead to a dramatic change in the CNN output. There is, however, a tail of events where the change is significant. To better characterize this tail, we bin the data in relative $p_T$ and calculate the width of the $|\Delta_{NN}|$ distribution in each bin. The width $|\Delta_{NN}|_{90}$ for each bin is defined by requiring that 90% of the events in that bin have $|\Delta_{NN}| < |\Delta_{NN}|_{90}$. The widths are plotted as red dots in Fig. 5. The data exhibits a clear correlation between decreasing relative $p_T$ and decreasing width, indicative of IR safety. In fact, the data is consistent with the hypothesis that $|\Delta_{NN}|_{90} \to 0$ in the limit of $p_T^g \to 0$.
In the right panel of Fig. 5, the data is further subdivided into 10 bins according to the NN output evaluated on the merged sample, and the dependence of the width on relative $p_T$ is shown separately for each bin [25]. For events in the last bin, $0.9 \leq Y \leq 1$, emission of an extra gluon has almost no effect even if it has a relatively large relative $p_T$. This is presumably due to the fact that $Y$ is already close to the upper boundary. The events in this bin are therefore consistent with the IR safety hypothesis, but do not show much variation as relative $p_T$ is varied. On the other hand, events in all other $Y$ bins show a very clear convergence between the output values with and without the extra gluon in the $p_T^g \to 0$ limit.
The relative $p_T$ observable goes to zero in both soft and collinear limits. It is interesting to probe the convergence of the CNN output in each of these limits separately. To this end, we study two observables. The first one is the angular separation between the gluon and the nearest quark, $\Delta R_{qg}$, which goes to zero in the collinear limit, but not in the soft limit. The second one is the "longitudinal momentum ratio", which compares the component of the gluon 3-momentum along $\vec{p}_q$ with the momentum of that quark, where $\vec{p}_q$ is again the 3-momentum of the quark nearest (in terms of $\Delta R_{qg}$ separation) to the gluon. This observable vanishes when the gluon is soft, but not when it is collinear with one of the quarks. The difference in CNN outputs for merged and unmerged samples as a function of these two observables is shown in Figs. 6 and 7. We conclude that the convergence of the outputs holds separately in both soft and collinear limits.
By construction, events in which the extra gluon lands in the same HCAL cell as its nearest quark will have $\Delta_{NN} = 0$. This feature makes the CNN observable automatically IR-safe in the limit of small $\Delta R$. Is the observed IR safety in this limit due entirely to this feature? To address this question, we repeated the analysis on a sample from which events where the gluon and a quark are in the same HCAL cell have been removed. The result is shown in Fig. 8. Convergence of CNN outputs in the $\Delta R \to 0$ limit persists in this sample. This indicates that the CNN output for a sample with an extra gluon converges smoothly as the gluon approaches its nearest quark, even if they do not land in the same cell. Such convergence is an intrinsic feature of the CNN's treatment of energy patterns, and not just a trivial consequence of finite cell size.
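The two quantities used in this section, the gluon's relative $p_T$ and the 90% width of $|\Delta_{NN}|$ in each bin, can be computed as in the sketch below. It assumes that the relative $p_T$ is the magnitude of the gluon 3-momentum component perpendicular to the nearest quark, as described in the text, and that the width is the smallest value containing at least 90% of the entries in a bin. All function names and the toy numbers are hypothetical; in a real analysis the $|\Delta_{NN}|$ values would come from evaluating the trained CNN on the merged and unmerged events.

```python
import math


def relative_pt(p_gluon, p_quark):
    """|p_g x p_q| / |p_q|: gluon momentum transverse to the quark direction."""
    gx, gy, gz = p_gluon
    qx, qy, qz = p_quark
    cx = gy * qz - gz * qy
    cy = gz * qx - gx * qz
    cz = gx * qy - gy * qx
    return math.sqrt(cx * cx + cy * cy + cz * cz) / math.sqrt(
        qx * qx + qy * qy + qz * qz)


def width_90(abs_deltas):
    """Smallest entry w such that at least 90% of the entries are <= w."""
    ordered = sorted(abs_deltas)
    # Integer ceiling of 0.9 * n, written to avoid floating-point rounding.
    k = -(-9 * len(ordered) // 10) - 1
    return ordered[k]


def widths_by_bin(events, edges):
    """90% width of |Delta_NN| in each relative-pT bin.

    `events` is a list of (relative_pt, abs_delta_nn) pairs and `edges`
    is a sorted list of bin edges. Empty bins yield None.
    """
    widths = []
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = [d for pt, d in events if low <= pt < high]
        widths.append(width_90(in_bin) if in_bin else None)
    return widths


# Made-up numbers, used only to exercise the functions.
perpendicular = relative_pt((0.0, 3.0, 0.0), (5.0, 0.0, 0.0))
collinear = relative_pt((2.0, 0.0, 0.0), (5.0, 0.0, 0.0))
toy_events = [(1.0, 0.01), (2.0, 0.02), (12.0, 0.30), (15.0, 0.10)]
toy_widths = widths_by_bin(toy_events, [0.0, 10.0, 20.0, 30.0])
```

The relative $p_T$ vanishes for the collinear toy gluon and would also vanish for a gluon of zero momentum, which is why it serves as a single measure of both limits.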

IV. DISCUSSION
Starting with Ref. [2], many studies have demonstrated the efficacy of Neural Networks for boosted top jet tagging, at the level of Monte Carlo (MC) simulations. All studies to date have trained and evaluated NN top taggers using particle-level MC samples of top and QCD jets. In this paper, a Convolutional Neural Network (CNN) top jet tagger was constructed. While particle-level MC samples were used in training as usual, we then applied the resulting tagger to a sample of parton-level top events with and without an additional gluon in the final state, as shown in Fig. 1. We showed that this observable obeys the Infrared Safety criterion: The output of the CNN applied to an event with an extra gluon approaches its output on the same event without an extra gluon, in the limit when the extra gluon becomes soft or collinear with one of the quarks.
Our analysis does not constitute a complete proof of IR safety of the CNN output. The reason is that we studied only one class of final states, those containing a high-$p_T$ top quark. In the language of Eq. (1), our analysis demonstrates IR safety for some configurations of final-state parton momenta $p_i$, but not for general $p_i$. We did not study the behavior of the CNN output on non-top events, mainly because there is a very broad range of possible momentum configurations, making a comprehensive numerical study impractical. Moreover, it is not immediately clear which of these configurations are most important for the top tagging problem. We hope to be able to address this issue in future work.
In spite of this limitation, the results of the analysis presented in this paper are highly reassuring. Certainly, the value of the NN-based approach to top jet tagging would be in very serious doubt if the addition of a soft or collinear parton to the final state led to an order-one change in the NN output. We showed that this is not the case for events with a genuine high-$p_T$ top quark in the final state. This result places NN-based taggers on a firmer foundation, and should provide encouragement for further development of this approach.

FIG. 5. Left panel: Difference in CNN output between merged and unmerged events, $|\Delta_{NN}|$, as a function of the gluon transverse momentum relative to its nearest quark, $p_T^g$. Red dots show the width of the $|\Delta_{NN}|$ distribution. Right panel: $|\Delta_{NN}|$ width as a function of $p_T^g$, shown separately for 10 NN output bins.

FIG. 6. Left panel: Difference in CNN output between merged and unmerged events, $|\Delta_{NN}|$, as a function of the gluon's angular separation from its nearest quark, $\Delta R_{qg}$. Red dots show the width of the $|\Delta_{NN}|$ distribution. Right panel: $|\Delta_{NN}|$ width as a function of $\Delta R_{qg}$, binned in 10 NN output intervals.

FIG. 7.

FIG. 8. Same distributions as in Fig. 6. Events where the gluon and its nearest quark are in the same HCAL cell have been removed from the sample.