Deep-learning Top Taggers or The End of QCD?

Machine learning based on convolutional neural networks can be used to study jet images from the LHC. Top tagging in fat jets offers a well-defined framework to establish our DeepTop approach and compare its performance to QCD-based top taggers. We first optimize a network architecture to identify top quarks in Monte Carlo simulations of the Standard Model production channel. Using standard fat jets we then compare its performance to a multivariate QCD-based top tagger. We find that both approaches lead to comparable performance, establishing convolutional networks as a promising new approach for multivariate hypothesis-based top tagging.


Introduction
Geometrically large, so-called fat jets have proven to be an exciting as well as useful new analysis direction for many LHC Run I and Run II analyses. Their jet substructure allows us to search for hadronic decays for example of Higgs bosons [1], weak gauge bosons [2], or top quarks [3][4][5][6][7][8] in a shower evolution otherwise described by QCD radiation [9]. Given this success, a straightforward question to ask is whether we can analyze the same jet substructure patterns without relying on advanced QCD algorithms. An example of such an approach are wavelets, describing patterns of hadronic weak boson decays [10,11]. Even more generally, we can apply image recognition techniques to the two-dimensional azimuthal angle vs rapidity plane, for example searching for hadronic decays of weak bosons [12][13][14][15] or top quarks [16]. The same techniques can be applied to separate quark-like and gluon-like jets [17].
Many of the available machine learning applications in jet physics have in common that we do not have an established, well-performing QCD approach to compare to. Instead, machine learning techniques are motivated by their potential to actually make such analyses possible. Our study focuses on the question of how machine learning compares to state-of-the-art top taggers in a well-defined fat jet environment, i. e. highly successful QCD-based tagging approaches established at the LHC. Such a study allows us to answer the question of whether QCD-based taggers have a future in hadron collider physics at all, or whether they should and will eventually be replaced with simple pattern recognition.
On the machine learning side we will use algorithms known as convolutional neural networks [18]. Such deep learning techniques are routinely used in computer vision, targeting image or face recognition, as well as in natural language processing. In jet physics the basic idea is to view the azimuthal angle vs rapidity plane with calorimeter entries as a sparsely filled image, where the filled pixels correspond to the calorimeter cells with non-zero energy deposits and the pixel intensities to the deposited energy. After some image pre-processing, a training sample of signal and background images can be fed through a convolutional network, designed to learn the signal-like and background-like features of these jet images [13,17]. The final layer of the network converts the learned features of the image into a probability of it being either signal or background. The performance can be expressed in terms of its receiver operator characteristic (ROC), in complete analogy to multivariate top tagger analyses [19,20].

Multivariate analysis tools
Top tagging is a typical (binary) classification problem. Given a set of variables {x i } we predict a signal or background label y. In general, we train a classifier on a data set with known labels and then test the performance of the classifier on a different data set.
Rectangular cuts are sufficient if the variables x i contain orthogonal information and the signal region variable space is simply connected. A decision tree as a classifier is especially useful if there are several disconnected signal regions, or if the shape of the signal region is not a simple box. The classification is based on a sequence of cuts to separate signal from background events, where each criterion depends on the previous decision. Boosting sequentially trains a collection of decision trees, where in each step the training data is reweighted according to the result of the previous classifier. The final classification of the boosted decision tree (BDT) is based on the vote of all classifiers, and leads to an increased performance and more stable results. BDTs are part of the standard LHC toolbox, including modern top taggers [20]. We will use them for the QCD-based taggers in our comparison.
Artificial Neural Networks (ANN) mimic sets of connected neurons. Each neuron (node) combines its inputs linearly, including biases, and yields an output based on a nonlinear activation function. The usual implementation are feed-forward networks, where the input for a node is given by a subset of the outputs of the nodes in the previous layer. Here, nodes in the first layer work on the {x i }, and the last layer performs the classification. The internal layers are referred to as hidden layers. The internal parameters of a network, i. e. the weights and biases of the nodes, are obtained by minimizing a so-called cost or loss function. Artificial networks with more than one or two hidden layers are referred to as deep neural networks (DNN). ANNs and DNNs are frequently used in LHC analyses [21].
In image and pattern recognition convolutional networks (ConvNets) have shown impressive results. Their main feature is the structure of the input, where for example in a two-dimensional image the information of neighboring pixels should be correlated. If we attempt to extract features in the image with standard DNN and fully connected neurons in each layer to all pixels, the construction scales poorly with the dimensionality of the image. Alternatively, we can first convolute the pixels with linear kernels or filters. The convoluted images are referred to as feature maps. On all pixels of the feature map we can apply a non-linear activation function, such that the feature maps serve as input for further convolution layers, where the kernels mix information from all input feature maps. After the last convolution step, the pixels of the feature maps are fed to a standard DNN. While the convolution layers allow for the identification of features in the image, the actual classification is performed by the DNN. While an arbitrarily large non-convolutional DNN should be able to learn features in the image directly, the convolution layers lead to much faster convergence of the model. Image recognition in terms of ConvNets has only recently been tested for LHC applications [13,17]. The machine learning side of our comparison will be based on ConvNets.

Image recognition
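Image recognition includes many operations, which we will briefly review in this section. The convolutional neural network starts from a two-dimensional input image and identifies characteristic patterns using a stack of convolutional layers. We use a set of standard operations, starting from the $n \times n$ image input $I$:
-ZeroPadding: $(n \times n) \to (n+2 \times n+2)$ We artificially increase the image by adding zeros at all boundaries in order to remove dependence on non-trivial boundary conditions,
$$ I \to \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & I & \vdots \\ 0 & \cdots & 0 \end{pmatrix} . \tag{1.1} $$
-Convolution: $n'_\text{c-kernel} \times (n \times n) \to n_\text{c-kernel} \times ((n - n_\text{c-size} + 1) \times (n - n_\text{c-size} + 1))$ To identify features in an $n \times n$ image or feature map we linearly convolute the input with $n_\text{c-kernel}$ kernels of size $n_\text{c-size} \times n_\text{c-size}$. If in the previous step there are $n'_\text{c-kernel} > 1$ layers, the kernels are moved over all input layers. For each kernel this defines a feature map $F^k$ which mixes information from all input layers,
$$ F^k_{ij} = \sum_{l=0}^{n'_\text{c-kernel}-1} \; \sum_{r,s=0}^{n_\text{c-size}-1} W^{kl}_{rs} \, I^l_{i+r,j+s} + b_k \qquad \text{for } k = 0, \dots, n_\text{c-kernel}-1 . \tag{1.2} $$
-Activation: $(n \times n) \to (n \times n)$ This non-linear element allows us to create more complex features. A common choice is the rectified linear activation function (ReLU), which sets pixels with negative values to zero, $f_\text{act}(x) = \max(0, x)$. In this case we define for example
$$ f_\text{act}(F^k_{ij}) = \max \left( 0, F^k_{ij} \right) . \tag{1.3} $$
Instead of introducing an additional unit performing the activation, it can also be considered as part of the previous layer.
-Pooling: $(n \times n) \to (n/p \times n/p)$ We can reduce the size of the feature map by dividing the input into patches of fixed size $p \times p$ (sub-sampling) and assigning a single value to each patch. MaxPooling returns the maximum value of the subsample, $f_\text{pool}(F) = \max_\text{patch}(F_{ij})$.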
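The four operations above can be illustrated with a minimal pure-Python sketch. It is not the Theano/Keras implementation used in this study; the function names, the toy image, and the kernel values are assumptions chosen for illustration, and a padding width of one pixel is assumed.

```python
def zero_pad(image, width=1):
    """Surround a square image with `width` rings of zeros."""
    n = len(image)
    size = n + 2 * width
    padded = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            padded[i + width][j + width] = float(image[i][j])
    return padded


def convolve(layers, kernel, bias=0.0):
    """Slide one kernel over all input layers and return one feature map.

    layers: list of n x n input layers
    kernel: one c x c filter per input layer, indexed as kernel[l][r][s]
    """
    n, c = len(layers[0]), len(kernel[0])
    out = n - c + 1  # the output shrinks because no padding is done here
    return [[bias + sum(kernel[l][r][s] * layers[l][i + r][j + s]
                        for l in range(len(layers))
                        for r in range(c)
                        for s in range(c))
             for j in range(out)]
            for i in range(out)]


def relu(feature_map):
    """Set pixels with negative values to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, p=2):
    """Keep the maximum of each non-overlapping p x p patch."""
    n = len(feature_map) // p  # incomplete patches at the edge are dropped
    return [[max(feature_map[p * i + a][p * j + b]
                 for a in range(p) for b in range(p))
             for j in range(n)]
            for i in range(n)]


# Toy 4 x 4 image and a single 2 x 2 kernel (illustrative values only).
image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
kernel = [[[1.0, 0.0],
           [0.0, -1.0]]]

padded = zero_pad(image)                        # 6 x 6
feature_map = relu(convolve([padded], kernel))  # 5 x 5
pooled = max_pool(feature_map)                  # 2 x 2
```

In this toy case the padded 6 × 6 image shrinks to a 5 × 5 feature map under the 2 × 2 kernel, and pooling with p = 2 reduces it to 2 × 2.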
A convolutional layer consists of a ZeroPadding, Convolution, and Activation step each. We then combine $n_\text{c-layer}$ of these layers, followed by a pooling step, into a block. Each of our $n_\text{c-block}$ blocks therefore works with essentially the same size of the feature maps, while the pooling step between the blocks strongly reduces the size of the feature maps. This ConvNet setup efficiently identifies structures in two-dimensional jet images, encoded in a set of kernels $W$ transforming the original picture into a feature map. In a second step of our analysis the ConvNet output constitutes the input of a fully connected DNN, which translates the feature map into an output label $y$:
-Flattening: $(n \times n) \to n^2$ While the ConvNet uses two-dimensional inputs and produces a set of corresponding feature maps, the actual classification is done by a DNN in one dimension. The transition between the formats reads
$$ x = (F_{11}, \dots, F_{1n}, \dots, F_{n1}, \dots, F_{nn}) . \tag{1.5} $$
-Fully connected (dense) layers: $n^2 \to n_\text{d-node}$ The output of a standard DNN is the weighted sum of all inputs, including a bias, passed through an activation function. Using rectified linear activation it reads
$$ y_i = \max \Big( 0, \sum_{j} W_{ij} x_j + b_i \Big) . \tag{1.6} $$
For the last layer we apply a specific SoftMax activation function
$$ y_i = \frac{\exp \big( \sum_j W_{ij} x_j + b_i \big)}{\sum_k \exp \big( \sum_j W_{kj} x_j + b_k \big)} . \tag{1.7} $$
It ensures $y_i \in [0, 1]$, so the label can be interpreted as a signal or background probability.
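The flattening step, a dense layer with rectified linear activation, and the SoftMax output layer can be sketched in a few lines of pure Python. The function names and the toy weights are assumptions for illustration only, and the SoftMax subtracts the largest input before exponentiating purely for numerical stability.

```python
import math


def flatten(feature_maps):
    """Turn a list of two-dimensional feature maps into one flat vector x."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense_relu(x, weights, biases):
    """Weighted sum of all inputs plus bias, followed by max(0, .)."""
    return [max(0.0, sum(w * xj for w, xj in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def dense_softmax(x, weights, biases):
    """Weighted sum plus bias, followed by a SoftMax over the output nodes."""
    z = [sum(w * xj for w, xj in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    z_max = max(z)  # subtracting the maximum leaves the result unchanged
    e = [math.exp(v - z_max) for v in z]
    total = sum(e)
    return [v / total for v in e]


# One 2 x 2 feature map, one hidden layer with two nodes, two output nodes.
x = flatten([[[1.0, 2.0], [3.0, 4.0]]])
hidden = dense_relu(x, [[0.1] * 4, [-0.1] * 4], [0.0, 0.0])
probs = dense_softmax(hidden, [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
```

The two entries of `probs` are non-negative and sum to one, which is what allows them to be read as signal and background probabilities.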
In a third step we define a cost or loss function, which we use to train our network on a training data set. For a fixed architecture a parameter point $\theta$ is given by the ConvNet weights $W^{kl}_{rs}$ defined in Eq.(1.2) combined with the DNN weights $W_{ij}$ and biases $b_i$ defined in Eq.(1.6). We minimize the mean squared error
$$ L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( y(\theta; x_i) - y_i \big)^2 , $$
where $y(\theta; x_i)$ is the predicted binary label of the input $x_i$, $y_i$ is its true value, and $N$ is the number of training images. This choice of loss function does not optimize the learning performance or the probabilistic information, but it will work fine for our purpose. Eventually, it could for example be replaced by the cross entropy. For a given parameter point $\theta$ we compute the gradient of the loss function $L(\theta)$ and first shift the parameter point from $\theta_n$ to $\theta_{n+1}$ against the direction of the gradient $\nabla L(\theta_n)$. In addition, we can include the direction of the previous change such that the combined shift in parameter space is
$$ \theta_{n+1} = \theta_n - \eta_L \nabla L(\theta_n) + \alpha \, (\theta_n - \theta_{n-1}) . $$
The learning rate $\eta_L$ determines the step size and can be chosen to decay with each step (decay rate). The parameter $\alpha$, referred to as momentum, dampens the effect of rapidly changing gradients and improves convergence. The Nesterov algorithm changes the point of evaluation of the gradient to
$$ \nabla L(\theta_n) \to \nabla L \big( \theta_n + \alpha \, (\theta_n - \theta_{n-1}) \big) . $$
Each training step (epoch) uses the full set of training events.
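The mean squared error and a momentum update with the Nesterov look-ahead can be sketched on a one-parameter toy problem. The model y = θ x, the data points, and the values of the learning rate and momentum below are assumptions chosen so that the example converges quickly; they are not the settings of our network.

```python
def mse(predictions, truths):
    """Mean squared error between predicted and true labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, truths)) / len(truths)


def nesterov_step(theta, theta_prev, grad, eta, alpha):
    """One update theta_n -> theta_{n+1}.

    The gradient is evaluated at the look-ahead point
    theta_n + alpha * (theta_n - theta_{n-1}) instead of at theta_n.
    """
    look_ahead = [t + alpha * (t - tp) for t, tp in zip(theta, theta_prev)]
    g = grad(look_ahead)
    return [p - eta * gi for p, gi in zip(look_ahead, g)]


# Toy problem: fit y = theta * x, the minimum of the loss is at theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def loss(theta):
    return mse([theta[0] * x for x in xs], ys)


def grad(theta):
    """Analytic gradient of the mean squared error for the toy model."""
    return [sum(2.0 * (theta[0] * x - y) * x for x, y in zip(xs, ys)) / len(xs)]


theta_prev = theta = [0.0]
for _ in range(50):
    theta, theta_prev = nesterov_step(theta, theta_prev, grad,
                                      eta=0.05, alpha=0.5), theta
```

Setting `alpha=0.0` in this sketch reduces the update to plain gradient descent.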

Machine learning setup
The goal of our analysis is to determine if a machine learning approach to top tagging at the LHC offers a significant advantage over the established QCD-based taggers, and to understand the learning pattern of such a convolutional network. To reliably answer this question we build a flexible neural network setup, define an appropriate interface with LHC data through the jet images, and optimize the ConvNet/DNN architecture and parameters for top tagging in fat jets. To build our neural network we use the Python package Theano [22], with a Keras front-end [23]. An optimized speed or CPU usage is not part of our performance study.

Jet images and pre-processing
The basis of our study are calorimeter images, which we produce using standard Monte Carlo simulations; obviously, in an actual application they should come from data. In recent years, many strategies have been developed to define appropriate signal samples which allow us to benchmark top taggers. Typically, they rely on top pair production with an identified leptonic top decay recoiling against a top jet. The lepton kinematics can then be used to estimate the transverse momentum of the hadronically decaying top quark. We simulate a 14 TeV LHC hadronic tt̄ sample and a QCD dijet sample with Pythia8 [24], ignoring multiple interactions. While one could clearly include pile-up in the simulations, understanding and removing it requires information beyond the calorimeter images, for example from tracks. For our early study we do not include track information in our jet image; some ideas in this direction are pointed out in Ref. [17]. All events are passed through a fast detector simulation with Delphes3 [25] with calorimeter towers of size ∆η × ∆φ = 0.1 × 5° and a threshold of 1 GeV. We cluster these towers with FastJet3 [26] to R = 1.5 anti-k T [27] jets with |η| < 1.0. These anti-k T jets give us a smooth outer shape of the fat jet and a well-defined jet area for our jet image. To ensure that the jet substructure in the jet image is consistent with QCD we re-cluster the anti-k T jet constituents with an R = 1.5 C/A jet [28]. Its substructures define the actual jet image. When we identify calorimeter towers with pixels of the jet image, it is not clear whether the information should be the energy E or only its transverse component E T .
Our fat jets have to fulfill |η fat | < 1.0, to guarantee that they are entirely in the central part of the detector and to justify our calorimeter tower size. For this paper we focus on the range p T,fat = 350 ... 450 GeV, such that all top decay products can be easily captured in the fat jet. For signal events, we require that the fat jet can be associated with a Monte-Carlo truth top quark within ∆R < 1.2.
We can speed up the learning process or illustrate the ConvNet performance by applying a set of pre-processing steps:
1. Find maxima: before we can align any image we have to identify characteristic points. Using a filter of size 3 × 3 pixels, we localize the three leading maxima in the image;
2. Shift: we then shift the image to center the global maximum, taking into account the periodicity in the azimuthal angle direction;
3. Rotation: next, we rotate the image such that the second maximum is in the 12 o'clock position. The interpolation is done linearly;
4. Flip: next, we flip the image to ensure that the third maximum is in the right half-plane;
5. Crop: finally, we crop the image to 40 × 40 pixels.
Throughout the paper we will apply two pre-processing setups: for minimal pre-processing we apply steps 1, 2 and 5 to define a centered jet image of given size. Alternatively, for full pre-processing we apply all five steps. In figure 1 we show averaged signal and background images based on the transverse energy from 10,000 individual images after full pre-processing. The leading subjet is in the center of the image, the second subjet is in the 12 o'clock position, and a third subjet from the top decay is smeared over the right half of the signal images. These images indicate that fully pre-processed images might lose a small amount of information at the end of the 12 o'clock axis.
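The shift step can be sketched as follows. This is a simplified illustration rather than our actual pre-processing code: it centers on the single brightest pixel instead of a maximum found with a 3 × 3 filter, and it assumes that rows index η (non-periodic) while columns index φ and cover the full circle, so that columns wrap around. The function name is hypothetical.

```python
def center_on_maximum(image):
    """Shift the image so that its brightest pixel lands on the central pixel.

    Rows index eta: pixels shifted beyond the edge are dropped.
    Columns index phi: the shift wraps around periodically.
    """
    n_eta, n_phi = len(image), len(image[0])
    i_max, j_max = max(((i, j) for i in range(n_eta) for j in range(n_phi)),
                       key=lambda ij: image[ij[0]][ij[1]])
    d_eta = n_eta // 2 - i_max
    d_phi = n_phi // 2 - j_max
    shifted = [[0.0] * n_phi for _ in range(n_eta)]
    for i in range(n_eta):
        i_new = i + d_eta
        if not 0 <= i_new < n_eta:
            continue  # this row leaves the image in the eta direction
        for j in range(n_phi):
            shifted[i_new][(j + d_phi) % n_phi] = image[i][j]
    return shifted


# Toy 5 x 5 image: the maximum sits at the phi edge, a second deposit sits
# on the other side of the periodic boundary.
image = [[0.0] * 5 for _ in range(5)]
image[0][4] = 10.0
image[0][0] = 3.0
centered = center_on_maximum(image)
```

After the shift the two deposits, which are neighbors across the φ boundary in the input, end up next to each other in the center row.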
A non-trivial pre-processing step is the shift in the η direction, since the jet energy E is not invariant under a longitudinal boost. Following Ref. [13] we investigate the effect on the mass information contained in the images,
$$ m^2 = \Big( \sum_i p_i \Big)^2 , \qquad p_i = E_i \left( 1, \frac{\cos \phi_i}{\cosh \eta_i}, \frac{\sin \phi_i}{\cosh \eta_i}, \frac{\sinh \eta_i}{\cosh \eta_i} \right) , $$
where $\eta_i$ and $\phi_i$ are the center of the ith pixel after pre-processing, and $E_i = E_{T,i} \cosh \eta_i$ for the $E_T$ images. The study of all pre-processing steps and their effect on the image mass in figure 2 illustrates that indeed the rapidity shift has the largest effect on the E images, but this effect is not large. For the E T images the jet mass distribution is unaffected by the shift pre-processing step. The reason why our effect on the E images is much milder than the one observed in Ref. [13] is our condition |η fat | < 1. In the lower panels of figure 2 we illustrate the effect of pre-processing on fat jets with |η| > 2, where the image masses change dramatically. Independent of these details we use pre-processed E T images as our machine learning input [18,22,23,29]. Since networks prefer small numbers, we scale the images to keep most pixel entries between 0 and 1.
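The different behavior of E and E T images under a shift in η can be checked numerically. The sketch below treats every pixel as a massless four-vector pointing at the pixel center, which is an assumption about how the image mass is built; the function names and the three toy pixels are illustrative only.

```python
import math


def image_mass(pixels, use_et=True):
    """Invariant mass of an image built from massless pixels.

    pixels: list of (value, eta, phi); `value` is the transverse energy
    if use_et is True and the energy otherwise.
    """
    e = px = py = pz = 0.0
    for value, eta, phi in pixels:
        pt = value if use_et else value / math.cosh(eta)
        e += pt * math.cosh(eta)
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative round-off


def shift_eta(pixels, delta):
    """Move all pixels by the same amount in eta, keeping their content."""
    return [(value, eta + delta, phi) for value, eta, phi in pixels]


# Three toy energy deposits (value in GeV, eta, phi).
pixels = [(100.0, 0.3, 0.0), (80.0, -0.2, 0.4), (40.0, 0.1, -0.5)]
shifted = shift_eta(pixels, 1.0)

m_et, m_et_shifted = image_mass(pixels), image_mass(shifted)
m_e, m_e_shifted = image_mass(pixels, False), image_mass(shifted, False)
```

With fixed E T per pixel a common shift in η acts like a longitudinal boost, so `m_et` and `m_et_shifted` agree up to rounding, while the mass of the E image changes.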

Network architecture
To identify a suitable DeepTop network architecture, we scan over several possible realizations or hyper-parameters. As discussed in the last section, we start with jet images of size 40 × 40. For architecture testing we split our total signal and background samples of 600,000 images each into three sub-samples: training (150,000 signal and background events each), validation/optimization (150,000 signal and background events each), and final test (300,000 signal and background events each). Networks are trained on the training sample. No early stopping is performed, but the set of weights minimizing loss on the validation/optimization sample is used to avoid overfitting.
In a first step we need to optimize our network architecture. The ConvNet side is organized in n c-block blocks, each containing n c-layer sequences of ZeroPadding, Convolution and Activation steps. For activation we choose the ReLU step function while weights are initialized by drawing from a Glorot uniform distribution [30]. Inside each block the size of the feature maps can be slightly reduced due to boundary effects. For each convolution we globally set a filter size or convolutional size n c-size × n c-size . The global number of kernels of corresponding feature maps is given by n c-kernel . Two blocks are separated by a pooling step, in our case using MaxPooling, which significantly reduces the size of the feature maps. For a quadratic pool size of p × p fitting into the n × n size of each feature map, the initial size of the new block's input feature maps is n/p × n/p. The final output feature maps are used as input to a DNN with n d-layer fully connected layers and n d-node nodes per layer.  In the left panel of figure 3 we show the performance of some test architectures. We give the complete list of tested hyper-parameters in Tab. 1. As our default we choose one of the best-performing networks on the validation/optimization sample after explicitly ensuring its stability with respect to changing its hyper-parameters. The hyper-parameters of the default network we use for fully as well as minimally pre-processed images are given in Tab. 1. In figure 4 we illustrate this default architecture.
In the second step we train each network architecture using the mean squared error as loss function and the Nesterov algorithm with an initial learning rate η L = 0.003 and no momentum. Training is performed on mini-batches with a size of 1000 images per batch. We train our default setup over up to 1000 epochs and use the network configuration minimizing the loss function calculated on the validation/optimization sample. Different learning parameters were used to ensure convergence when training on the minimally pre-processed and the scale-smeared samples. Because the DNN output is a signal and background probability, the minimum signal probability required for signal classification is a parameter that allows us to link the signal efficiency ε S with the mis-tagging rate of background events ε B . In section 3 we will use this trained network to test the performance in terms of ROC curves, correlating the signal efficiency and the mis-tagging rate.
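How a cut on the signal probability links the signal efficiency to the mis-tagging rate can be sketched as a simple threshold scan. The function name and the toy classifier outputs are assumptions for illustration.

```python
def roc_points(signal_scores, background_scores, thresholds):
    """Return (cut, signal efficiency, mis-tagging rate) for each cut.

    An image is tagged as signal if its signal probability is at least `cut`.
    """
    points = []
    for cut in thresholds:
        eff_s = sum(s >= cut for s in signal_scores) / len(signal_scores)
        eff_b = sum(b >= cut for b in background_scores) / len(background_scores)
        points.append((cut, eff_s, eff_b))
    return points


# Toy classifier outputs for four signal and four background images.
signal_scores = [0.9, 0.8, 0.6, 0.3]
background_scores = [0.7, 0.4, 0.2, 0.1]
points = roc_points(signal_scores, background_scores, [0.0, 0.5, 1.0])
```

Scanning the cut from 0 to 1 traces out the ROC curve, from accepting everything to accepting nothing.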
Before we move to the performance study, we can get a feeling for what is happening inside the trained ConvNet by looking at the output of the different layers in the case of fully pre-processed images. In figure 5 we show the difference of the averaged output for 100 signal and 100 background images. For each of those two categories, we require a classifier output of at least 0.8. Each row illustrates the output of a convolutional layer. Signal-like red areas are typical for jet images originating from top decays; blue areas are typical for backgrounds. The first layer seems to consistently capture a well-separated second subjet,  and some kernels of the later layers seem to capture the third signal subjet in the right half-plane. While one should keep in mind that there is no one-to-one correspondence between the location in feature maps of later layers and the pixels in the input image, we will discuss these kinds of structures in the jet image below.
In figure 6 we show the same kind of intermediate result for the two fully connected DNN layers. Each of the 64 linear bars represents a node of the layer. We see that individual nodes are quite distinctive for signal and background images, but they cannot be linked to any pattern in the jet image. This illustrates how the two-dimensional ConvNet approach is more promising than a regular neural net. The fact that some nodes are not discriminative indicates that in the interest of speed the number of nodes could be reduced slightly. The output of the DNN is essentially the same as the probabilities shown in the right panel of figure 3, ignoring the central probability range between 20% and 80%.
To see which pixels of the fully pre-processed 40 × 40 jet image have an impact on the signal vs background label, we can correlate the deviation of a pixel $x_{ij}$ from its mean value $\bar{x}_{ij}$ with the deviation of the label $y$ from its mean value $\bar{y}$. A properly normalized correlation function for a given set of combined signal and background images can be defined as
$$ r_{ij} = \frac{\sum_\text{images} (x_{ij} - \bar{x}_{ij}) (y - \bar{y})}{\sqrt{\sum_\text{images} (x_{ij} - \bar{x}_{ij})^2} \, \sqrt{\sum_\text{images} (y - \bar{y})^2}} . $$
It is usually referred to as the Pearson correlation coefficient. From the definition we see that for a signal probability $y$ positive values of $r_{ij}$ indicate signal-like patterns. In figure 7 we show this correlation for our network architecture. A large energy deposition in the center leads to classification as background. A secondary energy deposition in the 12 o'clock position combined with additional energy deposits in the right half-plane leads to a classification as signal. This is consistent with our expectations after full pre-processing, shown in figure 1.
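The pixel-by-pixel Pearson correlation with the output label can be computed directly from its definition. The sketch below is a plain-Python illustration with a hypothetical function name and a toy set of 1 × 2 images; pixels that never vary are assigned a correlation of zero, which is a convention adopted here.

```python
import math


def pixel_label_correlation(images, labels):
    """Pearson correlation r_ij between pixel (i, j) and the label y."""
    n_img = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    y_mean = sum(labels) / n_img
    var_y = sum((y - y_mean) ** 2 for y in labels)
    r = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            x = [img[i][j] for img in images]
            x_mean = sum(x) / n_img
            cov = sum((xv - x_mean) * (y - y_mean) for xv, y in zip(x, labels))
            var_x = sum((xv - x_mean) ** 2 for xv in x)
            if var_x > 0.0 and var_y > 0.0:
                r[i][j] = cov / math.sqrt(var_x * var_y)
    return r


# Toy sample: the first pixel fires for signal, the second for background.
images = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]], [[0.0, 1.0]]]
labels = [1.0, 0.0, 1.0, 0.0]
r = pixel_label_correlation(images, labels)
```

In this toy sample the first pixel is fully correlated and the second fully anti-correlated with the signal label.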

Performance test
Given our optimized machine learning setup introduced in section 2 and the fact that we can understand its workings and trust its outcome, we can now compare its performance with state-of-the-art top taggers. The details of the signal and background samples and jet images are discussed in section 2.1; essentially, we attempt to separate a top decay inside a fat jet from a QCD fat jet including fast detector simulation and for the transverse momentum range p T,fat = 350 ... 450 GeV. Other transverse momentum ranges for the fat jet can be targeted using the same DNN method. Because we focus on comparing the performance of the DNN approach with the performance of standard multivariate top taggers we take our Monte Carlo training and testing sample as a replacement of actual data. This distinguishes our approach from tagging methods which use Monte Carlo simulations for training, like the Template Tagger [5]. This means that for our performance test we do not have to include uncertainties in our Pythia simulations compared to other Monte Carlo simulations and data [32].

QCD-based taggers
Acting on the same calorimeter entries in the rapidity vs azimuthal angle plane which define the jet image, we can employ QCD-based algorithms to determine the origin of a given configuration. Based on QCD jet algorithms, for example the multivariate HEPTopTagger2 [6,7,19,20] extracts hard substructures using a mass drop condition [1] with a given f drop = 0.8. Provided that at least three hard substructures exist, different constraints on the invariant masses of combinations of filtered substructures [1] define a top tag. One of the features of the HEPTopTagger is that even in the multivariate analysis it will always identify a top candidate with a three-prong decay and the correct reconstructed top mass. An alternative approach is to groom the fat jet using the SoftDrop criterion [33]
$$ \frac{\min(p_{T1}, p_{T2})}{p_{T1} + p_{T2}} > z_\text{cut} \left( \frac{\Delta R_{12}}{R_0} \right)^\beta , $$
and employ the groomed jet mass. It can be thought of as a combination of the p T -drop criterion [4] with a soft-collinear extension of pruning [34]. We use the SoftDrop parameters β = 1 and z cut = 0.2. The main difference between the HEPTopTagger and SoftDrop is that the latter does not explicitly target the top and W decays, needs an additional condition on a mass scale to work as a tagger, and will not reconstruct the top 4-momentum.
Because of these much weaker constraints on the top candidate kinematics a SoftDrop construction is ideally suited for a hypothesis test differentiating between fat QCD jets and top decay jets.
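The SoftDrop condition on a single splitting can be written as a one-line test. The sketch below only checks the standard criterion for one pair of subjets and does not perform the recursive declustering of the actual algorithm; the function name is hypothetical, and the reference distance R 0 = 1.5 is assumed to equal the fat jet radius. The values β = 1 and z cut = 0.2 follow the text.

```python
def passes_soft_drop(pt1, pt2, delta_r, z_cut=0.2, beta=1.0, r0=1.5):
    """Check the SoftDrop condition for a splitting into two subjets.

    pt1, pt2: transverse momenta of the two subjets
    delta_r:  their distance in the rapidity vs azimuthal angle plane
    r0:       reference distance (assumed here to be the fat jet radius)
    """
    z = min(pt1, pt2) / (pt1 + pt2)
    return z > z_cut * (delta_r / r0) ** beta


# A fairly symmetric splitting passes; a soft wide-angle emission fails.
symmetric = passes_soft_drop(300.0, 100.0, 0.6)
soft_wide = passes_soft_drop(390.0, 10.0, 1.5)
```

In the full algorithm a failing splitting leads to the softer subjet being dropped, after which the test is repeated on the harder one.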
The QCD shower-based taggers alone are known to not fully use the available calorimeter information. However, they can be complemented by a simple observable quantifying the number of constituents inside the fat jet or the number of prongs in the top decay. Adding the N-subjettiness [36] variables
$$ \tau_N = \frac{1}{R_0 \sum_k p_{T,k}} \sum_k p_{T,k} \, \min \left( \Delta R_{1,k}, \Delta R_{2,k}, \dots, \Delta R_{N,k} \right)^\beta $$
to the HEPTopTagger or SoftDrop picks up this additional information and also induces the three-prong top decay structure into SoftDrop. We use N k T -axes, β = 1 and the reference distance R 0 . A small value of τ N indicates consistency with N or fewer substructure axes, so an N-prong decay gives rise to a small ratio τ N /τ N−1 . For top tagging τ 3 /τ 2 is particularly useful in combination with QCD taggers in a multivariate setup [20]. The N-subjettiness variables τ j can be defined based on the complete fat jet or based on the fat jet after applying the SoftDrop criterion. Using τ j and τ sd j in a multivariate analysis usually leads to optimal results.
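For a given set of axes, N-subjettiness is a short sum over the jet constituents. The sketch below uses the standard definition and takes the axes as input, since finding the k T axes requires a jet clustering step that is not reproduced here. The function names, the toy constituents, and the value R 0 = 1.5 are assumptions; the normalization uses R 0 to the power β, which reduces to R 0 for β = 1.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in the eta-phi plane with the periodicity in phi."""
    d_phi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, d_phi)


def tau_n(constituents, axes, beta=1.0, r0=1.5):
    """N-subjettiness for N = len(axes).

    constituents: list of (pt, eta, phi)
    axes:         list of (eta, phi), assumed to be given
    """
    norm = r0 ** beta * sum(pt for pt, _, _ in constituents)
    total = sum(pt * min(delta_r(eta, phi, a_eta, a_phi)
                         for a_eta, a_phi in axes) ** beta
                for pt, eta, phi in constituents)
    return total / norm


# Idealized three-prong jet: each constituent sits exactly on one axis.
constituents = [(200.0, 0.0, 0.0), (100.0, 0.6, 0.0), (80.0, 0.0, 0.7)]
tau2 = tau_n(constituents, [(0.0, 0.0), (0.6, 0.0)])
tau3 = tau_n(constituents, [(0.0, 0.0), (0.6, 0.0), (0.0, 0.7)])
```

For this idealized three-prong configuration τ 3 vanishes while τ 2 does not, so the ratio τ 3 /τ 2 is small, as expected for a top-like jet.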

Comparison
To benchmark the performance of our DeepTop DNN, we compare its ROC curve with standard Boosted Decision Trees based on the C/A jets using SoftDrop combined with N -subjettiness. From figure 3 we know the spread of performance for the different network architectures for fully pre-processed images. In figure 8 we see that minimal pre-processing actually leads to slightly better results, because the combination of rotation and cropping described in section 2.1 leads to a small loss in information. Altogether, the band of different machine learning results indicates how large the spread of performance will be whenever for example binning issues in p T,fat are taken into account, in which case we would no longer be using the perfect network for each fat jet.
For our BDT we use GradientBoost in the Python package sklearn [29] with 200 trees, a maximum depth of 2, a learning rate of 0.1, and a sub-sampling fraction of 90% for the kinematic variables
$$ \{ m_\text{sd}, m_\text{fat}, \tau_2, \tau_3, \tau_2^\text{sd}, \tau_3^\text{sd} \} \qquad \text{(SoftDrop + N-subjettiness)} , $$
where $m_\text{fat}$ is the un-groomed mass of the fat jet. This is similar to standard experimental approaches for our transverse momentum range p T,fat = 350 ... 450 GeV. In addition, we include the HEPTopTagger2 information from filtering combined with a mass drop criterion,
$$ \{ m_\text{sd}, m_\text{fat}, m_\text{rec}, f_\text{rec}, \Delta R_\text{opt}, \tau_2, \tau_3, \tau_2^\text{sd}, \tau_3^\text{sd} \} \qquad \text{(MotherOfTaggers)} . \tag{3.5} $$
In figure 8 we compare these two QCD-based approaches with our best neural networks. First, we see that both QCD-based BDT analyses and the two neural network setups are close in performance. Indeed, adding HEPTopTagger information slightly improves the SoftDrop+N -subjettiness setup, reflecting the fact that our transverse momentum range is close to the low-boost scenario where one should rely on the better-performing HEPTopTagger. Second, we see that the difference between the two pre-processing scenarios is in the same range as the difference between the different approaches. Running the DeepTop framework over signal samples with a 2-prong W decay to two jets with m W = m t and over signal samples with a shifted value of m t we have confirmed that the neural network setup learns both the number of decay subjets and the mass scale.
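The BDT settings quoted above map onto keyword arguments of scikit-learn's `GradientBoostingClassifier`. The sketch below only collects these hyper-parameters and shows how a classifier could be built from them; it is not our analysis code, the names `BDT_PARAMS` and `build_bdt` are hypothetical, and all remaining options are left at the library defaults, which is an assumption.

```python
# Hyper-parameters quoted in the text, expressed as scikit-learn keywords.
BDT_PARAMS = {
    "n_estimators": 200,   # number of trees
    "max_depth": 2,        # maximum depth of each tree
    "learning_rate": 0.1,
    "subsample": 0.9,      # sub-sampling fraction of 90%
}


def build_bdt():
    """Create a gradient-boosted classifier with the settings above.

    The import is done inside the function so that the parameter
    dictionary can be inspected without scikit-learn being installed.
    """
    from sklearn.ensemble import GradientBoostingClassifier
    return GradientBoostingClassifier(**BDT_PARAMS)


# Typical usage, with one row of kinematic variables per fat jet
# (X_train, y_train and X_test are placeholders for user-provided data):
#   bdt = build_bdt()
#   bdt.fit(X_train, y_train)
#   signal_probability = bdt.predict_proba(X_test)[:, 1]
```

The signal probability returned by `predict_proba` plays the same role as the DNN output label when scanning a ROC curve.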
Following up on the observation that the neural network and the QCD-based taggers show similar performance in tagging a boosted top decay inside a fat jet, we can check what kind of information is used in this distinction.
Both for the DNN and for the MotherOfTaggers BDT output we can study signal-like learned patterns in actual signal events by cutting on the output label y corresponding to the 30% most signal-like events shown on the right of figure 3. Similarly, we can require the 30% most background-like events to test if the background patterns are learned correctly. In addition, we can compare the kinematic distributions in both cases to the Monte Carlo truth. In figure 9 we show the distributions for m fat and τ 3 /τ 2 , both part of the set of observables defined in Eq. (3.5). We see that the DNN and BDT tagger indeed learn essentially the same structures. The fact that their results are more pronounced signal-like than the Monte Carlo truth is linked to our stiff cut on y, which for the DNN and BDT tagger cases removes events where the signal kinematic features are less pronounced. That the MotherOfTaggers curves for the signal are more peaked than the DeepTop curves is due to the fact that the observables are exactly the basis choice of the BDT, while for the neural network they are derived quantities. In App. A we extend this comparison to more kinematic observables. Finally, a relevant question is to what degree the information used by the neural network is dominated by low-p T effects. We can apply a cutoff, for example including only pixels with a transverse energy deposition E T > 5 GeV. This is the typical energy scale where the DNN performance starts to degrade, as we discuss in more detail in App. B.

Conclusions
Fat jets which can include the decay products of a boosted, hadronically decaying top quark are an excellent basis to establish machine learning based on fat jet images and compare their performance to QCD-based top taggers. Here, machine learning is the logical next step after developing multivariate top taggers which test QCD vs top decay hypotheses rather than identifying and reconstructing an actual top decay. This includes the assumption that our ConvNet DeepTop approach will be trained purely on data.
We have constructed a ConvNet setup, inspired by standard image recognition techniques [13,17]. To optimize the network architecture, train the network, and test the performance we have used independent event samples. First, we have found that changes in the network architecture only have a small impact on the top tagging performance. Pre-processing the fat jet images is useful to visualize, understand, and follow the network learning procedure for example using the Pearson correlation coefficient, but has little influence on the network performance.
As a baseline we have constructed a MotherOfTaggers QCD-based top tagger, implemented as a multivariate BDT. This allowed us to quantify the performance of the DeepTop network and to test which kinematic observables in the fat jet have been learned by the neural network. We have also confirmed that the neural network is not dominated by low-p T calorimeter entries and is extraordinarily stable with respect to changing the jet energy scale. In figure 8 we found that the performance of the two approaches is comparable, giving us all the freedom to define future experimental strategies for top tagging, ranging from proper top reconstruction to multivariate hypothesis testing and, finally, data-based machine learning.

A What the machine learns
For our performance comparison of the QCD-based tagger approach and the neural network it is crucial that we understand what the DeepTop network learns in terms of physics variables. The relevant jet substructure observables differentiating between QCD jets and top jets are those which we evaluate in the MotherOfTaggers BDT, Eq.(3.5).
To quantify which signal features the DNN and the BDT tagger have correctly extracted we show observables for signal events correctly identified as such, i. e. requiring events with a classifier response y corresponding to the 30% most signal-like events. Following figure 3 this cut value captures a large fraction of correctly identified events. We do the same for the 30% most background-like events identified by each classifier. The upper two rows in figure 10 show the different mass variables describing the fat jet. We see that the DNN and the BDT tagger results are consistent, with a slightly better performance of the BDT tagger for clear signal events. For the background the BDT output is more pronounced as well. The deviation from the true mass for the HEPTopTagger background performance is explained by the fact that many events with no valid top candidate return m rec = 0. Aside from generally comforting results we observe a peculiarity: the SoftDrop mass identifies the correct top mass in fewer than half of the correctly identified signal events, while the fat jet mass m fat does correctly reproduce the top mass. The reason why the SoftDrop mass is nevertheless an excellent tool to identify top decays is that its background distribution peaks at very low values, around m sd ≈ 20 GeV. Even for m sd ≈ m W the hypothesis test between top signal and QCD background can clearly identify a massive particle decay.
In the third row we see that the HEPTopTagger W -to-top mass ratio f rec only has little significance for the transverse momentum range studied. For the optimalR variable ∆R opt [20] the DNN and the BDT tagger again give consistent results. Finally, for the N -subjettiness ratio τ 3 /τ 2 before and after applying the SoftDrop criterion the results are again consistent for the two tagging approaches.
Following up on the observation that SoftDrop shows excellent performance as a hypothesis test, we show in figure 11 the reconstructed transverse momenta of the fat jet, or the top quark for signal events. In the left panel we see that the transverse momentum of the un-groomed fat jet reproduces our Monte-Carlo range p T,fat = 350 ... 450 GeV. While the transverse momentum distributions of signal and background are very similar, applying the BDT or DNN induces a bias which indicates a transverse momentum dependent tagger response. The transverse momentum dependence is larger for the DNN. A tagger turn-on with transverse momentum is unproblematic and can be mitigated using adversarial training techniques [35] if needed. In the right panel we see that the constituents identified by the SoftDrop criterion have a significantly altered transverse momentum spectrum. To measure the transverse momentum of the top quark we therefore need to rely on a top identification with SoftDrop, but a top reconstruction based on the (groomed) fat jet properties.

B Detector effects
A key question for the tagging performance is the dependence on the activation threshold. Figure 12 shows the impact of different thresholds on the pixel activation, i. e. E T used both for training and testing the networks. Removing very soft activity, below 3 GeV, only slightly degrades the network's performance. Above 3 GeV the threshold leads to an approximately linear decrease in background rejection with increasing threshold.
A second, important experimental systematic uncertainty when working with calorimeter images is the jet energy scale (JES). We assess the stability of our network by evaluating the performance on jet images where the E T pixels are globally rescaled by ±25%. As shown in the right panel of figure 12 this leads to a decline in the tagging performance of approximately 10% when reducing the JES and 5% when increasing the JES.
Next, we train a hardened version of the network. It uses the same architecture as our default, but during the training procedure each image is randomly rescaled using a Gaussian distribution with a mean of 1.0 and a width of 0.1. New random numbers are used from epoch to epoch. The resulting network has a similar performance as the default and exhibits a further reduced sensitivity to changes in the global JES.
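The global rescaling used for the JES test and the random per-image smearing used for the hardened training can be sketched as follows. The function names and the toy images are assumptions; the Gaussian mean of 1.0 and width of 0.1 follow the text, and a fresh factor is drawn for every image each time an epoch is prepared.

```python
import random


def rescale(image, factor):
    """Multiply all pixels of an image by a common factor."""
    return [[factor * v for v in row] for row in image]


def smeared_epoch(images, rng, mean=1.0, width=0.1):
    """Return a copy of the training images for one epoch.

    Every image receives its own Gaussian scale factor; calling this
    function again for the next epoch draws new random numbers.
    """
    return [rescale(img, rng.gauss(mean, width)) for img in images]


# Toy training set of three identical 1 x 2 images.
rng = random.Random(1)
images = [[[1.0, 2.0]] for _ in range(3)]

jes_up = rescale(images[0], 1.25)      # global JES shift of +25%
jes_down = rescale(images[0], 0.75)    # global JES shift of -25%
epoch_1 = smeared_epoch(images, rng)
epoch_2 = smeared_epoch(images, rng)
```

Because the factor is common to all pixels of an image, the relative pattern inside each image is preserved and only its overall scale changes.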
While other distortions of the image, such as non-uniform rescaling, will need to be considered, the resilience of the network and our ability to further harden it are very encouraging for experimental usage where the mitigation and understanding of systematic uncertainties is critical.