Jet-images — deep learning edition

Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can outperform standard physically-motivated feature-driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. This interplay between physically-motivated feature-driven tools and supervised learning algorithms is general and can be used to significantly increase the sensitivity to discover new particles and new forces, and to gain a deeper understanding of the physics within jets.


Introduction
Collimated sprays of particles, called jets, resulting from the production of high energy quarks and gluons provide an important handle to search for signs of physics beyond the Standard Model (SM) at the Large Hadron Collider (LHC). In many extensions of the SM, there are new, heavy particles that decay to heavy SM particles such as W, Z, and Higgs bosons as well as top quarks. As is often the case, the mass of the SM particles is much smaller than the mass of the new particles and so they are imparted with a large Lorentz boost. As a result, the SM particles from the boosted boson and top quark decays are highly collimated in the lab frame and may be captured by a single jet. Classifying the origin of these jets and differentiating them from the overwhelming Quantum Chromodynamic (QCD) multijet background is a fundamental challenge for searches with jets at the LHC. Jets from boosted bosons and top quarks have a rich internal substructure. There is a wealth of literature addressing the topic of jet tagging by designing physics-inspired features to exploit the jet substructure (see e.g. Refs. [1][2][3]). However, in this paper we address the challenge of jet tagging through the use of Machine Learning (ML) and Computer Vision (CV) techniques combined with low-level information, rather than directly using physics-inspired features. In doing so, we not only improve discrimination power, but also gain new insight into the underlying physical processes that provide discrimination power by extracting information learned by such ML algorithms.
The analysis presented here is an extension of the jet-images approach, first introduced in Ref. [4] and then also studied with similar approaches by Ref. [5], whereby jets are represented as images with the energy depositions of the particles within the jet serving as the pixel intensities. When first introduced, jet image pre-processing techniques based on the underlying physics symmetries of the jets were combined with a linear Fisher discriminant to perform jet tagging and to study the learned discrimination information. Here, we make use of modern deep neural network (DNN) architectures, which have been found to outperform competing algorithms in CV tasks similar to jet tagging with jet images. While such DNNs are significantly more complex than Fisher discriminants, they also provide the capability to learn rich high-level representations of jet images and to greatly enhance discrimination power. By developing techniques to access this rich information, we can explore and understand what has been learned by the DNN and subsequently improve our understanding of the physics governing jet substructure. We also re-examine the jet pre-processing techniques, to specifically analyze the impact of the pre-processing on the physical information contained within the jet.
Automatic feature extraction and high-level learned feature representations via deep learning have led to state-of-the-art performance in Computer Vision [6][7][8]. The focus of this work is on robust network architectures, used to investigate what information and higher-level representations a fully-connected multi-layer network and a convolutional neural network learn about jets. There will be a focus on connecting the gains in performance with the underlying physical properties of jets through visualization.
This paper is organized as follows: The details of the simulated data sets and the definition of jet-images are described in Section 2. The pre-processing techniques, including new insights into the relationship with underlying physics information, are discussed in Section 3. We then introduce the deep neural network architectures that we use in Section 4. The discrimination performance and the exploration of the information learned by the DNNs are presented in Section 5.
To simulate highly boosted W bosons, a hypothetical W′ boson is generated and forced to decay to a hadronically decaying W boson (W → qq) and a Z boson which decays invisibly (Z → νν). The mass of the W′ boson determines the Lorentz boost of the W boson in the lab frame, since the W′ is produced nearly at rest and the W boson momentum is approximately $m_{W'}/2$. The invisible decay of the Z boson ensures that the jet in the event with the highest transverse momentum is the W boson jet. Multijet production of quarks and gluons is simulated as a background. Both the W′ signal and the multijet background are generated using Pythia 8.170 [25,26] at $\sqrt{s} = 14$ TeV. The minimum angular separation of the W boson decay products in the plane transverse to the beam direction scales as $2m_W/p_{T,W}$, where $m_W \approx 80$ GeV and $p_{T,W}$ is the component of the W boson momentum in this plane. The tagging strategy and performance depend strongly on $p_{T,W}$, so we focus on a particular range: 250 GeV < $p_{T,W}$ < 300 GeV. This corresponds to an angular spread of about $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2} \sim 0.6$, where ∆η and ∆φ are the distances between W boson decay products in (η, φ) coordinates. The decay products of the W bosons as well as the background are clustered into jets using the anti-$k_t$ algorithm [27] via FastJet [28] 3.0.3. To mitigate the contribution from the underlying event, jets are trimmed [29] by re-clustering the constituents into R = 0.3 $k_t$ subjets and dropping those which have $p_T^{\text{subjet}} < 0.05 \times p_T^{\text{jet}}$. Trimming also reduces the impact of multiple proton-proton collisions occurring in the same event as the hard-scatter process (pileup). We leave investigation of the robustness of the neural network performance to pileup for future studies.
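As a quick numerical check of the scaling quoted above, the following minimal Python sketch evaluates $2m_W/p_{T,W}$ at the edges of the selected range and applies the trimming criterion to a list of subjet transverse momenta. The function names are hypothetical and introduced only for illustration; the actual clustering and trimming are performed with FastJet, which this sketch does not attempt to reproduce.

```python
import math

M_W = 80.0  # GeV, approximate W boson mass quoted in the text


def min_opening_angle(pt_w, m_w=M_W):
    """Approximate angular separation of the W decay products: 2 m_W / p_T."""
    return 2.0 * m_w / pt_w


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in (eta, phi), with delta-phi wrapped into [-pi, pi)."""
    deta = eta1 - eta2
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(deta, dphi)


def trim(subjet_pts, jet_pt, f_cut=0.05):
    """Drop subjets with pT < f_cut * jet pT (the trimming criterion)."""
    return [pt for pt in subjet_pts if pt >= f_cut * jet_pt]


# Edges of the 250-300 GeV range: roughly 0.64 and 0.53, i.e. about 0.6.
low_edge = min_opening_angle(250.0)
high_edge = min_opening_angle(300.0)
```

For a 250 GeV jet, for example, `trim([200.0, 40.0, 5.0], 250.0)` keeps the first two subjets, because the threshold is 12.5 GeV.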
Three key jet features for distinguishing between W jets and QCD jets are the jet mass, n-subjettiness [30], and the distance in (η, φ) space between subjets of the trimmed jet (∆R). The distributions of these three discriminating variables are shown in Fig. 1. The jet mass is defined as $m_{\text{jet}}^2 = \sum_{i,j} p_i \cdot p_j$, with jet constituent four-vectors $p_i$, and is a proxy for the boson mass in the case of W boson events. In the case of QCD background jets, the jet mass scales with the transverse momentum and the size of the jet. N-subjettiness, in the form of $\tau_{21}$, is a measure of the likelihood that the jet has two hard prongs instead of one hard prong. In this application, the winner-takes-all axis [31] is used to define the axis in the $\tau_{21}$ calculation. One other useful feature is the jet transverse momentum. However, since many of the other features have a strong dependence on the jet transverse momentum, we re-weight the signal so that it has the same $p_T$ distribution as the background.
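The $p_T$ re-weighting can be illustrated with a simple histogram ratio. The sketch below is an assumption about one reasonable way to derive such weights (per-bin background count divided by per-bin signal count); the binning, the function names, and the toy inputs are hypothetical and not taken from the analysis.

```python
def bin_index(value, edges):
    """Index of the bin [edges[b], edges[b+1]) containing value, or None."""
    for b in range(len(edges) - 1):
        if edges[b] <= value < edges[b + 1]:
            return b
    return None


def signal_pt_weights(signal_pt, background_pt, edges):
    """Per-jet signal weights that match the binned background pT spectrum."""
    n_bins = len(edges) - 1
    sig_counts = [0] * n_bins
    bkg_counts = [0] * n_bins
    for pt in signal_pt:
        b = bin_index(pt, edges)
        if b is not None:
            sig_counts[b] += 1
    for pt in background_pt:
        b = bin_index(pt, edges)
        if b is not None:
            bkg_counts[b] += 1
    weights = []
    for pt in signal_pt:
        b = bin_index(pt, edges)
        # Jets outside the binned range get zero weight.
        weights.append(0.0 if b is None else bkg_counts[b] / sig_counts[b])
    return weights


weights = signal_pt_weights(
    [255.0, 255.0, 255.0, 285.0],   # toy signal pT values (GeV)
    [255.0, 285.0, 285.0, 285.0],   # toy background pT values (GeV)
    [250.0, 275.0, 300.0],          # hypothetical bin edges
)
```

With this choice the weighted signal histogram equals the background histogram bin by bin, so the sum of the signal weights equals the number of background jets in the binned range.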
To model the discretization and finite acceptance of a real detector, a calorimeter of towers with size 0.1 × 0.1 in (η, φ) extends out to η = 5.0. The energies of the simulated particles incident upon a particular tower are added as scalars, and the four-vector $p_j$ of any particular tower j is given by

$$p_j = \sum_{i \in \text{tower } j} E_i \left( \frac{\cos\phi_j}{\cosh\eta_j},\ \frac{\sin\phi_j}{\cosh\eta_j},\ \frac{\sinh\eta_j}{\cosh\eta_j},\ 1 \right),$$

where $E_i$ is the energy of particle i and the center of the tower j is $(\eta_j, \phi_j)$. Towers are treated as massless.
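The tower construction described above can be sketched in a few lines of Python. The component ordering (px, py, pz, E) is an assumption chosen to match the example four-vectors used later in the pre-processing discussion, and the function name is hypothetical.

```python
import math


def tower_four_vector(energies, eta_j, phi_j):
    """Massless four-vector (px, py, pz, E) of a tower centred at (eta_j, phi_j).

    The particle energies are summed as scalars and the direction is taken
    from the tower centre, so E^2 - |p|^2 = 0 by construction.
    """
    energy = sum(energies)
    pt = energy / math.cosh(eta_j)
    return (
        pt * math.cos(phi_j),
        pt * math.sin(phi_j),
        pt * math.sinh(eta_j),
        energy,
    )


tower = tower_four_vector([10.0, 5.0], eta_j=0.5, phi_j=1.0)
```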
A jet image is formed by taking the constituents of a jet and discretizing its energy into pixels in (η, φ), with the intensity of each pixel given by the sum of the energy of all constituents of the jet inside that (η, φ) pixel. We also investigate the use of the transverse projection of the energy in each tower as the pixel intensity. In our studies, we take the jet image pixelation to match the simulated calorimeter tower granularity. In the next section, we will discuss the nuances of standardizing the coordinates of a jet image as a pre-processing step prior to applying machine learning.
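A minimal sketch of the pixelation step is given below. The 0.1 pixel size follows the tower granularity stated in the text; the image size of 25 × 25 pixels, the choice of η as the column index, and the function name are assumptions made only for this illustration.

```python
import math


def jet_image(constituents, n_pixels=25, pixel_size=0.1):
    """Sum constituent energies into an n_pixels x n_pixels grid.

    constituents: iterable of (eta, phi, energy), with eta and phi measured
    relative to the image centre. Constituents outside the grid are ignored.
    """
    half_width = n_pixels * pixel_size / 2.0
    image = [[0.0] * n_pixels for _ in range(n_pixels)]
    for eta, phi, energy in constituents:
        col = math.floor((eta + half_width) / pixel_size)
        row = math.floor((phi + half_width) / pixel_size)
        if 0 <= row < n_pixels and 0 <= col < n_pixels:
            image[row][col] += energy
    return image


image = jet_image([
    (0.00, 0.00, 10.0),
    (0.01, 0.02, 5.0),    # falls in the same central pixel
    (0.33, -0.41, 2.0),
    (5.00, 0.00, 1.0),    # outside the image, dropped
])
```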

Pre-processing and the Symmetries of Space-time
In order for the machine learning algorithms to most efficiently learn discriminating features between signal and background and to not learn the symmetries of space-time, the jet images are pre-processed. This procedure can greatly improve performance and reduce the required size of the sample used for testing. Our pre-processing procedure happens in four steps: translation, rotation, re-pixelation, and inversion. To begin, the jet images are translated so that the leading subjet is at (η, φ) = (0, 0). Translations in φ are rotations around the z-axis and so the pixel intensity is unchanged by this operation. On the other hand, translations in η are Lorentz boosts along the z-axis, which do not preserve the pixel intensity. Therefore, a proper translation in η would modify the pixel intensity. One simple modification of the jet image to circumvent this change is to replace the pixel intensity $E_i$ with the transverse energy $p_{T,i} = E_i / \cosh(\eta_i)$. This new definition of intensity is invariant under translations in η and is used exclusively for the rest of this paper.
The second step of pre-processing is to rotate the images around the center of the jet. If a jet has a second subjet, then the rotation is performed so that the second subjet is at −π/2. If no second subjet exists, then the jet image is rotated so that the first principal component of the pixel intensity distribution is aligned along the vertical axis. Unless the rotation is by an integer multiple of π/2, the rotated grid will not line up with the original grid. Therefore, the energy in the rotated grid must be re-distributed amongst the pixels of the original image grid. A cubic spline interpolation is used in this case; see Ref. [4] for details. The last step is a parity flip so that the right side of the jet image has the highest sum pixel intensity.
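The translation, rotation, and parity-flip steps can be sketched on a list of constituents before pixelation. This is an illustrative simplification under several assumptions: η is taken as the horizontal axis, wrapping of φ is ignored, the re-pixelation with cubic spline interpolation is omitted entirely, and all names are hypothetical.

```python
import math


def transverse_energy(energy, eta):
    """Pixel intensity that is invariant under translations in eta."""
    return energy / math.cosh(eta)


def preprocess(constituents, leading, subleading):
    """Translate, rotate, and flip a list of (eta, phi, pT) constituents.

    leading and subleading are the (eta, phi) axes of the two hardest subjets.
    """
    # 1. Translation: put the leading subjet at the origin.
    eta0, phi0 = leading
    points = [(eta - eta0, phi - phi0, pt) for eta, phi, pt in constituents]
    sub_x, sub_y = subleading[0] - eta0, subleading[1] - phi0

    # 2. Rotation: bring the sub-leading subjet to the angle -pi/2.
    alpha = -math.pi / 2.0 - math.atan2(sub_y, sub_x)
    c, s = math.cos(alpha), math.sin(alpha)
    points = [(c * x - s * y, s * x + c * y, pt) for x, y, pt in points]

    # 3. Parity flip: make the right half carry the larger summed intensity.
    right = sum(pt for x, _, pt in points if x > 0)
    left = sum(pt for x, _, pt in points if x < 0)
    if left > right:
        points = [(-x, y, pt) for x, y, pt in points]
    return points


processed = preprocess(
    [(1.0, 2.0, 100.0), (1.0, 2.6, 60.0), (0.7, 2.3, 10.0)],
    leading=(1.0, 2.0),
    subleading=(1.0, 2.6),
)
```

In this toy example the sub-leading subjet ends up directly below the leading one, at (0, −0.6), and the softer third constituent lies on the right-hand side.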
Figure 2 shows the average jet image for W boson jets and QCD jets before and after the rotation, re-pixelation, and parity flip steps of the pre-processing. The more pronounced second subjet can already be observed in the left plots of Fig. 2, where there is a clear annulus for the signal W jets which is nearly absent for the background QCD jets. However, after the rotation, the second core of energy is well isolated and localized in the images. The spread of energy around the leading subjet is more diffuse for the QCD background which consists largely of gluon jets, which have an octet radiation pattern, compared to the singlet radiation pattern of the W jets, where the radiation is mostly restricted to the region between the two hard cores.
Figure 2: The average jet image for signal W jets (top) and background QCD jets (bottom) before (left) and after (right) applying the rotation, re-pixelation, and inversion steps of the pre-processing. The average is taken over images of jets with 240 GeV < $p_T$ < 260 GeV and 65 GeV < mass < 95 GeV.
One standard pre-processing step that is often additionally applied in Computer Vision tasks is normalization. A common normalization scheme is the $L^2$ norm, such that $\sum_i I_i^2 = 1$, where $I_i$ is the intensity of pixel i. This is particularly useful for jet images, where pixel intensities can span many orders of magnitude and where there are large pixel intensity variations between images. In this study, the jet transverse momenta are all around 250 GeV, but this can be spread amongst many pixels or concentrated in only a few. The $L^2$ norm helps mitigate the spread and thus makes training easier for the machine learning algorithm. However, normalization can distort information contained within the jet image. Some information, such as the Euclidean distance ∆R between subjets in (η, φ), is invariant under all of the pre-processing steps as well as normalization. However, consider the image mass,

$$m_I^2 = \sum_{i,j} E_i E_j \left(1 - \cos\theta_{ij}\right),$$

where $E_i = I_i \cosh(\eta_i)$ for pixel intensity $I_i$ and $\theta_{ij}$ is the angle between massless four-vectors with η and φ at the i and j pixel centers. The image mass is not invariant under all pre-processing steps but does encode key information to identify highly boosted bosons that would ideally be preserved by the pre-processing step. As discussed earlier, with the proper choice of pixel intensity, translations preserve the image mass since it is a Lorentz invariant quantity. However, the rotation pre-processing step does not preserve the image mass. To understand this effect, consider two four-vectors: $p^\mu = (1, 0, 0, 1)$ and $q^\mu = (0, 1, 0, 1)$. The invariant mass of these vectors is $\sqrt{2}$. The vector $p^\mu$ is at the center of the jet image coordinates and the vector $q^\mu$ is located at an angle of π/2. If we rotate the image around the jet axis so that the vector $q^\mu$ is at an angle of 0, akin to rotating the jet image so that the sub-leading subjet goes from π/2 to 0, then $p^\mu$ is unchanged but $q^\mu \to (1, 0, \sinh(1), \cosh(1))$. The new invariant mass of $q^\mu$ and $p^\mu$ is about 1, which is reduced from its original value of $\sqrt{2}$. The parity inversion pre-processing step does not impact the image mass, but an $L^2$ normalization does modify the image mass. The easiest way to see this is to take a series of images with exactly the same image mass but variable $L^2$ norm. The map $I_i \to I_i / \sqrt{\sum_j I_j^2}$ modifies the mass by $m_I \to m_I / \sqrt{\sum_j I_j^2}$, and so the variation in the normalizations induces a smearing in the jet-image mass distribution.
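The two-vector example can be checked numerically. The sketch below uses the (px, py, pz, E) component ordering of the vectors quoted in the text and takes the rotated vector exactly as stated there; the helper names are hypothetical.

```python
import math


def four_vector(pt, eta, phi):
    """Massless four-vector (px, py, pz, E) from pT, eta, phi."""
    return (
        pt * math.cos(phi),
        pt * math.sin(phi),
        pt * math.sinh(eta),
        pt * math.cosh(eta),
    )


def invariant_mass(vectors):
    """Invariant mass of the sum of (px, py, pz, E) four-vectors."""
    px, py, pz, e = (sum(v[k] for v in vectors) for k in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def scaled(vectors, factor):
    """Rescale every component, as a global intensity normalization would."""
    return [tuple(factor * c for c in v) for v in vectors]


p = four_vector(1.0, 0.0, 0.0)                 # (1, 0, 0, 1)
q_before = four_vector(1.0, 0.0, math.pi / 2)  # (0, 1, 0, 1)
q_after = four_vector(1.0, 1.0, 0.0)           # (1, 0, sinh 1, cosh 1)

m_before = invariant_mass([p, q_before])  # sqrt(2)
m_after = invariant_mass([p, q_after])    # sqrt(2 cosh(1) - 2), about 1.04
m_scaled = invariant_mass(scaled([p, q_before], 0.5))
```

The last line shows the normalization effect: rescaling all intensities by a common factor rescales the mass by the same factor, so image-to-image variations in the norm smear the mass distribution.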
The impact of the various stages of pre-processing on the image mass is illustrated in Fig. 3. The finite segmentation of the simulated detector slightly degrades the jet mass resolution, but the translation and parity inversion (flip) have no impact, by construction, on the jet mass. The rotation with the biggest potential impact on the image mass is a rotation by π/2 (maximally changing η and φ), which does lead to a small change in the mass distribution. A translation in η that uses the pixel energy as the intensity instead of the transverse momentum, which we refer to as a naive translation, or an $L^2$ normalization scheme both significantly broaden the mass distribution. One way to quantify the amount of information in the jet mass that is lost by various pre-processing steps is shown in the Receiver Operating Characteristic (ROC) curve of Fig. 4, which shows the inverse of the background efficiency versus the signal efficiency for passing a threshold on the signal-to-background likelihood ratio of the mass distribution (as described in Section 5). Information about the mass is lost when the ability to use the mass to differentiate signal and background is diminished. The naive translation and the $L^2$ normalization schemes are significantly worse than the other image mass curves, which are themselves similar in performance.
Figure 3: The distribution of the image mass after various stages of pre-processing for signal jets (left) and background jets (right). The No pixelation line is the jet mass without any detector granularity and without any pre-processing. Only pixelation has only detector granularity but no pre-processing, and all subsequent lines have this pixelation applied as well as translation to center the image at the origin. The translation is called naive when the energy is used as the pixel intensity instead of the pixel transverse momentum. Flip denotes the parity inversion operation and the $p_T^2$ norm is an $L^2$ normalization scheme. The naive translation and the $L^2$ normalization image masses are both multiplied by constants so that the centers of the distributions are roughly in the same location as for the other distributions.
the sparse nature of these images. However, since speed is not our driving force in this work, we used convolution implementations defined for dense inputs. We also study fully connected MaxOut networks [7]. Other architectures were also studied, such as Stacked Denoising Autoencoders [32] and multi-layer fully connected networks with various activation functions, but we found that convolution and MaxOut networks were the most performant.
As a brief aside, we discuss some of the key neural network concepts which are used in the following section to describe our network architectures. Fully connected (FC) layers take all features as input. Convolution networks utilize convolution filters (or kernels), which are a set of weights W that operate linearly on a small n × n (horizontal × vertical) patch of the input image. For instance, a 3 × 3 filter takes as input a 3 × 3 patch of pixels and outputs $z = \sum_{i,j=1}^{3} x_{ij} W_{ij}$, where $x_{ij}$ is the input image patch. The filter output can be considered as centered on that patch. Each filter is convolved with the input image, in that the filter is applied to a given input patch and then moved horizontally and/or vertically to a new input patch on which the filter is applied. By scanning over the entire image in this way, the filter is convolved with the input, producing a convolved output. An important consideration when using convolutional networks is how one handles borders of images. Two main options exist: one can consider only n × n patches that are fully contained within the input images, or one can consider every convolution that has at least one pixel from the image, zero-padding as necessary to create valid convolutions. We use the latter, as we found better performance and better, more physics-driven filters.
Figure 4: The tradeoff between W boson (signal) jet efficiency and inverse QCD (background) efficiency for various pre-processing algorithms applied to the jet (images). The No pixelation line is the jet mass without any detector granularity and without any pre-processing. Only pixelation has only detector granularity but no pre-processing, and all subsequent lines have this pixelation applied as well as translation to center the image at the origin. The translation is called naive when the energy is used as the pixel intensity instead of the pixel transverse momentum. Flip denotes the parity inversion operation and the $p_T^2$ norm is an $L^2$ normalization scheme.
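The two border-handling options for convolution filters can be made concrete with a small pure-Python sketch. The function and the mode labels "valid" and "full" are introduced here for illustration; the filter is applied without flipping, matching the expression $z = \sum_{i,j} x_{ij} W_{ij}$ in the text.

```python
def convolve2d(image, kernel, mode="full"):
    """Slide an n x n kernel over a 2D image (lists of lists).

    mode="valid": only patches fully contained in the image.
    mode="full":  every patch overlapping at least one image pixel,
                  obtained by zero-padding n - 1 pixels on each side.
    """
    n = len(kernel)
    pad = n - 1 if mode == "full" else 0
    height, width = len(image), len(image[0])
    padded = [[0.0] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    for r in range(height):
        for c in range(width):
            padded[r + pad][c + pad] = image[r][c]
    out_h = height + 2 * pad - n + 1
    out_w = width + 2 * pad - n + 1
    return [
        [
            sum(padded[r + i][c + j] * kernel[i][j]
                for i in range(n) for j in range(n))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


ones_image = [[1.0] * 5 for _ in range(5)]
ones_kernel = [[1.0] * 3 for _ in range(3)]
valid_out = convolve2d(ones_image, ones_kernel, mode="valid")  # 3 x 3
full_out = convolve2d(ones_image, ones_kernel, mode="full")    # 7 x 7
```

The zero-padded option enlarges the output by n − 1 pixels in each dimension, and outputs near the border sum over fewer image pixels than those in the interior.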
A non-linear activation function is typically applied to these convolution outputs, for which we use the Rectified Linear Unit (ReLU) [33] that takes an input z and outputs max{0, z}. ReLUs have been found to improve network training time, whilst having enough non-linear behavior to not degrade network performance. In addition, Rectified Linear Units do not suffer from a vanishing gradient, and speed up computation time while allowing for sparse networks by having true zero-valued activations. After convolution (+ activation) layers, a non-linear down-sampling is frequently performed using Max-pooling [34], which takes non-overlapping patches of convolution outputs as input and outputs the maximum value for each patch. A conceptual visualization of the convolution + Max-pooling network architecture that we employ can be seen in Figure 5.
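A minimal sketch of the ReLU activation followed by non-overlapping max-pooling is shown below. Dropping incomplete patches at the image edge is an assumption of this sketch, and the function names are hypothetical.

```python
def relu(z):
    """Rectified Linear Unit: max{0, z}."""
    return max(0.0, z)


def max_pool(feature_map, size):
    """Maximum over non-overlapping size x size patches.

    Incomplete patches at the right and bottom edges are dropped.
    """
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, width - size + 1, size)
        ]
        for r in range(0, height - size + 1, size)
    ]


conv_output = [
    [1.0, -2.0, 3.0, 0.0],
    [0.0, 5.0, -1.0, 2.0],
    [-3.0, -4.0, 0.0, 0.0],
    [7.0, -8.0, 0.0, -9.0],
]
activated = [[relu(z) for z in row] for row in conv_output]
pooled = max_pool(activated, 2)
```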
Finally, the MaxOut network makes use of the dense (Fully Connected) MaxOut activation unit, which takes an input vector x, computes k linear weightings $z_j = \sum_i x_i W_{ij} + b_j$ for $j \in [1, k]$, and outputs $\max_{j \in [1,k]} z_j$. Natural extensions of MaxOut layers to convolutional units exist, but were not examined. Conceptually, one can view the Rectified Linear Unit as a special case of the MaxOut with k = 2 and with one of the weightings forced to output only zero. Though MaxOut units do not force sparsity of activation outputs in the same way as ReLU units, MaxOut networks provide the desirable attribute that they pair nicely with the model averaging effects of dropout in a natural way [7].
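The MaxOut unit and its relation to the ReLU can be written out directly. In this sketch the weights are stored as one list per linear piece (the transpose of the $W_{ij}$ layout in the text), a choice made purely for readability; the names are hypothetical.

```python
def maxout(x, weights, biases):
    """MaxOut unit: maximum over k linear weightings of the input vector x.

    weights[j][i] multiplies x[i] in linear piece j; biases[j] is its offset.
    """
    return max(
        sum(xi * wi for xi, wi in zip(x, w)) + b
        for w, b in zip(weights, biases)
    )


def relu_via_maxout(x, w, b):
    """ReLU as a k = 2 MaxOut with one piece forced to output zero."""
    zero_piece = [0.0] * len(x)
    return maxout(x, [w, zero_piece], [b, 0.0])


x = [1.0, 2.0]
three_piece = maxout(x, [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.5, 0.0])
clipped = relu_via_maxout(x, [-1.0, -1.0], 0.0)    # linear piece is -3
passed = relu_via_maxout(x, [1.0, 1.0], 0.5)       # linear piece is 3.5
```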

Architectural Selection
For the MaxOut architecture, we utilize two FC layers with MaxOut activation (the first with 256 units, the second with 128 units, both of which have 5 piecewise components in the MaxOut operation), followed by two FC layers with ReLU activations (the first with 64 units, the second with 25 units), followed by a FC sigmoid layer for classification. We found that the He-uniform initialization [35] for the initial MaxOut layer weights was needed in order to train the network, which we suspect is due to the sparsity of the jet-image input. In cases where other initialization schemes were used, the networks often converged to very suboptimal solutions. This network is trained (and evaluated) on un-normalized jet-images using the transverse energy for the pixel intensities.
For the deep convolution networks, we use a convolutional architecture consisting of three sequential [Conv + Max-Pool + Dropout] units, followed by a local response normalization (LRN) layer [8], followed by two fully connected, dense layers. We note that the convolutional layers used are so-called "full" convolutions, i.e., zero padding is added to the input pre-convolution. Our architecture can be succinctly written as Eq. (4.1). The convolution layers each utilize 32 feature maps, or filters, with filter sizes of 11 × 11, 3 × 3, and 3 × 3, respectively. All convolution layers are regularized with the $L^2$ weight matrix norm. A down-sampling of (2, 2), (3, 3), and (3, 3) is performed by the three max pooling layers, respectively. A dropout [8] of 20% is used before the first FC layer, and a dropout of 10% is used before the output layer. The FC hidden layer consists of 64 units.
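To give a feel for how the stated filter and pooling sizes shape the feature maps, the sketch below propagates a side length through the three convolution and pooling stages. It rests on assumptions that are not stated in the text: a hypothetical 25 × 25 input image, "full" convolutions that grow the side length by k − 1, and pooling that rounds down. It is an illustration, not a description of the actual implementation.

```python
def conv_pool_shapes(input_size, filter_sizes=(11, 3, 3), pool_sizes=(2, 3, 3)):
    """Side lengths after each (full convolution, max-pool) stage.

    Returns a list of (size_after_conv, size_after_pool) tuples.
    """
    size = input_size
    shapes = []
    for k, p in zip(filter_sizes, pool_sizes):
        after_conv = size + k - 1     # zero-padded "full" convolution
        after_pool = after_conv // p  # assumed floor rounding in pooling
        shapes.append((after_conv, after_pool))
        size = after_pool
    return shapes


shapes = conv_pool_shapes(25)  # hypothetical 25 x 25 input
n_filters = 32                 # feature maps per convolution layer (from the text)
flattened = n_filters * shapes[-1][1] ** 2  # inputs to the 64-unit FC layer
```

Under these assumptions the side length evolves as 25 → 35 → 17 → 19 → 6 → 8 → 2, so the dense layers would receive 32 × 2 × 2 = 128 inputs.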
After early experiments with the standard 3 × 3 filter size, we found significantly worse performance than with a more basic MaxOut [7] feedforward network. After further investigation into larger convolutional filter sizes, we discovered that larger-than-normal filters work well in our application. Though such large filters are not common in the Deep Learning community, we hypothesize that this larger filter size is helpful when dealing with sparse structures in the input images. In Table 1, we compare different filter sizes, finding the optimal filter size to be 11 × 11 when considering the Area Under the ROC Curve (AUC) metric, based on the ROC curve outlined in Sections 3 and 5.
Two convolution networks, which differ in their pre-processing, are studied in this paper. The first, which we refer to as the ConvNet, is trained (and evaluated) on un-normalized jet-images using the transverse energy for the pixel intensities. The second, which we refer to as ConvNet-Norm, is trained (and evaluated) on $L^2$ normalized jet-images using the transverse energy for the pixel intensities. Examining the performance of both networks allows us to study the possible effects of normalization in the pre-processing.

Implementation and Training
All Deep Learning experiments were conducted in Python with the Keras [36] Deep Learning library, utilizing NVIDIA C2070 graphics cards. One GPU was used per training, but several architectures were trained in parallel on different GPUs to optimize the performance of networks with different hyper-parameters.
We used 8 million training examples, with an additional 2 million validation samples for tuning the hyper-parameters, and 3 million testing samples. Signal examples are weighted such that the total sum of weights is the same as the total number of background examples (as explained in Section 2). These weights are used by the cost function in the training and in the ROC curve computations of the test samples. The networks were trained with the Adam [37] algorithm (Stochastic Gradient Descent with Nesterov Momentum [38] was also examined, but did not provide performance gains). The training consisted of 100 epochs, with a 10-epoch patience parameter on the increase in AUC between 0.2 and 0.8 on a validation set. Batch sizes of 32 were used for the MaxOut network, while batch sizes of 96 were used for the convolution networks.
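The stopping rule can be illustrated with a short sketch. It assumes that a single validation score per epoch (for instance the partial AUC mentioned above) is available and that training stops once that score has not improved for `patience` consecutive epochs; the function name and the toy scores are hypothetical, and the actual training loop is handled by Keras.

```python
def stop_epoch(validation_scores, patience=10, max_epochs=100):
    """Return (best_epoch, last_epoch_run) for patience-based early stopping.

    validation_scores: one score per epoch, larger is better.
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(validation_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, last_epoch


# Toy scores: improvement for three epochs, then a plateau.
toy_scores = [0.5, 0.6, 0.7] + [0.65] * 20
best_epoch, last_epoch = stop_epoch(toy_scores)
```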

Analysis and Visualization
In this section, we examine the performance of the MaxOut and Convolution deep neural networks, described in Section 4, in classifying boosted W± → qq from QCD jets. As one of our primary goals is to understand what these NNs can learn about jet topology for discrimination, we focus on a restricted phase space of the mass and transverse momentum of the jets. In particular, we restrict our studies to 250 GeV ≤ $p_T$ ≤ 300 GeV, and confine ourselves to a 65 GeV ≤ m ≤ 95 GeV mass window that contains the peak of the W. We also perform studies in which the discrimination power of the most discriminating physics variables has been removed, either through sample weighting or highly restrictive phase space selections, which allows us to focus on information learned by the networks beyond such known physics variables. In this way, we construct a scaffolded and multi-approach methodology for understanding, visualizing, and validating neural networks within this jet-physics study, though these approaches could be used broadly.
The primary figure of merit used to compare the performance of different classifiers is the ROC curve. The ROC curves allow us to examine the entire spectrum of trade-off between Type-I and Type-II errors, as many applications of such classifiers will choose different points along the trade-off curve.
Since the classifier output distributions are not necessarily monotonic in the signal-to-background ratio, for each classifier we compute the signal-to-background likelihood ratio. The ROC curves are computed by applying a threshold to the classifier output likelihood ratio, and plotting the inverse of the fraction of background jets passing the threshold (the background rejection) versus the fraction of signal events passing the threshold (the signal efficiency). We say that a classifier is strictly more performant if the ROC curve is above a baseline for all efficiencies. In decision theory, this is often referred to as domination (i.e. one classifier dominates another). It should be noted that any weights used to modify the distributions of jets (e.g. the $p_T$ weighting described in Section 2) are also used when computing the ROC curves.
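The likelihood-ratio ROC construction can be sketched with binned classifier outputs. The sketch assumes outputs in [0, 1], a fixed number of equal-width bins, and a histogram estimate of the likelihood ratio; these choices and all names are illustrative assumptions rather than the procedure used for the figures.

```python
def likelihood_ratio_roc(sig_out, bkg_out, n_bins=10, sig_w=None, bkg_w=None):
    """ROC points (signal efficiency, 1 / background efficiency).

    Bins of the classifier output are ordered by their signal-to-background
    ratio and accumulated from the most signal-like bin downwards, which is
    equivalent to thresholding the binned likelihood ratio.
    """
    sig_w = sig_w or [1.0] * len(sig_out)
    bkg_w = bkg_w or [1.0] * len(bkg_out)

    def histogram(values, weights):
        counts = [0.0] * n_bins
        for v, w in zip(values, weights):
            counts[min(int(v * n_bins), n_bins - 1)] += w
        return counts

    s = histogram(sig_out, sig_w)
    b = histogram(bkg_out, bkg_w)
    s_total, b_total = sum(s), sum(b)

    def ratio(i):
        return s[i] / b[i] if b[i] > 0 else float("inf")

    points = []
    s_cum = b_cum = 0.0
    for i in sorted(range(n_bins), key=ratio, reverse=True):
        if s[i] == 0 and b[i] == 0:
            continue  # empty bin carries no information
        s_cum += s[i]
        b_cum += b[i]
        rejection = b_total / b_cum if b_cum > 0 else float("inf")
        points.append((s_cum / s_total, rejection))
    return points


roc = likelihood_ratio_roc(
    [0.95, 0.95, 0.85, 0.25],   # toy signal outputs
    [0.15, 0.15, 0.25, 0.95],   # toy background outputs
)
```

Per-jet weights, such as the $p_T$ weights, enter through the optional `sig_w` and `bkg_w` arguments.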
For information exploration, several techniques were used:
• ROC Curve Comparisons to Multi-Dimensional Likelihood Ratios: By combining several physics-inspired variables and computing their joint likelihood ratio, we can explore the difference between such multi-dimensional likelihood ratios and the neural networks' performance. We also compute the joint likelihood ratio of the neural network output and physics-inspired variables. If such joint classifiers do not improve upon the neural network performance, then we can consider the information in the physics-inspired variable (conditioned on the neural network output) as having been learned by the neural network. If the joint classifier shows improved performance over the neural network, then the neural network has not completely learned the information contained in the physics-inspired variable.
• Convolution Filters: For convolution neural networks, we display the weights of the 11 × 11 filters as images. These filters show how discrimination information is distributed throughout patches of the jets and give a view of the higher-level representations learned by the network. However, such filters are not always easy to interpret, and thus we also convolve each filter with a set of signal and background jet-images. We then examine the difference between the convolution output on the average signal jet-images and average background jet-images. These differences give deeper insight into how the filters act on the jets to accentuate discriminating information.
• Joint and Conditional Distributions: We examine the joint and conditional distributions of various physics-inspired features and the neural network outputs. If the conditional distribution of the physics variable v given the neural network output O is not independent of the neural network output, i.e. if P(v|O) = P(v) does not hold for all O, then we consider the network to have learned information about this physics feature.
• Average, Difference, and Fisher Jet-Images: We examine average images for signal and background and their differences, as well as the Fisher Jets. This is particularly illuminating when we select jets with specific values of highly discriminating physics-inspired variables. This allows us to explore discriminating information contained in the jet images beyond the physics-inspired variables.
• Neural Network Correlations per Pixel: We compute the linear correlations (i.e. the Pearson correlation coefficient) between the neural network output and the distributions of intensity in each pixel. This allows for a visualization of how the discriminating information learned by the neural network is distributed throughout the jet. These visualizations are an approximation to the neural network discriminator and can be used to aid the development of new physics-inspired variables (much like the Fisher Jet visualization).
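The per-pixel correlation technique in the last item can be sketched as follows. Images are assumed to be flattened into lists of pixel intensities, and pixels with no variation across the sample are assigned a correlation of zero, which is a convention of this sketch; the names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant pixel or constant output: define as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def pixel_correlations(images, outputs):
    """Correlation of each pixel's intensity with the network output.

    images: list of flattened jet images; outputs: one network output each.
    """
    n_pixels = len(images[0])
    return [
        pearson([image[p] for image in images], outputs)
        for p in range(n_pixels)
    ]


toy_images = [[0.0, 1.0, 5.0], [1.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
toy_outputs = [0.1, 0.5, 0.9]
correlations = pixel_correlations(toy_images, toy_outputs)
```

Reshaping the resulting list back into the image grid gives a map of where intensity is positively or negatively correlated with the network output.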
The performance evaluation and information exploration techniques are examined in three settings, all of which require the aforementioned mass and transverse momentum selection.
1. General Phase Space: No alterations are made to the phase space. This gives an overview of the performance and information learned by the networks.
2. Uniform Phase Space: The weight of each jet is altered such that the joint distributions of mass, n-subjettiness, and $p_T$ are non-discriminative. Specifically, we derive weights such that:
Both the weighting and network evaluation are performed in a slightly more restricted phase space requiring $\tau_{21} \in [0.2, 0.8]$. While $p_T$ is weighted in all phase space settings, mass and n-subjettiness are also weighted in this setting as they are amongst the most discriminating physics-inspired variables. This weighting ensures that mass, n-subjettiness, and $p_T$ do not contribute to differences between signal and background, and thus this information is essentially removed from the discrimination power of the samples. This allows us to examine what information beyond these variables has been learned and to understand where the neural network performance improvements beyond these physics-derived variables come from. Neural networks that are trained in the General Phase Space are applied as the discriminant under this "flattening" transformation. We also use the training weights inside this window to train an additional convolution network. We look for increases in performance that would indicate information learned beyond the information contained in the weighted physics variables.
3. Highly Restricted Phase Space: This highly restricted window provides a different method to effectively remove the discrimination power of mass, n-subjettiness, and $p_T$, as there is little to no variation of the variables in this phase space for either signal or background. Thus, any discrimination improvements of the neural networks over the physics-inspired variables would be coming from information learned beyond these variables. While the weighting in the Uniform Phase Space is designed also to remove such discrimination, it produces a non-physical phase space. The Highly Restricted Phase Space allows us to ensure that the neural network performance improvements are valid and transferable to a less contrived phase space.
By examining the performance of the neural networks in these different phase spaces, we aim to systematically remove known discriminative information from the networks' performance and thereby probe the information learned beyond what is already known by physics inspired variables.

Studies in the General Phase Space
In order to compare the overall discrimination performance of the DNNs to that of the physics-driven variables, we examine the ROC curves in Figure 6. In particular, we compare the DNNs to n-subjettiness [30] $\tau_{21} = \tau_2/\tau_1$, the jet mass, and the distance ∆R between the two leading $p_T$ subjets.
In Figure 6a, we can see that the three DNNs have similar performance, but the MaxOut network outperforms the ConvNet networks. We suspect that the MaxOut outperforms the ConvNets due to sparsity of the jet-images, whereby the MaxOut network views the full jet-image from the initial hidden layer while the sparsity tends to make it difficult for the ConvNets to learn meaningful convolution filters. We also see that the ConvNet-Norm outperforms the ConvNet trained on the un-normalized jet-images. We observe that the classification performance of the ConvNet discriminant is highest when jet images are normalized, despite the fact that image normalization destroys jet mass information from the images. As we will see soon, it is difficult for these networks to fully learn the jet mass, so the lack of mass information from pre-processing does not necessarily lead to worse discrimination performance.
On the other hand, normalization has an impact on the ability to effectively train the ConvNet on jet images. Finally, we see that the DNNs significantly improve the discrimination power relative to the Fisher-Jet discriminant (footnote 5), as described in reference [4]. In addition, in Figure 6b we see that the DNNs also outperform the two-variable combinations of the physics-inspired variables (computed using the 2D likelihood ratio, footnote 6). It is interesting to note that combining mass and τ_21, or τ_21 and ∆R, achieves much higher performance than the individual variables and is significantly closer to the performance of the DNNs. However, the large difference in performance between the DNNs and the physics-variable combinations implies that the DNNs are learning information beyond these physics variables.
While we can see in Figure 6 that the DNNs outperform the individual and two-variable physics-inspired discriminators, we want to understand whether these physics variables have been learned by the networks. As such, we compute the combination of the DNNs with each of the physics-inspired variables (using the 2D likelihood), as seen for the ConvNet in Figure 7a and for the MaxOut network in Figure 7b. In both cases, we see that the discriminators combining ∆R or τ_21 with the DNNs do not improve performance. This indicates that the discriminating information in these variables relevant for the classification task has already been fully learned by the networks (footnote 7). However, adding mass in combination with the DNNs shows a noticeable improvement in performance over the DNNs alone. This indicates that not all of the discriminating information relevant for jet tagging contained in the mass variable has been learned by the DNNs. While it is not shown, similar patterns are found for the ConvNet-Norm network.
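A two-variable likelihood combination of this kind can be sketched with a binned estimate, as below. This is an illustrative implementation under stated assumptions: the function name `likelihood_ratio_score`, the bin count, and the choice to return s/(s+b) are not from the analysis. The returned score is monotonic in the binned likelihood ratio s/b, so thresholding on it gives the same ROC curve.

```python
import numpy as np

def likelihood_ratio_score(x_sig, y_sig, x_bkg, y_bkg, bins=25):
    """Build a binned 2D likelihood discriminant from signal and background samples.

    Returns a function mapping arrays (x, y) to s / (s + b), where s and b are
    the binned signal and background densities. Bins empty in both samples
    are assigned the neutral value 0.5.
    """
    x_sig, y_sig = np.asarray(x_sig, dtype=float), np.asarray(y_sig, dtype=float)
    x_bkg, y_bkg = np.asarray(x_bkg, dtype=float), np.asarray(y_bkg, dtype=float)

    # Common binning covering both samples.
    x_all = np.concatenate([x_sig, x_bkg])
    y_all = np.concatenate([y_sig, y_bkg])
    ranges = [(x_all.min(), x_all.max()), (y_all.min(), y_all.max())]

    h_sig, x_edges, y_edges = np.histogram2d(x_sig, y_sig, bins=bins, range=ranges, density=True)
    h_bkg, _, _ = np.histogram2d(x_bkg, y_bkg, bins=[x_edges, y_edges], density=True)

    total = h_sig + h_bkg
    score = np.full_like(total, 0.5)
    np.divide(h_sig, total, out=score, where=total > 0)

    def discriminant(x, y):
        # Look up the bin of each (x, y) pair, clipping values at the outer edges.
        ix = np.clip(np.digitize(x, x_edges) - 1, 0, len(x_edges) - 2)
        iy = np.clip(np.digitize(y, y_edges) - 1, 0, len(y_edges) - 2)
        return score[ix, iy]

    return discriminant
```

Passing a DNN output as one of the two inputs and a physics variable as the other corresponds to the combined discriminants discussed above.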
The conditional distributions between the DNN output and the physics variables are shown in Figure 8a for the ConvNet network against the jet mass, ∆R, and τ_21. These distributions are normalized in bins of the DNN output, and thus the z-axis shows a discretized estimate of the conditional probability density of a physics variable value given the network output (i.e. Pr(variable | network output)). Normalizing the distributions in this way allows us to see the most probable values of the physics variables at each point of the network output, without being affected by the overall distribution of jets in this 2D space. There is a strong non-linear relationship between the network output and both τ_21 and ∆R, giving further evidence that this information has been learned by the network. However, the correlations are much
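The normalization in bins of the network output can be sketched as follows. The function name `conditional_histogram`, the array layout, and the bin counts are illustrative assumptions.

```python
import numpy as np

def conditional_histogram(variable, network_output, bins=(20, 20)):
    """Binned estimate of Pr(variable | network output).

    Returns cond[i, j], where i indexes bins of the physics variable and j
    indexes bins of the network output. Each non-empty column sums to one.
    """
    counts, var_edges, out_edges = np.histogram2d(variable, network_output, bins=bins)
    # Total number of jets in each network-output bin.
    column_totals = counts.sum(axis=0, keepdims=True)
    cond = np.zeros_like(counts)
    np.divide(counts, column_totals, out=cond, where=column_totals > 0)
    return cond, var_edges, out_edges
```

Because every slice of network output is normalized separately, sparsely populated output bins are shown on the same footing as densely populated ones.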

Understanding what is learned
In order to gain a deeper understanding of the physics learned by the DNNs, in this section we examine how the internal structure of the network relates to the substructure and properties of W bosons versus QCD jets.
In Figure 9a, we show the first-layer 11×11 convolutional filters learned by the ConvNet-Norm network. Each filter is visualized by showing the learned weight in each position of the filter W_ij from Section 4. We can see that there is variation between filters, indicating that they are learning different features of the jet-images, but this variation is not as large as seen in many CV problems due to the sparsity of the jet-images. We also see that they tend to learn representations of the subjets and distances between subjets, as seen by the circular features found in many of the filters.
To get a better understanding of how these filters provide discrimination, we mimic the operation in the first layer of the network by convolving each filter with the average of large samples of signal and background jet images. The difference between the convolved average signal and background jet-images helps to provide an understanding of what difference in features the network learns at the first layer in order to help discriminate.
More formally, let

$$\bar{J}_{\text{signal}} = \frac{1}{N_{\text{signal}}} \sum_{i \in \text{signal}} J^{(i)}, \qquad \bar{J}_{\text{bkg}} = \frac{1}{N_{\text{bkg}}} \sum_{i \in \text{bkg}} J^{(i)}$$

represent the average signal and background jet over a sample, where J^(i) is the ith jet image and N_signal and N_bkg are the numbers of images in each sample. In addition, we can select a filter w_i ∈ R^{11×11} from the first convolutional layer. We then examine the differences in the post-convolution layer by computing

$$\bar{J}_{\text{signal}} * w_i - \bar{J}_{\text{bkg}} * w_i,$$

where * is the convolution operator. We arrange these new "convolved jet-images" in a grid, and show in red regions where signal has a stronger representation, and in blue where background has a stronger representation. In Figure 9b, we show the convolved differences described above, where each (i, j) image is the representation under the (i, j) convolutional filter. We note the existence of interesting patterns around the regions where the leading and subleading subjets are expected to be. We also draw attention to the fact that there is a large diversity in the convolved representations, indicating that the DNN is able to learn and pick up on multiple features that are descriptive.

A related way to visualize the information learned by various nodes in the network is to consider the jet images which most activate a given node. Fig. 10 shows the average of the 500 jet images with the highest node activation for the last hidden layer of the MaxOut network (the layer before the classification layer). The first row of images in Fig. 10 shows clear two-prong signal-like structure, whereas the second and third rows show one-prong diffuse radiation patterns that are more background-like. The remaining rows have a variety of ∆R distances between subjets and have a mix of background and signal-like features.
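The convolved difference between average signal and background images can be sketched as below. This is a minimal illustration, not the code used for Figure 9b: the helper names are hypothetical, only the fully overlapping ("valid") region is computed, and the kernel is flipped as in the mathematical definition of convolution. Deep learning libraries often implement cross-correlation instead, which omits the flip.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Plain 2D convolution restricted to the fully overlapping region."""
    flipped = kernel[::-1, ::-1]  # flip the kernel, as in the mathematical definition
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

def convolved_difference(signal_images, background_images, filt):
    """Difference of the filter response to the average signal and background images.

    signal_images and background_images have shape (n_images, height, width).
    Positive values mark regions where signal has the stronger representation.
    """
    mean_sig = np.mean(signal_images, axis=0)
    mean_bkg = np.mean(background_images, axis=0)
    return convolve2d_valid(mean_sig, filt) - convolve2d_valid(mean_bkg, filt)
```

Since convolution is linear, the result equals the filter applied to the difference of the two average images.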

Physics in Deep Representations
To get a tangible and more intuitive understanding of what jet structures a DNN learns, we compute the correlation of the DNN output with each pixel of the jet-images. Specifically, let y be the DNN output, and consider the intensity of each pixel I_ij in transformed (η, φ) space. We then construct an image, which we denote the deep correlation jet-image, where each pixel (i, j) is ρ_{I_ij, y}, the Pearson Correlation Coefficient of the pixel's intensity with the final DNN output, across images. While this image does not give a direct view of the discriminating information learned within the network, it does provide a guide to how such information may be contained within the network. In Figure 11, we construct this deep correlation jet-image for both the ConvNet and the MaxOut networks. We can see that the location and energy of the subleading subjet, found at the bottom of the image, is highly correlated with the DNN output and important for identifying signal jet-images. In contrast, the information contained in the leading subjet, seen at (x, y) ∼ (0, 0) in the image, is not particularly correlated with the network output, owing to the fact that both signal and background jets have high-energy leading subjets. We also see asymmetric regions around both subjets that are correlated with the DNN output, indicating the presence of additional radiation expected in the QCD background jets. Finally, a small negative correlation with the rest of the jet area is seen, indicating that radiation from the background jets is more likely to be observed in these regions. The exact functional form of this distribution is not known, nor does it seem to correspond exactly to any known physics-inspired variable.
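The deep correlation jet-image can be sketched directly from its definition. The function name and array layout below are assumptions for illustration. Pixels that never vary across the sample have an undefined correlation, and this sketch sets them to zero.

```python
import numpy as np

def deep_correlation_image(images, outputs):
    """Per-pixel Pearson correlation between pixel intensity and network output.

    images has shape (n_images, height, width); outputs has shape (n_images,).
    """
    images = np.asarray(images, dtype=float)
    outputs = np.asarray(outputs, dtype=float)

    pix_centered = images - images.mean(axis=0)
    out_centered = outputs - outputs.mean()

    # Covariance of each pixel with the output, summed over the image index.
    cov = np.tensordot(out_centered, pix_centered, axes=(0, 0)) / len(outputs)
    denom = images.std(axis=0) * outputs.std()

    corr = np.zeros_like(cov)
    np.divide(cov, denom, out=corr, where=denom > 0)  # constant pixels stay at zero
    return corr
```

The result is a single image with values in [-1, 1] that can be drawn on the same axes as the jet-images themselves.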

Studies in the Uniform Phase Space
An important part of the investigation into what the neural networks are learning beyond the standard physics features is to quantify the performance when these features are removed. This represents the unique information learned by the network. One way to remove the discrimination power from a given feature is to apply a transformation such that the marginal likelihood ratio is constant at unity. In other words, we derive event-by-event weights such that

$$\frac{f(m, \tau_{21} \mid \text{signal})}{f(m, \tau_{21} \mid \text{background})} = 1, \tag{5.3}$$

where f(X|Y) is the probability density function of X given Y. This is done practically by binning the mass and τ_21 distributions and then assigning to each event a weight given by the inverse bin content corresponding to the jet mass and τ_21 of that particular event. Figure 12 shows the ROC curve for various features with this weighting scheme applied. By construction, τ_21 and the jet mass do not have any discrimination power between signal and background, evident from the fact that their curves lie along the random-guess line, where the background efficiency equals the signal efficiency. However, the convolutional network that is trained inclusively (without the weights from Equation 5.3) does have some discrimination power when the weights from Equation 5.3 are applied. For a fixed signal efficiency, the overall performance is significantly degraded with respect to the un-weighted ROC curve in Figure 6, but the improvement over a random guess is significant. Interestingly, the network performance is significantly better in this re-weighted setting when the same weighting is applied during training (effort by the network is not needed to learn τ_21, for instance).
The ConvNet and MaxOut networks trained inclusively have similar performance.
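The inverse-bin-content weighting can be sketched as follows, applied separately to the signal and background samples so that both weighted (mass, τ_21) distributions become flat and their ratio is constant. The function name and the bin counts are illustrative assumptions; the default ranges are taken from the window quoted in the caption of Figure 12. Jets outside the window receive zero weight.

```python
import numpy as np

def flattening_weights(mass, tau21, n_bins=(10, 12),
                       mass_range=(65.0, 95.0), tau21_range=(0.2, 0.8)):
    """Per-jet weight equal to the inverse content of the jet's (mass, tau21) bin."""
    mass = np.asarray(mass, dtype=float)
    tau21 = np.asarray(tau21, dtype=float)
    m_edges = np.linspace(mass_range[0], mass_range[1], n_bins[0] + 1)
    t_edges = np.linspace(tau21_range[0], tau21_range[1], n_bins[1] + 1)

    # Bin index of each jet along both axes.
    im = np.digitize(mass, m_edges) - 1
    it = np.digitize(tau21, t_edges) - 1
    inside = (im >= 0) & (im < n_bins[0]) & (it >= 0) & (it < n_bins[1])

    # Count the jets falling in each bin.
    counts = np.zeros(n_bins)
    np.add.at(counts, (im[inside], it[inside]), 1.0)

    weights = np.zeros(len(mass))
    weights[inside] = 1.0 / counts[im[inside], it[inside]]
    return weights
```

With these weights, every occupied bin has a total weight of one in each sample, so a discriminant built only from mass and τ_21 cannot separate the two.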
Figure 11 already suggested that information about colorflow is contributing to the performance of the tagger, since the signal is a color singlet and the background is predominantly a color octet (gluon). The radiation pattern in the former case is expected to be concentrated between the subjets of the jet and in the latter case around the subjets. One variable designed [39] and recently shown [40] to be sensitive to the colorflow is the jet pull angle, θ_P(j_1, j_2), for jets j_1 and j_2. The jet pull vector of a jet j is given by

$$\vec{v}_p^{\,j} = \sum_{i \in j} \frac{p_T^{i} \, |r_i|}{p_T^{j}} \, \vec{r}_i,$$

where i runs over the jet's constituents and r_i is the vector in (y, φ) that points from the jet axis to the constituent i. The pull angle θ_P(j_1, j_2) is the angle the pull vector of jet j_1 makes with respect to the vector in (y, φ) pointing from the j_1 jet axis to the j_2 jet axis. Note that θ_P(j_1, j_2) ≠ θ_P(j_2, j_1) because the former uses the substructure of j_1 and the latter uses the substructure of j_2. We adapt the pull angle to the case of large-radius trimmed jets by using the leading (J) and subleading (j) subjets. The red and blue dashed lines in Fig. 12 show that a significant fraction of the DNN performance can be explained by colorflow information contained within the jet pull angles. However, especially for the network trained with the weights, the DNN performance is also significantly better than that of the jet pull angles.
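A minimal sketch of the pull angle, following the standard definition of the pull vector, is given below. The function name, the argument layout, and the convention of returning a signed angle in [-π, π) are assumptions made for illustration.

```python
import math

def pull_angle(constituents, axis1, pt1, axis2):
    """Pull angle of jet j1 with respect to jet j2.

    constituents: iterable of (pt, y, phi) for the constituents of j1
    axis1, axis2: (y, phi) of the j1 and j2 jet axes
    pt1: transverse momentum of j1
    """
    def delta_phi(a, b):
        # Wrap the azimuthal difference into [-pi, pi).
        return (a - b + math.pi) % (2 * math.pi) - math.pi

    # Pull vector: sum over constituents of pt_i * |r_i| / pt_jet * r_i.
    v_y = 0.0
    v_phi = 0.0
    for pt, y, phi in constituents:
        r_y = y - axis1[0]
        r_phi = delta_phi(phi, axis1[1])
        r = math.hypot(r_y, r_phi)
        v_y += pt * r * r_y / pt1
        v_phi += pt * r * r_phi / pt1

    # Vector pointing from the j1 axis to the j2 axis.
    c_y = axis2[0] - axis1[0]
    c_phi = delta_phi(axis2[1], axis1[1])

    angle = math.atan2(v_phi, v_y) - math.atan2(c_phi, c_y)
    return (angle + math.pi) % (2 * math.pi) - math.pi
```

Only the constituents of the first jet enter the calculation, which is why swapping the two jets generally gives a different angle.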
One can gain intuition about the unique information learned by the network by studying the correlation of the network output and the pixel intensities with the Equation 5.3 weights applied. This is shown in Figure 13 with and without the weights applied during training. The two correlation plots are qualitatively similar, but the region to the right of the subjets is more enhanced when the weights are applied during the training. This suggests that information about radiation surrounding the subjets contains important discrimination power contributing to the network's unique information.

Studies in the Highly Restricted Phase Space
Another way to quantify the unique information learned by the network, which also provides useful information about the physics learned by the network, is to restrict the considered phase space such that the τ_21 and jet mass distributions do not vary appreciably over the reduced space. Figure 14 shows the average signal and background jet image in three small windows of τ_21, jet mass, and jet p_T. In all three windows, the jet mass is restricted to be between 79 GeV and 81 GeV and the jet p_T is required to be in the interval [250, 260] GeV. The three windows are then defined by their value of τ_21: [0.19, 0.21] in the most two-prong-like case, [0.39, 0.41] in a region with likelihood ratio near unity, and [0.59, 0.61] in a mostly one-prong-like case. The key physics features of the jets falling in these windows are easily visualized from the average jet images. The most striking observation is that in these three windows, signal jets look very similar to background jets. When τ_21 ∈ [0.19, 0.21], both signal and background jets have a second subjet that is distinct from the leading subjet, which becomes less prominent as the value of τ_21 increases.
The differences between images in these small windows tell us about what information could be learned by the networks beyond τ_21 and the jet mass. Since the differences are subtle, the average difference is explicitly computed and plotted in Figure 15 for the three narrow windows of τ_21. In the window with τ_21 ∈ [0.19, 0.21], there are five features: a localized blue patch in the bottom center, a localized red patch just above that, a red diffuse region between the red patch and the center, and then a blue dot just left of center surrounded by a red shell to the right. Each of these has a physics meaning: the lower two localized patches give information about the orientation of the second subjet (∆R), which is slightly larger for the QCD jets, since they need a slightly wider angle to satisfy the mass requirement. The red diffuse region just above the localized patches is likely an indication of colorflow as introduced earlier: the W bosons are color singlets compared to the color octet gluon jet background, and thus we expect the radiation pattern to be mostly between the two subjets for the W. One can draw similar conclusions for all the features in each of the plots in Figure 15.

Now, we turn back to the neural networks and their performance in these small windows of jet mass and τ_21. Figure 16 shows three ROC curves in the window τ_21 ∈ [0.19, 0.21]. By construction, the τ_21 and jet mass curves are not much better than a random guess, since these variables do not significantly vary over the small window. The other curves show the performance of ∆R and the ConvNet and MaxOut neural networks trained inclusively, which have similar performance to each other. As in the previous section, this allows us to quantify the unique information in the neural network. Figure 16 also includes the jet pull angle introduced in the context of Fig. 12. As with the earlier figure, the jet pull angles do provide useful discriminating information in this small region of phase space, but cannot account for the entire performance from the DNNs.
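The window selection and the signal-minus-background average image described above can be sketched as below. The function names and the boolean `is_signal` label array are illustrative assumptions; the window values in the usage note are those quoted in the text.

```python
import numpy as np

def window_mask(mass, pt, tau21, mass_range, pt_range, tau21_range):
    """Boolean mask selecting jets inside a narrow (mass, pT, tau21) window."""
    mass, pt, tau21 = (np.asarray(a, dtype=float) for a in (mass, pt, tau21))
    return ((mass >= mass_range[0]) & (mass <= mass_range[1])
            & (pt >= pt_range[0]) & (pt <= pt_range[1])
            & (tau21 >= tau21_range[0]) & (tau21 <= tau21_range[1]))

def average_difference(images, is_signal, mask):
    """Average signal image minus average background image for the selected jets.

    images has shape (n_images, height, width); is_signal and mask are
    boolean arrays of length n_images.
    """
    images = np.asarray(images, dtype=float)
    is_signal = np.asarray(is_signal, dtype=bool)
    mean_sig = images[mask & is_signal].mean(axis=0)
    mean_bkg = images[mask & ~is_signal].mean(axis=0)
    return mean_sig - mean_bkg
```

For example, `window_mask(mass, pt, tau21, (79, 81), (250, 260), (0.19, 0.21))` selects the most two-prong-like window, and positive pixels in the returned difference are those where signal deposits more intensity on average.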
One way to visualize the unique information is to look at the per-pixel correlation between the intensity and neural network output (Figure 17). The physical interpretation of the red and blue areas in Figure 17 is related to the colorflow of W and background jets. The area in between the subjets should have more radiation than the area around and outside of the subjets for W jets, and vice versa for QCD jets. While Figure 17 is not directly the discriminant used in the network and only represents linear correlations with the network output, it does show non-linear spatial information and gives a sense of where in the image the network is looking for discriminating features. Some of this information is contained in the jet pull angles, but the DNN must be learning additional information (Fig. 16).

Outlook and Conclusions
Jet Images are a powerful paradigm for visualizing and classifying jets. We have shown that when applied directly to jet images, deep neural networks are a powerful tool for identifying boosted hadronically decaying W bosons from QCD multijet processes. These advanced Computer Vision algorithms outperform several known and highly discriminating engineered physics-inspired features such as the jet mass and n-subjettiness, τ_21. Through a variety of studies, we have shown that some of these features are learned by the network. However, despite detailed studies to preserve the jet mass, this important variable seems to not be fully captured by the neural networks studied in this article. Understanding how to fully learn the jet mass is a goal of our future work.
In this paper, we propose several techniques for quantifying and visualizing the information learned by the DNNs, and connect these visualizations with physics properties. This is studied by removing the information from jet mass and τ_21 through a re-weighting or redaction of the phase space. In this way, we can evaluate the performance of the network beyond these features to quantify the unique information learned by the network. In addition to quantifying the amount of additional discrimination achieved by the network, we also show how the new information can be visualized through the deep correlation jet image, which displays the network output correlation with each input pixel. These visualizations are a powerful tool for understanding what the network is learning. In this case, colorflow patterns suggest that at least part of the unique information comes from the octet versus singlet nature of W bosons and gluon jets. However, not all of the information is contained in well-known physically motivated color-flow-sensitive features like the jet pull angle. The visualizations may even be useful in the future for engineering other simple variables which may be able to match the performance of the neural network.
Both ATLAS and CMS have collected and will continue to collect large datasets filled with SM sources of boosted top quarks and W bosons. The collaborations have shown that event selections targeting these objects can be used to determine the systematic uncertainties of both simple and complex jet tagging techniques [9, 41-43]. These techniques can be readily adapted for the jet images DNN tagger as a first step toward applying the tools developed in this paper to improve tagging performance in practice. Additionally, both ATLAS and CMS have achieved a better spatial resolution than their 0.1 × 0.1 hadronic calorimeter granularity. Figures 4 and 6 show that the DNN tagger presented in this paper significantly out-performs the unpixelated jet mass. The DNN tagger would do no worse than its stated performance with 0.1 × 0.1 granularity because one can always downsample the images before processing. With more information available to the network, it is likely the DNN tagger could do even better. Taking into account the non-uniform detector granularity in order to reduce the feature size is therefore an interesting direction of future work in adapting the methods presented here to a particular detector.
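The downsampling argument can be illustrated with a short sketch that merges blocks of fine pixels into coarser ones. The function name and the choice to sum the intensities within each block (which preserves the total image intensity) are assumptions for illustration, not a prescription from the analysis.

```python
import numpy as np

def downsample_image(image, factor):
    """Merge factor x factor blocks of pixels by summing their intensities."""
    image = np.asarray(image, dtype=float)
    height, width = image.shape
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by the factor")
    # Split each axis into (coarse index, position within block) and sum the blocks.
    blocks = image.reshape(height // factor, factor, width // factor, factor)
    return blocks.sum(axis=(1, 3))
```

An image recorded at finer granularity can be reduced in this way to the 0.1 × 0.1 grid before being passed to a network trained at that granularity.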
This edition of the study of jet images has built a new link between particle physics and computer vision by using state-of-the-art deep neural networks for classifying high-dimensional high energy physics data. By processing the raw jet image pixels with these advanced techniques, we have shown that there is a great potential for jet classification. Many analyses at the LHC use boosted hadronically decaying bosons as probes of physics beyond the Standard Model, and the methods presented in this paper have important implications for improving the sensitivity of these analyses. In addition to improving tagging capabilities, further studies with deep neural networks will help us discover new features to improve our understanding and improve upon existing features to fully capture the wealth of information inside jets.

A Image Sparsity
Figure 18 quantifies the sparsity of the jet images by showing the distribution of the pixel occupancy: the fraction of pixels that have a non-zero entry. Also plotted is the fraction of pixels that have at least 1% of the intensity of the scalar sum of the pixel intensities from all pixels. In general, the background has a more diffuse radiation pattern and thus the corresponding jet images have a higher average occupancy.
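Both occupancy definitions can be sketched in a few lines. The function name and the `relative_threshold` parameter are illustrative assumptions.

```python
import numpy as np

def occupancy(images, relative_threshold=0.0):
    """Fraction of active pixels in each image.

    With relative_threshold = 0, a pixel is active if it is non-zero.
    Otherwise a pixel is active if it is non-zero and carries at least that
    fraction of the scalar sum of all pixel intensities in its image.
    images has shape (n_images, height, width).
    """
    images = np.asarray(images, dtype=float)
    n_pixels = images.shape[1] * images.shape[2]
    totals = images.sum(axis=(1, 2), keepdims=True)
    active = (images > 0) & (images >= relative_threshold * totals)
    return active.sum(axis=(1, 2)) / n_pixels
```

Calling `occupancy(images)` gives the non-zero fraction, and `occupancy(images, 0.01)` gives the fraction of pixels with at least 1% of the total intensity.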

B Joint and Marginal Distributions
Figure 19 shows the marginal distributions of the network outputs for signal and background jets. The MaxOut network has a wavy feature in the distribution near 0.5, where the likelihood ratio is unity. In that regime, the network cannot differentiate between signal and background, and in this particular case this results in a non-smooth distribution at the fixed likelihood ratio value.
The joint distributions of the network with the jet mass, τ_21, and the ∆R between subjets are shown in Fig. 20, Fig. 21, and Fig. 22, respectively. The joint distributions between the various combinations of the physics features are shown in Fig. 23 and Fig. 24.

Figure 1: The distributions of the jet mass (top left), τ_21 (top right) and the ∆R between subjets (bottom) for signal (blue) and background (red) jets.

Figure 2: The average jet image for signal W jets (top) and background QCD jets (bottom) before (left) and after (right) applying the rotation, re-pixelation, and inversion steps of the pre-processing. The average is taken over images of jets with 240 GeV < p_T < 260 GeV and 65 GeV < mass < 95 GeV.

Figure 3: The distribution of the image mass after various stages of pre-processing for signal jets (left) and background jets (right). The "No pixelation" line is the jet mass without any detector granularity and without any pre-processing. "Only pixelation" has only detector granularity but no pre-processing, and all subsequent lines have this pixelation applied as well as translation to center the image at the origin. The translation is called naive when the energy is used as the pixel intensity instead of the pixel transverse momentum. "Flip" denotes the parity inversion operation and the p_T^2 norm is an L2 normalization scheme. The naive translation and the L2 normalization image masses are both multiplied by constants so that the centers of the distributions are roughly in the same location as for the other distributions.

Figure 4: The tradeoff between W boson (signal) jet efficiency and inverse QCD (background) efficiency for various pre-processing algorithms applied to the jet images. The "No pixelation" line is the jet mass without any detector granularity and without any pre-processing. "Only pixelation" has only detector granularity but no pre-processing, and all subsequent lines have this pixelation applied as well as translation to center the image at the origin. The translation is called naive when the energy is used as the pixel intensity instead of the pixel transverse momentum. "Flip" denotes the parity inversion operation and the p_T^2 norm is an L2 normalization scheme.
have the learned convolutional filters (left) and the difference in between the average signal and background image after applying the learned convolutional filters (right).This novel difference-visualization technique helps understand what the network learns.

Figure 5: The convolutional neural network concept as applied to jet-images.

3. Highly Restricted Phase Space: The phase space of mass, n-subjettiness, and p_T is restricted to very small windows: m ∈ [79, 81] GeV, p_T ∈ [250, 255] GeV, and τ_21 ∈ [0.19, 0.21]. No weighting (beyond the p_T weighting described in Section 2) is performed, and the networks trained in the General Phase Space are used for discrimination and evaluation.

Figure 6: Left: ROC curves for individual physics-motivated features as well as three deep neural network discriminants. Right: the DNNs are compared with pairwise combinations of the physics-motivated features.

Figure 7: ROC curves that combine the DNN outputs with physics-motivated features for the ConvNet (left) and MaxOut (right) architectures.

Figure 8: Network output versus mass (left), ∆R (middle), and τ_21 (right) for the ConvNet network (MaxOut distributions are similar). Each row is normalized and represents the probability distribution of the variable shown on the x-axis given the network output.

Figure 11: Per-pixel linear correlation with DNN output for the ConvNet (left) and the MaxOut network (right). Signal and background jets are combined.

Figure 12: Various ROC curves with event weights that enforce Eq. 5.3 inside m ∈ [65, 95] GeV, p_T ∈ [250, 300] GeV, and τ_21 ∈ [0.2, 0.8]. By construction, the τ_21 and the likelihood combination of τ_21 and mass are non-discriminating (and are thus equal to a random guess). The ConvNet, MaxOut, and MaxOut-Norm networks are trained without the weights applied and the MaxOut (weighted) line was trained with the weights applied during training.


Figure 13: Pearson Correlation Coefficient for pixel intensity and the convolutional neural network output for W → WZ and QCD (combined), for the MaxOut network trained inclusively and then weighted (left) and for the MaxOut network trained with the weights from Equation 5.3 applied also during the training (right).

Figure 18: The distribution of the fraction of pixels (occupancy) that have a nonzero entry (blue) or at least 1% of the scalar sum of the pixel intensities from all pixels (red).

Figure 19: The marginal distributions of the network outputs for signal and background jets.
Figure 20: The joint distributions of the network outputs with the jet mass.

Figure 21: The joint probability distribution between τ_21 and the ConvNet (left) and MaxOut (right) network outputs for the background.

Figure 22: The joint probability distribution between the ∆R between subjets and the ConvNet (left) and MaxOut (right) network outputs for the background.

Figure 23: The joint probability distribution between jet mass and the ∆R between subjets (left) and τ_21 (right) for the background.