Deep Learning for the Classification of Quenched Jets

An important aspect of the study of Quark-Gluon Plasma (QGP) in ultra-relativistic collisions of heavy ions is the ability to identify, in experimental data, a subset of the jets that were strongly modified by the interaction with the QGP. In this work, we propose studying deep learning techniques for this purpose. Samples of $Z+$jet events were simulated in vacuum and medium and used to train deep neural networks with the objective of discriminating between medium- and vacuum-like jets. Dedicated Convolutional Neural Networks, Dense Neural Networks and Recurrent Neural Networks were developed and trained, and their performance was studied. Our results show the potential of these techniques for the identification of jet quenching effects induced by the presence of the QGP.


Introduction
The experimental observation [1,2] of suppression of high transverse momentum hadrons in ultra-relativistic collisions of heavy ions relative to appropriately scaled proton-proton collisions was a major step in establishing that a novel state of matter, a Quark-Gluon Plasma (QGP), is produced in heavy-ion (HI) collisions. The ability to systematically reconstruct full jets in the presence of the large and fluctuating underlying event of HI collisions, first at the LHC [3][4][5] and later at RHIC [6,7], significantly expanded the scope of the use of jets as tools to understand the inner workings of the QGP they develop within. Studies of QGP-induced jet modifications, commonly referred to as jet quenching, were initially based on global jet properties (e.g. jet total transverse momentum) but have since evolved into detailed studies of increasingly complex observables [8,9] in the most part related to the internal structure of jets (their sub-structure).
While this program has significantly advanced the understanding of the dynamics of jet-QGP interaction, a fundamental difficulty underlies all but a few jet quenching studies. This difficulty can be illustrated by noting that in HI collisions a set of jets with total reconstructed transverse momentum p T within a given range includes jets that have experienced different levels of modification, that is jets that have been quenched to diverse extents. Together with the steeply falling spectrum for jet production, this makes any HI jet sample within any given p T range to be dominated by jets that experienced little or no modification. As such, quenching effects may present themselves as subtle modifications not because quenching is a small effect overall, but rather because jets that were significantly modified are diluted within a sample dominated by those with little modification.
The mitigation of this difficulty requires the ability to compare jets that were born alike rather than those that were detected with the same final reconstructed p T allowing for direct assessment of the modifications experienced by jets.
A path towards such mitigation involves the analysis of jets produced back-to-back with electroweak bosons (γ, Z or W) [10][11][12][13], as in this case the p T of the electroweak boson provides a close proxy to the p T of the parton from which the jet develops. However, these events have limited statistics.
Another possibility is to use data-driven procedures like the one proposed in [14], which allows for the determination of the average p T lost by jets being reconstructed in HI collisions with some final p T , but not of the fluctuations of that energy loss.
More recently, a novel reclustering jet algorithm was proposed to study jet quenching effects in HI collisions [15], allowing to identify jets that have different levels of QGP-induced modifications. This ability to identify within a jet sample a subset of the most modified jets is invaluable to augment the visibility of quenching effects and thus provide a cleaner slate to distinguish specific features of jet-QGP interaction.
In this work, we ask whether Deep Learning (DL) techniques can provide complementary criteria to distinguish strongly modified jets from essentially unmodified ones. A recent study [16] showed that a convolutional neural network (CNN) trained on jet images for jets modified using the strong/weak Hybrid model [17] including effects of medium response [18] allowed for the extraction, on a jet-by-jet basis, of the p T the jet would have had if no QGP was present. These important results rely, at least in significant part, on the presence of a medium response component which is the leading feature identified by the CNN as signalling the strength of quenching effects. However, the medium response remains the most model-dependent component of state-of-the-art [18][19][20] jet quenching simulations and is not inconceivable that features highlighted by the CNN may be model-specific and absent in real data.
Dedicated HI jet observables have been proposed as being more resilient to such medium response component [8], while being, simultaneously, calculable in a pQCD prescription [21]. Nonetheless, some of the currently available jet substructure measurements [22,23] do not enjoy this feature, and part of the unrelated underlying event can lead to effects very similar to those that can be argued to arise from medium response contribution to jets [24]. For these observables, the validation of a procedure that distinguishes quenched jets in HI collisions from those without any medium modification can only be made by comparing HI jets to vacuum jets embedded in the fully uncorrelated background (e.g. built with mixed event techniques [25] or by embedding in real PbPb events without jets [23]) with both samples undergoing the same analysis workflow including background subtraction.
In this work we ask a different, more fundamental, question: whether modifications imparted on jets by the QGP on a perturbative level, that is modifications of the parton shower, are sufficient for DL to attain a satisfactory discriminatory power.
The DL architectures considered in this work are trained without accounting for the medium response, thus maximizing the training on QCD in-medium emissions whose implementation in MC generators is solidly grounded in perturbative QCD. While some model dependence persists, since state-of-the-art jet quenching Monte Carlo event generators implement QGP-induced modifications in different ways, we believe it to be as small as presently possible. The paper is organised as follows: in Section 2, we present our simulation setup and the procedure to prepare the data samples used by the different DL architectures. The DL models used throughout this work and their training results are presented in Section 3. A careful analysis of the DL outputs and their possible interpretation to separate jet quenching effects is done in Section 4. The final conclusions are presented in Section 5.

Simulated data
To understand if Deep Learning can be applied to identify jet quenching effects we will use JEWEL v2.2.0 [19], a Monte Carlo event generator that accounts for medium-induced effects during the QCD parton shower evolution. Since the main goal is to identify medium-induced modifications to the parton shower structure, medium recoils (the particular implementation of QGP response in JEWEL) are not considered in this work.
We use the simple, parametrised medium described in detail in [19] with settings tuned to T = 440 MeV and τ i = 0.4 fm, while the remaining parameters were set to the default values. These are known to reproduce current jet energy loss experimental observations even without medium recoil effects [26]. From 10 6 weighted Z+jet hadronic events at a √ s N N = 5.02 TeV whose particles are required to have a minimum transverse 1 momentum of p min T,part = 500 MeV, we reconstruct the Z-boson from the pair of muons that result into a reconstructed object with a minimum transverse momentum of p min T,Z = 90 GeV and mass m Z ∈ [75; 105] GeV. The anti-k T [27] reconstructed jet with R = 0.5 and minimum transverse momentum of p min T,jet = 30 GeV is required to be within an azimuthal angle with respect to the reconstructed Z-boson of |∆φ| = |φ Z − φ jet | > 7π/8 and to have an absolute pseudo-rapidity of |η jet | < 1.0 to avoid projection effects in the resulting jet image. All jet reconstruction procedures were performed within the FastJet package [28].
The resulting transverse momentum ratio between the jet and Z-boson, x jZ = p T,jet /p T,Z , p T,jet and jet multiplicity, n const are shown in fig. 1 and fig. 2 for PYTHIA+JEWEL (Vacuum, without jet quenching effects) in orange and PYTHIA+JEWEL (Medium, with jet quenching effects) in blue. In proton-proton collisions, the transverse momentum ratio, x jZ is naturally peaked at 1 with a spread towards small and larger values. The former is a consequence of not being able to fully recover the energy due to the finite radial extent of the jet and events where more than one jet is reconstructed 2 , while the latter comes mainly from initial-state radiation contamination. Since the Z-boson and its decay prod- 1 The transverse plane is defined with respect to the colliding beams axis. 2 JEWEL uses LO hard matrix elements and thus does not generate Z + 2 jets at hard matrix element level, the parton shower generates configurations where the initial parton radiates sufficiently hard and wide for the end result being 2 reconstructed jet. ucts (muons) are colourless, they will not undergo any modification when medium effects are introduced. So, they provide a good proxy for the initial momentum of the jet-initiating parton. Nonetheless, the recoiling jet will experience several scattering processes, inducing extra radiation that is emitted at finite angles. While part of this radiation stays inside the jet under the form of softer fragments, collisional energy loss contributes further to the depletion of the reconstructed transverse momentum p T,jet (and x jZ ) and effectively reduces the number of particles n const since they are transported up to large radial distances in (η, φ) [29]. As such, while in vacuum there is a large correlation between p T,jet and p T,Z , as shown in the left panel of fig. 3, the left shift on the x jZ medium distribution partially destroys the correlation between the boson and jet transverse momenta (right panel of fig. 3) In addition, the criterion on the minimum jet transverse momentum also induces a selection bias on the medium sample: pairs whose recoiling jet is below the cut-off will not be included. These configurations are usually dominated by jets with a larger number of constituents and a wider fragmentation pattern. As a result, the medium sample will be dominated by jets with a narrower fragmentation pattern, which in turn are naturally present in the vacuum sample.
The p T,jet and n const show significant differences between vacuum and medium samples (see fig. 2). These will be used as input information to the training of some of the DL models used in this work. For such networks, we expect that the discriminating power will be significantly correlated to these variables. Nonetheless, the x jZ variable will not be included in any of the DL architectures. Since this is a powerful discriminant in itself, we preserve it as a physical benchmark against which to compare the different DL outputs.

Data representations
The simulated data used in the present study was prepared in different formats, each representing the jets in a specific way, which encodes the information with different implicit biases. We explore three main jet representations: calorimeter images, Lund plane coordinates, and jet-wise p T,jet and n const . Each representation of the jet carries different implicit features that are more suitable to study different substructure aspects of jet quenching.

Jet images
The jet-image consists of displaying the transverse momentum and multiplicity of the jet constituents mimicking calorimeter towers. As such, the jet particles are drawn in a (∆η, ∆φ) grid composed of 35 × 35 cells centred in the jet axis. Each cell will have two channels, where the first contains the accumulated transverse momentum of the particles contained in that cell while the second channel contains the particle multiplicity. When summing over of all cell's content we recover the jet p T,jet and n const . This type of information contains, in principle, all possible angularity-type of variables [30]. The usage of calorimeter images with CNNs have been explored previously [31,32] in the context of the classification between jets initiated by quarks and jets initiated by gluons, both in proton-proton and heavy-ion collisions. We work with two different types of jet images. In the first case, unnormalised, we use the absolute values of the p T and multiplicity of each cell, while in the second approach, normalised, each channel is normalised by the sum of its entries, i.e., the p T,jet and n const . The purpose of this is to have a comparison in performance between a DL network that has access to the whole information, including the scale of p T,jet and n const , and one that only has access to the relative fragmentation pattern in (∆φ, ∆η).
In fig. 4 we present both channels, the relative (normalized) p T,jet and n const , of the mean image of each sample subtracted by the mean image of both samples, defined as where X stands for the channel being shown, E is the expected value, and V and M representing the Vacuum and Medium samples, respectively. As we can see, the differences against the mean normalied image of both samples are very nuanced for both Vacuum and Medium samples. However, we do observe that the central pixel has, on average, a smaller value for the Vacuum sample than for the medium sample for both channels, signalling a narrower jet selection bias.
In fig. 5 we show the same image for the unnormalised case. Here, the differences between Vacuum and Medium are more noticeable, highlighting the expectation that providing the absolute scale of both p T,jet and n const will facilitate discrimination between both samples. We also see that the distribution of momentum and multiplicity inside of jets in the Medium sample is typically more suppressed with respect to the Vacuum sample. This observation is in agreement with the energy loss mechanism implemented within JEWEL, and the effect Normalised Images -p T distribution  on the selection bias induced by the jet p T cut, discussed previously. Overall, jets with a narrower fragmentation pattern are more likely to survive in the presence of energy loss effects.

Lund plane coordinates
The second jet representation considered is the primary sequence of Lund plane coordinates of the Cambridge-Aachen (C/A) jet clustering sequence [33]. To produce these, the jet is reclustered with the C/A clustering algorithm. The unclustering sequence, at each step, will result into two sub-jets with transverse momentum p T,1 and p T,2 respectively, from which we obtain the (log k T , − log ∆R) coordinates 3 . The procedure goes iteratively along the primary (the hardest) branch. Since the C/A clustering algorithm produces a branching history that is angle-ordered, we retain the order of these splittings. Therefore, the jets in this format are represented by a N × 2 matrix, where N is the number of branches in the Unnormalised Images -p T distribution  final clustering tree. The reclustering of the jets was performed in FastJet [28].
The average representation of these primary jet Lund planes obtained from the JEWEL+ PYTHIA Vacuum (Medium) samples is shown in fig. 6 left (right). The diagonal lines with negative slope represent the kinematic cut of having a sub-jet with p min T,part ≤ k T ≤ p T,jet /2. We also considered other clustering algorithms, in particular, the τ algorithm as proposed in [15], different settings of grooming using the Soft-Drop procedure [34], and different Lund plane representations and coordinates. To settle for the coordinates of the primary jet Lund planes obtained with C/A re-clustering without Soft-Drop, we performed a preliminary analysis using a non-optimised DL model to assess the dependence of its performance on these different combinations. We found that all DL networks performed similarly, and we fixed the representation that is presented above.

Tabular data -global p T,jet and n const
The final jet representation corresponds to tabular data containing p T,jet and n const per jet. The purpose of this representation is to quantify the discriminating power of these two variables alone. Two of the representations above have information on both the jet p T,jet and its number of constituents: the unnormalised images and the Lund plane coordinate sequences. As such, we will produce a DL discriminant using only these two variables so that we can compare how much the implicit jet substructure information in the images and Lund plane coordinates improves the performance over the information on the absolute scale of these variables.

Deep Learning for jet quenching classification
Deep Learning provides an array of versatile models capable of performing a wide range of tasks. In addition, their capacity to learn over different data formats, including highly unstructured formats such as images, allows us to train intelligent systems in data that have not been considered before. Indeed, it is the capacity of DL models to abstract the relevant features from unstructured data that is driving many of the novel and cutting-edge DL applications.
In light of this, we developed three different architectures that can take the most out of the data representations that we have discussed above. These architectures were used to develop classifiers with the purpose of discriminating between vacuum and mediummodified jets, with each making use of the different implicit features in the simulated data representations: • Images: Convolutional Neural Network (CNN) for the jet (η, φ) images. In addition, we further considered the case that the image channels were normalised or left unnormalised. Schematically represented in fig. 7.
• Lund: Recurrent Neural Network (RNN) for the sequence of the C/A re-clustered sequence of the primary Lund plane coordinates. Schematically represented in fig. 8.
• Global: Dense Neural Network (DNN) for the tabular data of the global jet transverse momentum and the number of constituents, (p T,jet , n const ).
For the jet-image representation, we use the CNN, which is the customary architecture for image-type of data, i.e. for grid data with highly correlated localised densities (the pixels) that produce larger hierarchical relations (the textures and shapes) that also benefit from composition, which is independent of the absolute coordinates in the grid.
For the Lund plane coordinates, we produced an RNN. RNNs are sensitive to the causal order of sequential steps, for example, those also appearing in natural language or audio. Since the C/A sequence entails a physically motivated ordering in angles, we exploit this structure by using an RNN.
Finally, the Global DNN serves to set the baseline discriminating power present in the variables (p T,jet , n const ) in order to disentangle the contribution of these variables from the substructure variables that the remaining networks will learn to perform the same task.
All models were developed with TensorFlow 2.3 [35], using its internal Keras API [36]. The data samples were randomly split into train, validation and test sets in a 1:1:1 proportion. This guarantees similar statistical representation for model training, selection, and physical discussion of the results. The importance of retaining the same statistical representation at each stage is understood as follows: during model training, the neural network weights are updated through the successive application of gradient descent steps which are calculated on mini-batch averages of gradients, for which having a good statistical representation is crucial to avoid biases towards kinematic regions with greater Monte Carlo statistical errors in the training set; during model selection, which is discussed in the next section, through hyper-parameter optimisation, we compare performance metrics of trained models to find the best one and, to prevent selecting a biased model, we require the validation set to have a good statistical description of the data; finally, in the application phase where we perform the analysis using the trained models, we want to have as good statistics as possible such that conclusions are statistically sound. Since we want to maximise the statistics of each of the three sets, the best solution is to split them equally as the methodology as a whole will only be as robust as the weakest link. Furthermore, at each stage, the Monte Carlo weights were used to enforce the statistical description of the kinematic distributions.

Hyperparameter optimisation
A crucial step in any DL application is the optimisation of the so-called hyperparameters of the model, which are the non-trainable parameters that specify the details of the architecture and its training. For each of the four cases, we performed a hyper-parameter optimisation loop using optuna [37] to tune the details of the architectures. The search space for each architecture is shown in table 1. We allocated a budget of 50 trials or 12 hours, whatever came first, per optuna loop and we evaluated the performance of each hyper-parameter combination using the validation set.    In addition, EarlyStopping (with patience of 10 epochs) and ModelCheckpoint callbacks were used during training to keep the network weights of the best model across all trials. To focus on the most promising combinations, we also employed the MedianPruner pruner (with 10 warm-up epochs and evaluated every 5 epochs henceforth), which interrupts a training that does not perform better than the insofar median of the validation loss. The hyper-parameters were sampled using the built-in Tree Parzen Estimator [38], with the multivariate flag set to true.
Furthermore, in the same loop, we also optimised the details of the learning rate scheduler used during training. For all the cases the Adam optimiser [39] was chosen, with an ExponentialCyclicalLearningRate schedule as implemented by TensorFlowAddons 0.11 with initial_learning_rate=1e-5, maximal_learning_rate=1e-2, scale_mode="cycle", step_size set to half the total number of batches, and gamma to be optimised by the optuna loop in the range [0.925, 0.975] in steps of 0.005. For all cases, the batch size was set to 1024, as well as a maximum of 100 epochs. Both the hyper-parameter bounds in table 1 and the training details discussed above were defined after an initial round of manual trials to determine reasonable configurations within the hardware and time constraints. The final best hyperparameters are shown in table 2, where we observe that no optimal configuration is set at the boundaries defined in table 1, which reinforces the initial choice of the hyperparameter space.
In  For the CNN architecture, the values of the Kernel Size, Stride, and Padding fix the maximum depth to 4 for 35 × 35 images without the use of pooling layers after the convolutions, making this architecture purely convolutional. Consequently, the network does not lose information from the data through pooling operations and can still progressively reduce the representation size (c.f. fig. 7) to the point where the output of the last convolution is already of N filters of dimension 1. In turn, this means that the head of the network is a linear classifier over patterns learned by the filters, which will allow us to better interpret what the network learned to perform the classification, as further discussed in appendix B. Finally, the usage of LeakyReLU and Batch Normalisation are standard recommended practices for Deep Neural Networks.
For the RNN architecture, we fixed the recurrent unit to be the Gated Recurrent Unit (GRU). In early experiments, we did not observe any variation in performance between GRU and the Long Short-Term Memory (LSTM) unit, for which we fixed the choice on GRU before the optuna loop as this unit has fewer parameters than the LSTM. In addition, we did not optimise the inner hyper-parameters of the GRU since only a few combinations allow for Keras optimised CUDA implementation that significantly increases the training speed. No inter-layer Batch Normalisation or Dropout was used as it is common in RNN since these are meant to be applied on the outputs of layers, whereas in an RNN the learning process step is performed step-wise across the sequence jointly across all layers.
Finally, for the DNN architecture, the implementation of LeakyReLU and Batch Normalisation was fixed to simplify the hyperparameter optimisation loop and to allow for deep, i.e. many layers, configurations.

Performance of the Deep Learning Architectures
The outputs of the DL networks are shown in fig. 9 for the validation data set. During network training, the Vacuum sample is identified with a true target value of 0 and the Medium sample with 1. Thus, the distribution of the predicted labels should be closer to 1 for jets obtained from the Medium sample and closer to 0 for jets obtained from the Vacuum simulation. This is observed for all DL architectures.  The final goal of these classifiers is to identify jets that experienced strong jet quenching effects. However, the Medium sample does not yield a pure sample of medium-modified jets, containing also a collection of reconstructed jets that, probabilistically, did not experience strong energy loss modifications (events for which x jZ ∼ 1). Nevertheless, while learning to distinguish between the Vacuum and Medium samples, part of the network will learn the effects of jet quenching on each data representation type. At the same time, this fact limits the capacity of the models to discern between the pure vacuum-like jets (proton-proton collisions) and medium-like jets (whose fragmentation pattern was modified by the presence of in-medium scatterings and in-medium radiation).
The represented in fig. 10, where the area under the ROC curve (AUC) is also reported. The CNN for normalised images has the poorer AUC, 0.67, while the remaining models achieve an AUC around 0.74. This is an indication that the jet absolute p T and number of constituents play an important role on distinguishing between the Vacuum and Medium samples. In Section 4, we further investigate the outputs provided by the DL architectures to understand if the two classes of jets identified by the networks are compatible with the desired mediumversus vacuum-like jets separation.
Moreover, in table 3, we also present the AUCs obtained for the different DL models over the same samples after performing a p T > 125 GeV cut. The reason to do this is that by increasing the minimum p T,jet , while keeping the same cut on p T,Z , we are discarding most of the events with p T,Z <125 GeV on both samples (the few vacuum events that will pass this cut will be the ones with a large ISR contamination; in the presence of a medium, those will fall below the cut). Most of the selected events will then have a Z-boson with a p T,Z that is near the momentum threshold for the jet. As such, while jet quenching effects will still be present, the magnitude of those will be highly reduced by definition, since those should come from the high end of the p T distribution. We observe that the AUCs obtained with the DNN, RNN and CNN with unnormalised images decrease around 10% for jets with p T >125 GeV, where the p T spectra are identical between the medium and vacuum

Results and interpretation of the Deep Learning architectures
In order to investigate how the DL networks separate between jets reconstructed from the Vacuum and Medium sample, we plot the predicted DL outputs versus x jZ in fig. 11. Simultaneously, since x jZ is a good proxy for the quenching phenomenon at the jet level, this allows evaluating the potential of the networks for a jet quenching tagging application. The outputs of the different DL architectures are nearly uncorrelated with x jZ for vacuum (see appendix A), which is a desired property for the tagger since events for which x jZ differs from 1 in the vacuum result from spurious effects, independent of jet quenching through interaction with the QGP. On the other hand, the DNN, RNN and CNN from unnormalised images have larger predictions for smaller values of x jZ , i.e. when the jet modification by the medium is also larger on average. Therefore, these networks are predicting better the labels of jets which are quenched and misidentifying as vacuum jets with lower x jZ , effectively behaving as a jet quenching classifier. Using normalised images, the CNN seems only slightly correlated with x jZ , which means that in principle the decision boundary of the model is not the most adequate for tagging quenched jets. Furthermore, in appendix A, we inspect the correlations between the DL discriminants.
To test the results of the different architectures, we created two samples of medium-like and vacuum-like jets as identified by the output of each DL network. On both samples generated by JEWEL+PYTHIA (Vacuum and Medium), we classified jets as quenched (if the DL discriminant was above a given reference value) or vacuum (if the result was below). This reference value was not optimised and it was chosen for illustration purposes only. Taking the results of fig. 9, we set this reference cut to 0.7 except for the CNN trained on normalised images, which was set to 0.6. A comparison of the resulting Z-boson spectra contrasting the Monte Carlo truth is shown in fig. 12. We kept the solid lines representing the Vacuum (orange) and Medium (blue) simulations withdrawn from JEWEL+PYTHIA, while the open symbols reflect the selection identified by each network as being Vacuum (orange) and Medium (blue). In all DL architectures, it is possible to fully recover the vacuum expectations, a good indication of the ability of DL to identify vacuum-like jets. We note that a better agreement of the vacuum-like to the Vacuum sample could in principle be achieved by optimizing the reference cut. Nevertheless, our purpose is to distinguish medium-from vacuum-like jets. We thus expect some events from the Medium sample to be identified by the DL models as vacuum-like, thus slightly distorting the resulting distributions. The medium-like p T,Z spectra obtained through DL selection is always suppressed with respect to the Medium sample, thus indicating that all DL models are making a separation that does not follow our Vacuum vs Medium simulations. To confirm if the DL classifiers were not misidentifying medium-like jets out of the JEWEL+PYTHIA Vacuum simulation, we checked the percentage of the test events categorised as being medium-like. This amounts to 9% for the CNN trained on unnormalised images, 11% for the Global DNN, 13% for the RNN, and 18% for the CNN trained on normalised images, thus confirming that, overall, the obtained DL discriminants can correctly identify jets whose fragmentation pattern follows the same as vacuum physics. The CNN trained on normalised images is the one where misidentification can potentially impact the interpretation of the results. It is well known that the p T,jet is a fairly good discriminant of non-quenching. Selecting higher transverse momentum jets (higher p T,Z ) will likely bias our sample towards lower fragmentation patterns, and as such, subject to smaller energy loss effects. Since this quantity is related to the p T,Z , if this information was used by the DL network as a discriminant, we expect a different p T,Z dependence from the Monte Carlo truth. From fig. 12, we observed that the resulting transverse momentum dependence of the medium-like jets provided by the DL architecture vary between them. The Lund sequences (RNN) are the ones that show a larger p T,Z dependence, followed by unnormalised images (CNN) and the global information (DNN). By construction, the CNN trained on normalised images shows a very weak dependence on the p T,Z . This is in agreement with the results in Table 3, where the performance of the Global DNN was the most affected when increasing the minimum cut on p T,jet . Moreover, we also see that the RNN and unnormalised CNN are using additional information from the jet fragmentation pattern since they have a different p T,Z dependence when compared to the Global DNN, trained solely on p T,jet and n const . We now proceed to analyse the transverse momentum imbalance of the medium-and vacuum-like event sample. This observable is only sensitive to the fraction of transverse momentum that is captured inside the jet area, with respect to the p T,Z . However, energy loss induced by jet quenching effects is associated with a change in the fragmentation pattern of a jet. As such, we expect some differences between the Global and the DL architectures that do use clustering information as input. On the opposite end, we have the CNN trained on normalised images whose input is the relative differences in the fragmentation pattern. The resulting distributions are illustrated in fig. 13, keeping the same symbol (open symbols -DL output; full symbols -JEWEL+PYTHIA) and colour notation (orange -Vacuum; blue -Medium) as before. Clearly, the CNN that is trained on normalised images shows the most different results. It is not able to recover so well the x jZ distribution obtained from the Vacuum sample (the Global DNN seems to excel in this sense) and the x jZ of medium-like jets is also more flat when compared to the Medium sample. The distribution of the output of this network ( fig. 9) for the Medium and Vacuum sample overlaps significantly, making it more difficult to select a suitable reference value. Nonetheless, the medium-like x jZ provided by this CNN seems to enhance medium-like features with respect to the medium sample as its x jZ distribution is displaced towards smaller values. Training only on jet-wise variables, such as the Global DNN, provides an excellent description of the vacuum x jZ . As expected, using p T,jet during the training helps to describe observables that are exclusively sensitive to energy loss effects. The medium-like x jZ provided by the Global DNN is shifted towards the left and has approximately the same shape as the Medium Monte Carlo truth. By using a more complete set of jet information -unnormalised images or Lund planes -we see that the medium-like distribution selected by the corresponding DL architectures is even more peaked at lower x jZ . The vacuum-like distribution is slightly displaced from the Monte Carlo truth Vacuum sample. This might also hint that these networks can identify jets in the Medium JEWEL+PYTHIA sample that did not experience major interactions with the  medium, thus categorizing them as vacuum-like jets. After checking the p T,Z dependence and the results on x jZ we move to observables that require information from jet substructure: the average jet radial profile, that keeps track of the number of particles in bins of ∆R inside of the jet, and the jet mass, m j , that weights distance and transverse momentum of the particles inside the jet.
The results for the average jet radial profile are shown in fig. 14. Overall, all DL architectures select the same type of vacuum-like pattern jets as the Vacuum sample, even though the resulting x jZ distribution can vary. The Global DNN, that did not receive information from the jet fragmentation during training, shows the same trend, but this can be a consequence of providing an exceptional good agreement on the highly peaked vacuum x jZ distribution. It is also possible to see that all but the Global DNN identify medium-like jets as being narrower than the ones within the Medium JEWEL+PYTHIA sample. Since the latter sample contains a mixture of different levels of quenching, it is thus expected that a more pure sample of medium-like jets will be even narrower. The details between the DNNs results differ, nonetheless. The CNN trained on normalised images is the one that shows the highest deviation because it is trained only on the relative fragmentation. It follows the Lund planes and unormalised jet images. We note that while the presence of jet quenching will induce a narrower average jet radial profile, the opposite is not necessarily verified. For this reason, the CNN trained on normalised images results into a more flat x jZ distribution despite showing a selection of very narrow jets. On the other hand, the DL networks exploring unnormalised images or Lund planes identify a not so narrow jet, but that indeed lost a significant amount of energy relative to its initial momentum (p T,Z ). The Global DNN, whose training did not contain any information on the jet substructure, still selects jets whose centre is depleted concerning the Medium sample. These jets are more evenly populated, and thus likely to contain medium-induced radiation that travelled along the jet direction. While retaining this energy, these jets continue to experience collisional energy loss as its absolute multiplicity continues to be smaller than the Medium Monte Carlo reference. Finally, the results on the jet mass are shown in fig. 15. As mentioned before, this observable keeps track of all jet input variables used in this training by definition. Thus, all DL architectures used in this study were given (partial) input about this observable, and as such, all of them are able to identify vacuum-like as being the same as the Monte Carlo Vacuum sample. Simultaneously medium-like jets show a jet mass that is smaller than the Medium sample. Nonetheless, we see that the two networks trained with all information (unormalised images and Lund planes) tag a jet population with smaller jet mass induced by jet quenching effects. As such, relative differences on the jet pattern alone or jet-wise variables (p T,jet and n const ) can be used to identify energy loss effects or differences in the jet fragmentation function, independently if a particular observable is insensitive to one effect or the other. However, it seems that alone they do not suffice. A combination of both seems to work better to emphasise jet quenching features across a wider range of observables. In this regard, both the CNN trained on unnormalised images (final state particles only) or the RNN in Lund planes (declustering information) seem to perform equally well. In appendix B we scrutinise further which jet features are the CNNs triggering on.

Conclusions
In this work we set out to explore how different DL architectures learn to discriminate between medium-like and vacuum-like jets. For this purpose, we used JEWEL as our Monte Carlo event generator to produced Vacuum and Medium Z+jet samples. The different architectures presented were chosen as to utilise different data representations of the jets: Convolutional Neural Networks for jet-images, Dense Neural Networks for jet-wise observables, and Recurrent Neural Networks for Lund plane paths. Since each data format carries different explicit and implicit information, this comparison allowed us to further understand how DL can help isolate medium-induced effects in jets.
By looking at how a DL-based classification affect the distributions of the jet observables, we observed that while all DL networks seem to identify the same vacuum-like distributions, they did not always produced the same medium-like distributions. More specifically, we observed that CNN on unnormalised images and the RNN on Lund plane paths identified medium-like jets to be more different from the results provided by the Monte Carlo (Medium sample). These two architectures have access to the absolute scale of the jet transverse momentum and to its number of constituents, but their discriminant power is not based on the two observables alone, since they outperform the Global DNN establishing the ground performance of the transverse momentum and multiplicity of constituents. As such, the result indicates that the CNN on unnormalised images and the RNN are also learning from the fragmentation pattern that will yield different jet profiles (CNN) as well as different jet Lund sequences (RNN).
The samples that include jet quenching effects were produced with JEWEL, a widely used jet quenching Monte Carlo event generator. While our results are biased towards this Monte Carlo truth, the agreement of this model over a wide range of jet observables [19,26] provides a robust baseline to establish the first step towards using Deep Learning techniques to identify in-medium modifications. Therefore, the effect of medium-induced recoils, the most model-dependent feature of current jet quenching descriptions, was neglected. As an outlook, we plan to use the most performant networks from this study (RNN and CNN for unnormalised images) in different jet quenching Monte Carlo event generators and with the presence of recoiling scattering centres. This will further probe the capability of the proposed DNN methodology to identify jet quenching effects induced by the presence of a QGP.

A Correlation between Deep Neural Networks
We inspect the bi-dimensional distributions of the outputs of the DL models for all pair combinations of models in fig. 16 and fig. 17, respectively for Vacuum and Medium. There is a strong linear correlation for the models which access the jet p T distribution, i.e. the global DNN, the RNN and the CNN trained on unnormalised images, providing evidence that the underlying common features being learnt by the models are the distributions of jet p T and number of constituents. While still existent, the correlation between the output of these models and the output of the CNN on normalised images is more faint in the Medium sample, which corroborates the conclusion. Moreover, the output of the CNNs for Vacuum are significantly correlated.

B Interpreting what the CNNs learnt
Whilst most of the DNN architectures function as black-boxes once trained, methodologies have been developed to help us better understand what CNNs learn during training. After analysing some jet observables in section 4, we will now proceed to further identify which particular jet feature are CNNs using to classify the jet as experiencing jet quenching effects.
The first of such methods is the one of finding the images that produce maximal activations for the filters. As it was earlier argued in section 3, the CNN architecture used in this work is such that the last convolutional layer produces an array of dimensionality N f ilter , i.e. a dense vector of the high-level representation of the data, which is then passed on to a linear classifier in the last layer, being the output of the entire network where σ is the sigmoid function, w the weights of the last layer, b the bias of the last layer, and represents the whole convolution process: where each CONV i represents a convolutional layer, c.f. fig. 7. Since the features produced by the last convolutional layer are afinely combined to produce the probability of belonging to the medium sample eq. (2), we know the contribution that each will impact the final score by inspecting the value of the weight in w that multiplies it. Furthermore, we can find an image with the patterns that maximise these final features: for each i filter. This method is known as GradCAM [40], as it makes use of gradient descent to maximise an image, which is initialised at random, to produce an interpretable pattern of the learned features.
In fig. 18 we produce the patterns for the maximal activations for the three most discriminant high-level features, i.e. the ones that have the largest and smallest associated weight w i in the final layer, for the CNN trained on normalised images. Negative (positive) values of the weights mean that the associated pattern is contributing to classify an image as vacuum (medium). We notice that for the patterns associated with vacuum discrimination, the CNN learnt to look for denser distributions of both transverse momentum and multiplicity across the grid cells, whereas for the medium sample it is looking for far more scattered patterns. In addition, we notice how many pixels have yellow hue, meaning that the network is looking at both channels jointly (otherwise the pixel would either be red or green, depending if the network focused solely on the p T or multiplicity channel respectively). The brighter the pixel (regardless if it is with respect to one or both channels) the more important it was to contribute to the activation of the filter.
In fig. 19 we produce the patterns for the maximal activations for the three most discriminant high-level features, for the CNN trained on unnormalised images. We observe completely different patterns than those learnt by the CNN on normalised images. More concretely, there is no discerning trend to prefer denser patterns for vacuum and scattered ones for medium. Furthermore, at each pixel it is focusing either on p T or n const , and the regions that it looks for the distribution of each are disjoint. This result is in agreement with fig. 12, where we observed that the output of the CNN on unnormalised images is sensitive to the scale of the p T of the Z being emitted in the hard scattering.
While the previous study helps us understanding patterns which, once convoluted with the image, affect the final classification score, it does not provide any insight on specific images. Therefore, we need another method that produces an explanation from the network on why it classified an example the way it did. This can be accomplished with a method called integrated gradients [41], which schematically works as follows. First initialise a random image, i.e. noise. Then, select a data image that has been properly classified, X, and produce N + 1 linearly interpolated images between that image and the noise image: This sequential interpolation approximates the morphing of the noise image into the data image. By computing the gradients of the prediction, i.e. the output of the neural network for the correct class, with respect to each x i we will know what parts of the data image were relevant for the classification. By integrating them over i using the trapezoidal rule,  Figure 19: The most discriminative patterns of the unnormalised images. Red represents the transverse momentum channel, green the multiplicity channel. Brighter pixels represent the pattern that maximised the activation of the respective filter. Since the top (bottom) filters have a negative (positive) weight, they are being triggered by vacuum (medium) sample jets.
we will obtain an explanation of the classification by isolating the pixels of the data image that contributed the most. We closely followed the implementation provided in the Keras website [42].
In fig. 20 we can observe the integrated gradients for three examples of images from the medium sample from CNN models. For the normalised images, the model is looking for the pixel with the highest p T and n const fractions, i.e. the bright yellow pixel on the input image, and it is looking at both channels of that pixel at the same point. On the otherhand, for the unnormalised images, the model is seemingly looking for the pixel with the most p T while jointly it learns the distribution of n const elsewhere in the image. These are in agreement with the most discriminating patterns above, where we saw that the while the normalised images model is jointly looking at both channels, i.e. the p T and n const spatial distributions, the unnormalised images model is focusing on these distributions at disjoint regions of the image.
Since we used a single sigmoid head for our models, in order to obtain the similar patterns for vacuum images we minimised, instead of maximising, the output of the CNN models for correctly classified vacuum images, instead of medium ones. The results are show in fig. 21. The obtained integrated gradients are complementary to the ones we saw for the medium case. However, it seems that normalised images model is seemingly looking at p T and n const disjointly just like thr normalised images model. This might indicate that for vacuum images both networks are learning somehow similar features, while their learnt features for medium differ. This is corroborated by fig. 16 where we see that both CNN are highly correlated in vacuum, but the same is not true for the medium fig. 17.