Deep learning jet modifications in heavy-ion collisions

Jet interactions in a hot QCD medium created in heavy-ion collisions are conventionally assessed by measuring the modification of the distributions of jet observables with respect to the proton-proton baseline. However, the steeply falling production spectrum introduces a strong bias toward small energy losses that obfuscates a direct interpretation of the impact of medium effects in the measured jet ensemble. Modern machine learning techniques offer the potential to tackle this issue on a jet-by-jet basis. In this paper, we employ a convolutional neural network (CNN) to diagnose such modifications from jet images where the training and validation is performed using the hybrid strong/weak coupling model. By analyzing measured jets in heavy-ion collisions, we extract the original jet transverse momentum, i.e., the transverse momentum of an identical jet that did not pass through a medium, in terms of an energy loss ratio. Despite many sources of fluctuations, we achieve good performance and put emphasis on the interpretability of our results. We observe that the angular distribution of soft particles in the jet cone and their relative contribution to the total jet energy contain significant discriminating power, which can be exploited to tailor observables that provide a good estimate of the energy loss ratio. With a well-predicted energy loss ratio, we study a set of jet observables to estimate their sensitivity to bias effects and reveal their medium modifications when compared to a more equivalent jet population, i.e., a set of jets with similar initial energy. Finally, we also show the potential of deep learning techniques in the analysis of the geometrical aspects of jet quenching such as the in-medium traversed length or the position of the hard scattering in the transverse plane, opening up new possibilities for tomographic studies.


Introduction
Jets are collimated sprays of hadrons and other particles which originate from the fragmentation of high energy quarks, gluons or highly boosted bosons. Due to color confinement, only colorless states can be observed in experiments. Thus, jets manifest fundamental and important signatures of the underlying quantum physics probed in collider experiments.
In relativistic heavy-ion collisions, jets are modified by elastic and inelastic processes that take place during their passage through the hot and dense debris of the collision [1][2][3][4].
Historically, jet quenching was first evidenced by the observed strong suppression of intermediate-p_T hadrons at the Relativistic Heavy-Ion Collider (RHIC) [5,6] and, years later, via the dijet asymmetry and the suppression of high-energy reconstructed jets at the Large Hadron Collider (LHC) [7][8][9][10][11][12][13][14]. More recently, considerable efforts have been devoted to measuring the modifications of the internal properties of jets, generally referred to as jet substructure measurements. Such observables provide new constraints for the theory and modelling assumptions. Jet quenching has arguably come to serve as one of the most powerful experimentally accessible probes of the properties of the hot, deconfined QCD medium produced in heavy-ion collisions [37]. The ability to properly correlate the level of medium modification with intrinsic properties of individually reconstructed jets will enhance the potential of these probes to accurately diagnose properties of the hot QCD medium, provided that the mechanisms by which jets interact with the medium are under good theoretical control.
When studying modifications of jet observables in both proton-proton and heavy-ion collisions, it is typical to select jets within a certain p_T range or at least above a minimum p_T. Most frequently, one presents the ratio of the distributions of such observables measured in the two colliding systems. However, such an imposed p_T cut introduces a selection bias due to the steeply falling jet production spectrum, typically dσ_jet/dp_T ∝ p_T^(-n), where n ≈ 5 at the LHC and even larger at lower collision energies. It is indeed unlikely to observe jets above the p_T cut which have lost a significant amount of energy, simply because such strongly quenched jets would have had to be produced at a higher p_T, where the spectrum is increasingly suppressed. One strategy to mitigate part of these bias effects is to match the cumulative jet cross-sections in pp and AA collisions [38][39][40].
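This selection bias can be illustrated with a toy Monte Carlo: sample initial jet p_T from a power-law spectrum, apply a random fractional energy loss, and compare the quenching of the inclusive population with that of the jets passing a p_T cut. The spectral index follows the n ≈ 5 quoted above, but the quenching distribution is a purely illustrative choice, not a fit to any model or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrum dsigma/dpT ~ pT^(-n) above pt_min, sampled by inverse CDF:
# the survival function is (pT/pt_min)^(-(n-1)).
n, pt_min, n_jets = 5.0, 100.0, 500_000
pt_init = pt_min * rng.random(n_jets) ** (-1.0 / (n - 1.0))

# Hypothetical jet-by-jet energy loss ratio chi = pT_final / pT_initial.
chi = rng.beta(8.0, 2.0, n_jets)
pt_final = chi * pt_init

# Selecting measured jets above a pT cut, as experiments do, biases the
# surviving sample toward chi ~ 1 (small energy loss).
selected = pt_final > 150.0
print(f"<chi> all jets:         {chi.mean():.3f}")
print(f"<chi> jets passing cut: {chi[selected].mean():.3f}")
```

The selected ensemble has a visibly larger mean χ than the inclusive one, which is the bias the matched-cross-section strategies [38][39][40] aim to mitigate.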
It is generally believed, both from theoretical arguments [41,42] and in most Monte Carlo implementations, that the momentum scales related to high-p T jet production are typically much larger than the local medium scales. This implies that hard radiation, occurring on short time-scales inside the jet, takes place independently of the medium. These initial emissions play an important role for the subsequent evolution of the jet, both considering the final jet properties and the amount of medium modifications that it can experience. One therefore expects that final, measurable jet properties, such as the jet width and fragmentation functions, correlate with the amount of energy that was lost in the medium. Considering at the same time that these properties could be modified during the passage through the medium, it becomes clear that the existence of the selection bias obscures the actual impact of medium effects in final observables and complicates the task of extracting robust information about medium properties.
In order to minimize selection biases and get a better handle on the medium effects, it would be desirable to discern whether a given jet has actually suffered energy loss, or even estimate the amount of energy loss it has experienced. This is a highly nontrivial task on a jet-by-jet basis. Even in vacuum, a jet population of a certain radius within a given momentum range can vary considerably due to the random nature of jet fragmentation. In the medium, randomly distributed path lengths through the medium introduce additional fluctuations. Both sources of fluctuations, adding on top of other kinds related, for example, to the process of hadronization, hinder our ability to identify the degree of medium modification of a specific jet in a heavy-ion collision. Unfolding these fluctuations to access information about the magnitude of jet-medium interactions for each individual reconstructed jet looks a priori as a daunting task.
Machine learning techniques have been widely applied in jet physics, for tasks such as QCD/W jet tagging [43][44][45][46][47], top tagging [48][49][50], quark/gluon jet discrimination [51,52] and heavy-flavor classification [53]. Various neural networks have been employed in tackling these issues, including convolutional neural networks (CNNs) with jet images as input [44,45,50,51] (often supplemented by deep neural networks analyzing jet substructure observables for the purpose of interpretability [50,54]), recurrent neural networks (RNNs) with the Primary Lund plane as input [46], recursive neural networks (RecNNs) with the declustering history tree as input [47,52] and point-cloud networks with an unordered set of jet constituent particles as input [55]. First attempts at using machine learning techniques for jet physics in heavy-ion collisions include distinguishing between quark and gluon energy loss [56]. These techniques match or outperform conventional physically motivated features in the tasks above. It is therefore tempting to apply them to diagnosing the jet quenching phenomenon on a jet-by-jet basis, to help identify the features most sensitive to the process of energy loss.
In this work we explore the power of deep learning techniques and study their feasibility for extracting, from measurable final-state properties, the energy loss that individual inclusive jets experience. The problem is formulated as a regression task whose objective is to find the hadronic energy loss ratio, i.e., the ratio of the transverse momentum of a hadronic jet after traversing a hot QCD medium to the transverse momentum of its "vacuum" equivalent jet, that is, an identical jet that has not propagated through the medium. With the predicted energy loss ratio, we classify jets from an in-medium sample into a range of categories spanning from strongly quenched to almost completely unmodified. We analyze a set of jet observables, including groomed observables, e.g., the groomed momentum sharing fraction z_g, the groomed jet radius R_g and the Soft Drop multiplicity n_SD, as well as ungroomed ones, such as the jet shape and jet fragmentation functions. We show how some of the properties of the jets from the unquenched class do not necessarily converge to those of the jets in a vacuum sample. This is due to the selection bias, which affects more strongly those observables that relate more intimately to the amount of energy loss experienced. We close with an exploratory discussion of jet tomography assisted by deep learning.
Throughout this work, we put strong emphasis on interpreting our results by comparing various setups for extracting the energy loss ratio, including varying the inputs and network architectures. These additional attempts aid our understanding of which features of the jet allow the network to successfully extract the key information. It turns out that the combination of hard, small-angle and soft, large-angle structures of the jet is crucial in obtaining the best performance. Furthermore, a main goal of our work is to examine the output of the machine learning in terms of human-understandable variables. This goal is shared with the main body of classification studies in proton-proton collisions, which compare the performance of human-defined observables [44,54] with that of a neural network. In heavy-ion collisions, in lieu of a theoretically well-motivated observable, or set of observables, to gauge the effects of energy loss, we instead attempt to construct such observables by studying the sensitivity of the network. First, we define an energy fraction of hard particles in a jet and, second, a non-linear combination of inputs from the jet shape, both explored in Sec. 3.4. Neither construct resulted in a sensitivity on the same level as that of the full neural network. Nevertheless, this constitutes a first attempt at providing physical intuition on the problem of jet energy loss from machine learning.
The rest of the paper is organized as follows. In Section 2 we explain the energy loss model used and the matching procedure between vacuum and medium jets, and introduce definitions of the relevant physical quantities and jet observables; we then describe the regression task, detail the network architectures, and explain how the jet samples are generated and assigned sample weights for unbiased training. Section 3 contains information about the pre-processing steps undertaken to use the jet image as an input to the deep neural network architecture. The correlations between jet energy loss and jet observables are briefly presented. Next, we present the prediction performance of the CNN and compare the results from various scenarios, with different inputs and different networks, to discuss the interpretability of the CNN's efficiency. Some first applications of our procedure are proposed in Section 4, where we study the response of jet observables to the amount of energy loss using two different jet selections, and also present the geometrical information that can be extracted thanks to its correlation with energy loss. Finally, Section 5 summarizes the results and discusses the path ahead.

General setup and main variables
This section introduces three main concepts. Firstly, we describe the particular Monte Carlo event generator that was used to generate the analyzed jet images. Secondly, we discuss the main physical observables that will be used for the analysis. Finally, we describe the machine learning frameworks used in this work.

Modeling energy loss using the hybrid model
Energetic partons produced in hard scatterings are created with a high initial virtuality, Q ∼ p_T. The high virtuality is relaxed through successive splittings, as dictated by the DGLAP evolution equations. In the presence of a deconfined QCD medium, such as the one created in heavy-ion collisions, these jets will interact with the degrees of freedom of the plasma, whose scale is typically characterised by the local temperature T. Given the wide separation between the two relevant scales of the jet-plasma system, i.e., Q ≫ T, one can to a good approximation factorise the high-virtuality showering process from the interaction with the plasma at lower energies. This assumption has been used in the development of jet quenching Monte Carlos where the interaction with the medium is described at weak coupling, such as MARTINI [57], LBT [58] (both available within the JETSCAPE framework [59]), PYQUEN [60] and the Saclay model [61], and also in those where the interaction with the medium is strongly coupled, such as the hybrid strong/weak coupling model [64]. In the present work we analyse data from the hybrid strong/weak coupling model, leaving the extension to more models for future work.
In the hybrid model, parton showering is described using the event generator PYTHIA8 [65], supplemented with the nuclear PDF modifications from EPS09 [66]. The space-time picture of the parton shower is based on a formation-time argument [67], such that each parton propagates through the QGP for a distance t_f ≡ 2E/Q², with E the energy of the parton and Q its virtuality. The shower is then embedded in a heavy-ion environment. First, one selects the initial position of the hard scattering in the transverse plane through optical Glauber sampling. Local properties of the QGP necessary to describe the interaction, such as the temperature T(x) and fluid velocity u(x), are read from hydrodynamic profiles that describe the expansion and cooling of the liquid QGP droplet [68]. The strongly coupled interaction is modelled using an energy loss rate obtained within gauge/gravity duality for N = 4 supersymmetric Yang-Mills theory at large N_c [69,70],

    dE/dx = −(4/π) E_in x² / (x_stop² √(x_stop² − x²)),   with   x_stop = E_in^(1/3) / (2 κ_sc T^(4/3)),   (2.1)

where E_in is the initial energy of the parton, x the distance it has travelled and T the local temperature of the plasma. The quantity κ_sc, which depends on the 't Hooft coupling but whose precise expression varies depending on how the energetic parton is prepared in the holographic calculation, is taken as a free parameter that is fit to hadron and jet suppression data [71]. The results in Eq. (2.1) are derived in the local fluid rest frame. In order to take into account the effect of the flowing medium, we replace E_in and x by their corresponding values in the local fluid rest frame, E_in^F and x^F, which can be expressed in terms of laboratory-frame quantities [72] involving the parton velocity v ≡ p/E, the fluid velocity u and Lorentz factor γ_F, the time t_0 at which the parton was produced and the observation time t.
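The stopping behaviour implied by this rate can be checked numerically. The sketch below integrates the rate for a static brick of plasma at fixed temperature, assuming the standard hybrid-model form dE/dx = −(4/π) E_in x²/(x_stop² √(x_stop² − x²)) with x_stop = E_in^(1/3)/(2 κ_sc T^(4/3)); the value of κ_sc and the units (GeV and GeV⁻¹, with ħ = c = 1) are illustrative choices, not the fitted values of [71].

```python
import numpy as np

def x_stop(e_in, temp, kappa_sc=0.4):
    """Stopping distance of a parton of initial energy e_in (GeV) in a
    static plasma at temperature temp (GeV); lengths in GeV^-1."""
    return e_in ** (1.0 / 3.0) / (2.0 * kappa_sc * temp ** (4.0 / 3.0))

def energy_lost(e_in, temp, length, steps=200_000):
    """Energy lost over a path `length`, integrating the rate by trapezoids
    (endpoint excluded to avoid the integrable singularity at x_stop)."""
    xs = x_stop(e_in, temp)
    x = np.linspace(0.0, min(length, xs), steps, endpoint=False)
    rate = (4.0 / np.pi) * e_in * x**2 / (xs**2 * np.sqrt(xs**2 - x**2))
    return float(np.sum(0.5 * (rate[1:] + rate[:-1]) * np.diff(x)))

# Characteristic "Bragg-peak" behaviour of this rate: after half the
# stopping distance only a fraction 1/3 - sqrt(3)/(2 pi) ~ 5.8% is lost,
# with most of the energy deposited close to x_stop.
e, T = 100.0, 0.3
print(energy_lost(e, T, x_stop(e, T) / 2) / e)
```

The closed-form check follows from ∫₀^(a/2) x²/√(a² − x²) dx = a²(π/12 − √3/8), so the half-length loss fraction is exactly 1/3 − √3/(2π).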
By following the branching history of a given parton j, we can compute the total length it traversed through the plasma as a sum over its parent history H_j, where t_c^i is the time, counted since its creation, at which the ancestor parton i exits the QGP phase by encountering a temperature below the pseudo-critical temperature T_c = 145 MeV. In principle, there could also be quenching in the hadron gas phase, below T_c. This has so far been ignored by jet quenching models, based on the general argument that hadrons take too long to form [73]. Nevertheless, there are studies that point to the importance of this phase for a variety of observables, especially for low-p_T particles (see, e.g., [74][75][76][77]), which are precisely the kind of hadrons that form the fastest and are the ones most affected by further rescattering. The inclusion of these effects, whose implementation within the current state-of-the-art quenching models is still ongoing work, is left for the future.
The amount of energy and momentum lost by the energetic parton, as described by Eq. (2.1), exactly corresponds to the amount of energy and momentum flowing into the QGP hydrodynamic modes [70]. This will generate a wake that is correlated with the direction of the jet [78], whose contribution to the experimentally observable jet properties has to be taken into account. The hybrid model provides an estimate of the wake contribution to the final hadron spectrum by performing an expansion of the Cooper-Frye formula at the perturbed freeze-out hypersurface, which yields [79]

    E dΔN/d³p = (1/32π) (m_T/T⁵) cosh(y − y_j) exp[−m_T cosh(y − y_j)/T] × [p_T ΔP_T cos(φ − φ_j) + (1/3) m_T ΔM_T cosh(y − y_j)],   (2.5)

where p_T, m_T, φ and y are the transverse momentum, transverse mass, azimuthal angle and rapidity of the emitted thermal particles, and where ΔP_T and ΔM_T = ΔE/cosh y_j are the transverse momentum and transverse mass transferred from the jet, with azimuthal angle and rapidity φ_j and y_j, respectively. The distribution in Eq. (2.5) has been obtained by assuming that the background behaves as a Bjorken flow, which only has longitudinal expansion. Generalizing to the case with transverse expansion can modify this distribution, depending on the orientation of the jet with respect to the background radial flow components [80][81][82]. The consequences of these observations will be explored in the near future. The partons that do not completely hydrodynamize are hadronized using the Lund string model included in PYTHIA8. The contributions from the hadrons of the wake, together with the fragmented hadrons, ensure event-by-event energy-momentum conservation.

Note that the expansion in Eq. (2.5) can also yield negative contributions in some regions of the transverse plane. This reflects the absence of soft particles in such regions of phase space compared to an unperturbed QGP background, as a result of the boost experienced by the fluid cells due to the injection of momentum from the jet. In the present work we ignore such negative contributions, since they would show up as negative-energy pixels in the jet images used in Section 3.1 (one would need to devise a procedure to cancel out such negative contributions using particles from a real background that are close in momentum and configuration space, as in [79], which we leave for future work). It has been shown that their contribution to jet observables with a relatively small jet radius, such as the R = 0.4 used in the present work, is almost negligible [83], which guarantees that none of our conclusions are affected by this omission. A study of jets with a larger radius will be done in future publications.

Jet energy loss ratio χ_jh and traversed path-length L

The main goal of this work is to determine, on a jet-by-jet basis, the amount of energy loss suffered by jets due to their propagation through a hot and dense QCD medium, quantified through the variable

    χ_jh ≡ E_f^h / E_i^h ,   (2.6)

where the subscript "jh" refers to the energy of the jet measured at hadronic level. In this definition, E_f^h is the p_T of a given jet in the presence of a medium, and E_i^h is the p_T of the same jet had there been no medium; the latter relies on a carefully devised matching procedure that is explained below. These jets are reconstructed with FastJet [84] using the anti-k_T algorithm [85] with reconstruction parameter R = 0.4. The variable χ_jh was chosen for several reasons. On the one hand, it is well suited to gauge the energy shift at the level of observable particles, since E_f^h is, in principle, directly measured in experiment. This helps mitigate the event generator bias mentioned earlier. On the other hand, we have also found that χ_jh is quite well approximated by the neural network compared to other possible quantities, as will be discussed extensively below. All needed information, including E_i^h, is readily available in the hybrid model, where each unmodified event is stored together with its medium-modified version.
Other jet quenching models, in which the vacuum evolution is factorized from the interactions with the medium, should also allow such a jet-by-jet correspondence. In this case, E h i should be thought of as a measure of the p T of an equivalent jet in vacuum, e.g., a jet with a similar p T in the cone before the stage where medium interactions are applied to the jet.
In this work, we also derive the amount of QGP traversed by a specific jet. While it is not a quantity directly extracted by the neural network from the provided images, it is readily available from the numerical model used to generate the data, and it provides meaningful information that should be strongly correlated with the modifications and energy loss experienced by a jet. Given that the quantity in Eq. (2.4) refers to the length traversed by a single parton i, we construct the length traversed by a parton jet, L, as the p_T-weighted sum of the individual lengths L_i of the jet constituents at the partonic level,

    L = Σ_i p_{T,i} L_i / Σ_i p_{T,i} ,

which biases the extracted in-medium jet length toward that of the leading particle.
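The p_T-weighted construction can be written as a small helper (a sketch; the function name is ours, not the model's):

```python
import numpy as np

def jet_path_length(pt, lengths):
    """pT-weighted in-medium path length, L = sum_i pT_i L_i / sum_i pT_i,
    over the partonic constituents of the jet."""
    pt, lengths = np.asarray(pt, float), np.asarray(lengths, float)
    return float(np.sum(pt * lengths) / np.sum(pt))

# A hard leading parton dominates the weighted average, as noted above:
print(jet_path_length(pt=[90.0, 10.0], lengths=[4.0, 1.0]))  # -> 3.7
```

The result (3.7) is pulled toward the leading parton's length of 4.0, whereas the unweighted mean would be 2.5.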

Matching procedure
Given a quenched jet of energy E_f^h (at hadronic level), in order to find its vacuum partner we perform the following procedure:

1. Extract the vacuum jets by clustering the list of vacuum hadrons, i.e., the hadrons one would obtain if there were no medium modifications.
2. Extract the medium jets by clustering the list of medium hadrons, which include the hadrons fragmented from the quenched parton shower as well as the hadrons from the wake.
3. For each medium jet, get its vacuum partner by selecting the highest-p_T vacuum jet whose axis is within ∆R < 0.4 of the medium jet axis, where ∆R ≡ √(∆φ² + ∆y²).
Finally, the p_T^jet of the corresponding vacuum jet, which has not experienced any medium modifications or energy loss, is identified as the initial jet energy, i.e., the energy before quenching. In order to identify the medium partonic jet that produced the medium hadronic jet under consideration, we carry out a matching procedure analogous to the one outlined above: a given hadronic jet is matched to one of the medium partonic jets, which are the jets reconstructed from the quenched partons that were not completely absorbed in the medium.
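The matching step can be sketched as follows (a minimal illustration; the dict-based jet representation and function names are ours, not the hybrid model's):

```python
import numpy as np

def delta_r(jet_a, jet_b):
    """Angular distance in the {y, phi} plane, with azimuthal wrap-around."""
    dphi = abs(jet_a["phi"] - jet_b["phi"])
    dphi = min(dphi, 2.0 * np.pi - dphi)
    return float(np.hypot(jet_a["y"] - jet_b["y"], dphi))

def vacuum_partner(medium_jet, vacuum_jets, r_max=0.4):
    """Highest-pT vacuum jet whose axis lies within dR < r_max of the
    medium jet axis; returns None if no vacuum jet is close enough."""
    near = [v for v in vacuum_jets if delta_r(medium_jet, v) < r_max]
    return max(near, key=lambda v: v["pt"]) if near else None
```

With E_f^h taken from the medium jet and E_i^h from the matched partner's p_T, the energy loss ratio χ_jh follows directly.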

Observables
The set of jet observables used at different stages of this work is presented here. They are classified into groomed and ungroomed observables. Starting with the ungroomed class, these include:

• The differential jet shape ρ(r) (JS), i.e., the transverse-energy distribution as a function of the distance r to the jet axis in the {η, φ} plane, with η the pseudorapidity and φ the azimuthal angle, defined as [15,28]

    ρ(r) = (1/δr) (1/N_jet) Σ_jets Σ_{tracks ∈ [r, r+δr)} p_T^track / p_T^jet .   (2.10)

• The jet fragmentation function (JFF), i.e., the distribution of hadrons with an energy fraction z with respect to the jet energy, defined as [21,24] z ≡ (p_T^track/p_T^jet) cos ∆R, with ∆R the distance to the jet axis in the {y, φ} plane.
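For a single jet, the differential jet shape can be computed with a weighted histogram; a sketch (binning choices are illustrative, and the per-ensemble average over N_jet is left to the caller):

```python
import numpy as np

def jet_shape(dr, pt_track, pt_jet, r_max=0.4, n_bins=8):
    """Single-jet differential jet shape rho(r): fraction of the jet pT in
    annuli of width delta_r around the jet axis, normalized per unit r."""
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist, _ = np.histogram(dr, bins=edges, weights=pt_track)
    return hist / (pt_jet * np.diff(edges)), edges
```

By construction, Σ ρ(r) δr = 1 whenever all of the jet p_T lies within r_max of the axis, so ρ(r) directly measures how the energy is distributed in r.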
Groomed jet observables are those obtained from a jet after applying a so-called grooming procedure that removes soft and/or wide-angle particles from the jet. Grooming comprises a set of techniques developed to reduce the sensitivity to the soft, non-perturbative components of a jet, with the intention of gaining access to the partonic, perturbative aspects of jet substructure; see [86,87] for extensive reviews. We focus in the following on the Soft Drop (SD) procedure, which consists in stepping through the clustering history of a jet, following the hardest branch of each clustered pair of sub-jets, and analyzing the steps that satisfy the SD condition, defined as

    z_g ≡ min(p_T,1, p_T,2)/(p_T,1 + p_T,2) > z_cut (R_g/R)^β ,   (2.11)

where R is the jet radius, p_T,1 and p_T,2 are the momenta of the two subjet branches under consideration, z_cut and β are grooming parameters, and z_g and R_g are the groomed momentum sharing fraction and the groomed angular separation in the {y, φ} plane between the branches, respectively. The most common setup explored in the literature corresponds to reclustering a given anti-k_T jet with the angular-ordered Cambridge/Aachen (C/A) algorithm [88,89] and setting β = 0 and z_cut = 0.1. This is also the setup adopted in this work, leaving the analysis of other choices of grooming parameters, or of alternative grooming scenarios such as the recently developed dynamical grooming [90], for future publications. The groomed observables we will use throughout the paper are:

• The groomed momentum sharing fraction z_g, as defined in (2.11). It refers to the first declustering step that satisfies the SD condition, unless explicitly stated otherwise.
• The groomed jet radius R_g, i.e., the angle between the two sub-jets that satisfy (2.11), referring to the first declustering step that satisfies the SD condition unless explicitly stated otherwise.
• The SD multiplicity n SD , i.e., the number of times a given jet satisfies the condition (2.11) along the hardest branch during the declustering procedure.
• The groomed jet mass M_g, i.e., the invariant mass of the sum of the four-momenta of the first two sub-jets that pass the SD condition (2.11).

We will present results for these observables in Section 4.1, showing their sensitivity to the amount of energy loss χ_jh as well as the physics behind their modifications with respect to the vacuum, or the absence thereof. We can, however, anticipate the qualitative trends of such correlations in a compact way.
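A compact way to see how z_g, R_g and n_SD arise from the same declustering scan is the sketch below. It assumes the (z, θ) splittings along the hardest C/A branch have already been extracted (e.g., with FastJet's reclustering and declustering tools, which this sketch does not replace):

```python
def soft_drop_scan(splittings, z_cut=0.1, beta=0.0, R=0.4):
    """Scan (z, theta) splittings along the hardest branch. Returns (z_g, R_g)
    of the first splitting satisfying the SD condition z > z_cut (theta/R)^beta,
    plus n_SD, the total number of splittings passing it."""
    z_g = r_g = None
    n_sd = 0
    for z, theta in splittings:
        if z > z_cut * (theta / R) ** beta:
            n_sd += 1
            if z_g is None:          # first passing step defines z_g and R_g
                z_g, r_g = z, theta
    return z_g, r_g, n_sd
```

With β = 0 the condition reduces to z > z_cut, so a soft first splitting (z below 0.1) is groomed away and the observables are read off from the first sufficiently symmetric one.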

Network architectures and task description
In this work we use two conventional machine learning architectures to extract the energy loss fraction χ_jh on a jet-by-jet basis: the fully-connected neural network (FCNN) and the convolutional neural network (CNN). The former takes as input a set of physically motivated observables [44] and will mainly be employed as a check on the results obtained via the latter; we will come back to its details in Sec. 3.4. Since most of our results rely on the image recognition capabilities of the CNN, we describe it in more detail below.
The CNN architecture used in this work is similar to that discussed in Refs. [91,92], to which we refer for technical details. Fig. 1 shows the neural network architecture. We use three convolutional layers and one subsequent fully-connected layer. Each convolutional layer and the fully-connected layer are followed, in turn, by batch normalization [93], PReLU activation [94], dropout [95] (with a rate of 0.2 for the convolutional layers and 0.5 for the fully-connected one) and, after the first and third convolutional layers, average pooling of pool size 2 × 2. The three convolutional layers contain 16, 16 and 32 filters of size 8 × 8, 7 × 7 and 6 × 6, respectively, scanning through the input J(η, φ), or the previous layer, and creating 16, 16 and 32 feature maps of size 33 × 33, 17 × 17 and 17 × 17, respectively. The weight and bias matrices of the convolutional kernels and dense layers are initialized with the "He normal" initializer [94], i.e., a truncated normal distribution with zero mean and standard deviation √(2/N_in), with N_in the number of input units in the weight tensor, and are constrained with L2 regularization [96] in the loss function. Each neuron in a convolutional layer connects only locally to a small chunk of neurons in the previous layer through the convolution operation; this locality is a key reason for the success of the CNN architecture. Dropout, batch normalization, PReLU layers and L2 regularization all work together to prevent overfitting, which would generate model-parameter-dependent features from the training dataset and thus hinder the generalizability of the method. The resulting 32 feature maps of size 9 × 9 from the last average-pooling layer are flattened and connected to a 128-neuron fully-connected layer. The output layer is another fully-connected layer with a single neuron giving the energy loss ratio χ_jh. Overall, there are 395789 trainable and 134 non-trainable parameters in the present neural network.
Supervised learning is employed to tackle this regression task. The difference between the true label and the predicted label from the single output neuron is quantified by the Log-Cosh loss function, log(cosh(x)), which is approximately equal to x²/2 for small x and to |x| − log 2 for large x. The loss is a function of the trainable parameters θ of the neural network. The training minimizes the loss function l(θ) by updating θ → θ − δθ, where δθ = α ∂l(θ)/∂θ and α is the learning rate, with initial value 0.0001, adaptively adjusted by the AdaMax method [97] to speed up convergence. The neural network is trained for 400 epochs. The model parameters are saved to a new checkpoint whenever a smaller validation loss is encountered. The fully-connected neural networks used in this work, despite inputs of different sizes, consist of two hidden dense layers of size 128 and 32, respectively, which are initialized with the "He normal" initializer and constrained with L2 regularization. Each dense layer is followed by a dropout layer (with a rate of 0.2) and a PReLU activation layer.
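The Log-Cosh loss is straightforward to implement in a numerically stable way, using log(cosh x) = |x| + log(1 + e^(−2|x|)) − log 2 (a standalone sketch, independent of any particular deep learning framework):

```python
import numpy as np

def log_cosh_loss(pred, true):
    """Mean Log-Cosh loss between predictions and labels."""
    x = np.asarray(pred, float) - np.asarray(true, float)
    a = np.abs(x)
    # stable evaluation of log(cosh(x)); naive np.log(np.cosh(x)) overflows
    return float(np.mean(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)))

# Quadratic for small residuals, linear for large ones:
print(log_cosh_loss(0.01, 0.0))    # ~ 0.01**2 / 2
print(log_cosh_loss(1000.0, 0.0))  # ~ 1000 - log(2)
```

The quadratic core keeps gradients smooth near the minimum, while the linear tails make the fit robust against the occasional badly mismatched jet, which is why this loss is a common compromise between mean-squared and mean-absolute error.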

Jet sample generation and reweighting procedure
Inclusive jet samples are generated from the hybrid model using p̂_T,min = 50 GeV at √s = 5.02 ATeV, with an oversampling factor of p_T^4 applied to the hard cross section to obtain sufficient statistics at high p_T. The heavy-ion samples correspond to PbPb collisions in the 0-5% centrality bin, with an average temperature of T ≈ 250 MeV. Reconstructed anti-k_T jets with R = 0.4 are required to be within |η| < 2 and to have momentum 100 < p_T^jet < 2000 GeV. This results in approximately 250000 jets, 80% of which serve as training samples while the remaining 20% serve as validation samples, which are not fitted by the network during training.
Given the wide range of jet momenta studied in this work, it is important to ensure that the most common features of the events, such as the shape of the jet spectrum or the typical energy loss fraction, do not introduce any bias in our results. From the point of view of training the neural networks, it is desirable to deal with flat distributions. Since the jet spectrum is steeply falling, in order to obtain enough statistics at higher p_T we use the oversampling method, which consists in multiplying the hard cross section by a power of the p_T involved in the hard process and weighting the event accordingly at the moment of analysis. It is hard, however, to obtain flat distributions merely with this procedure, as one can see in Fig. 2 (left). Moreover, the energy loss ratio χ_jh, the main object of interest for us, also presents a very non-flat distribution, which could again lead to biases in our results, cf. Fig. 2 (right).
The normalized joint (p_T, χ_jh) distribution, shown in Fig. 3 (left), clearly shows that most of the analyzed jets are at low p_T and have lost little energy, χ_jh ≈ 1. Normalizing the joint distribution column-wise, cf. Fig. 3 (right), reveals in addition that the typical χ_jh for a given jet p_T becomes increasingly peaked at high p_T. This means that high-p_T jets tend to lose a smaller fraction of their p_T. Such behavior is expected whenever the amount of energy loss depends only weakly on the initial jet energy [98] and will be discussed in detail below. Hence, providing the bare, joint distribution to the network could bias its performance toward jet samples with low p_T and large χ_jh.
We address this problem by assigning effective sample weights in the training [99], obtaining simultaneously relatively flat distributions for both the jet p_T and χ_jh. This is done in the following way. In the loss function, each sample is assigned a weight inversely proportional to the effective sample number [99] of the bin of the 2-D (p_T, χ_jh) joint histogram corresponding to the jet p_T and χ_jh of that particular sample. The effective sample number in a bin containing N samples is

    E_N = (1 − β^N)/(1 − β) ,

where β is the probability that a new sample in that bin does not overlap with the previous samples; the j-th sample thus contributes β^(j−1) to the effective number. Here we set β = 0.9998, so that the largest possible effective number of samples is 1/(1 − β) = 5000. To keep the variation of the sample weights in the loss function from becoming too large, given the limited number of samples in a training batch, the largest sample weight is restricted to 20 times the smallest one, corresponding to an effective number of 250. This choice of β and this restriction provide good resolution of the sample number across bins while avoiding sample weights of very different magnitudes. The effect of the reweighting on the joint (p_T, χ_jh) distribution is shown in Fig. 4, where the left (right) plot shows the distribution before (after) the reweighting (note the log scale on the z-axis in these histograms).
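The reweighting scheme can be sketched as follows (bin edges and the unit-mean normalization are illustrative choices of ours; only β = 0.9998 and the factor-20 cap come from the setup described above):

```python
import numpy as np

def effective_number(counts, beta=0.9998):
    """Effective sample number E_N = (1 - beta**N) / (1 - beta)  [99]."""
    return (1.0 - beta ** counts) / (1.0 - beta)

def sample_weights(pt, chi, pt_edges, chi_edges, beta=0.9998, max_ratio=20.0):
    """Per-jet weights inversely proportional to the effective number of the
    (pT, chi) bin each jet falls in, with the weight spread capped."""
    counts, _, _ = np.histogram2d(pt, chi, bins=[pt_edges, chi_edges])
    eff = effective_number(np.maximum(counts, 1.0), beta)
    i = np.clip(np.digitize(pt, pt_edges) - 1, 0, len(pt_edges) - 2)
    j = np.clip(np.digitize(chi, chi_edges) - 1, 0, len(chi_edges) - 2)
    w = 1.0 / eff[i, j]
    w = np.minimum(w, w.min() * max_ratio)   # cap at 20x the smallest weight
    return w / w.mean()                      # normalize to unit mean
```

Jets in sparsely populated (p_T, χ_jh) bins then contribute more strongly to the loss, flattening the effective training distribution.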

Jet image analysis
This Section first introduces the input to the network, i.e., the jet image, and the pre-processing procedure applied to it. We present the average pre-processed jet image for different χ jh ranges, as well as the correlations of χ jh with each pixel of the jet image and with a set of jet observables, to hint at the possibility of extracting, jet by jet, the amount of suffered energy loss from measurable jet properties. Next, we present our main result on the prediction performance for χ jh , explore the robustness of the performance, and interpret the success of the prediction made by the algorithm.

Jet image and pre-processing
The input to the neural network is a so-called jet image J(η, φ). It represents the total p T of the jet constituents deposited in each pixel of the (η, φ) plane, with 33 η-bins and 33 φ-bins. Since we use a fixed jet radius R = 0.4, the image covers |η| ≤ 0.4 and |φ| ≤ 0.4 relative to the jet axis.
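As an illustration, the pixelation described above can be sketched with numpy's histogram2d; the function name and the toy constituents below are ours.

```python
import numpy as np

def jet_image(eta, phi, pt, npix=33, radius=0.4):
    """Discretize jet constituents into a 33 x 33 pT image.
    (eta, phi) are measured relative to the jet axis; each pixel
    stores the summed transverse momentum of the constituents
    that fall into it."""
    edges = np.linspace(-radius, radius, npix + 1)
    img, _, _ = np.histogram2d(eta, phi, bins=(edges, edges), weights=pt)
    return img

# toy jet: a hard core, a nearby fragment and a soft particle
# towards the edge of the cone
eta = np.array([0.00, 0.01, 0.35])
phi = np.array([0.00, -0.02, 0.30])
pt  = np.array([120.0, 40.0, 1.5])
img = jet_image(eta, phi, pt)
```

The total pixel intensity equals the scalar pT sum of the constituents, and the hard core lands in the central pixel.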
In general, training algorithms may benefit from pre-processing of the datasets. The input here, the jet image J(η, φ), is a 33 × 33 matrix. Following Refs. [45,51], the jet image is first pre-processed by translation: the hardest groomed subjet is placed at (η, φ) = (0, 0). Then the jet image is rotated around the center so that the second hardest groomed subjet lies at the angle −π/2. If no second hardest groomed subjet exists, the jet image is instead rotated such that the first principal component of its pixel intensity distribution is aligned with the vertical axis. The final step is a parity flip such that the right side of the jet image has the larger pixel intensity sum.
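A minimal sketch of the translation, rotation and parity flip follows. For simplicity it acts on the constituent coordinates before pixelation rather than on the binned image (equivalent up to discretization effects); the names and conventions are illustrative.

```python
import numpy as np

def preprocess(eta, phi, pt, lead, sub=None):
    """Translate / rotate / flip jet constituents.
    lead and sub are the (eta, phi) positions of the two
    hardest groomed subjets; sub may be absent."""
    # 1) translation: hardest groomed subjet to the origin
    x = eta - lead[0]
    y = phi - lead[1]
    if sub is not None:
        # 2) rotation: second hardest groomed subjet onto angle -pi/2
        theta = np.arctan2(sub[1] - lead[1], sub[0] - lead[0])
        rot = -np.pi / 2 - theta
        c, s = np.cos(rot), np.sin(rot)
        x, y = c * x - s * y, s * x + c * y
    # 3) parity flip: right half-plane carries the larger pT sum
    if pt[x > 0].sum() < pt[x < 0].sum():
        x = -x
    return x, y
```

For a two-particle jet with the subleading subjet directly above the leading one, the procedure maps the leading subjet to the origin and the subleading one to the negative vertical axis.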
In Fig. 5, we show the average jet image normalized by the jet p T for different ranges of χ jh . We can easily recognize the general features of jet quenching phenomenology, namely that quenched objects (top left, most quenched) present a larger number of softer particles at larger angles than unquenched ones (bottom center, least quenched). The amount of soft radiation also smears the hard prong structure, which appears vertically below the core pixel and corresponds to the next-to-hardest subjet in the jet. For the most quenched sample, see Fig. 5 top left, the average jet image is almost rotationally symmetric. The bottom right plot in Fig. 5 visualizes the correlation of each pixel with the extracted energy loss ratio and will be explained in the next Subsection.
Besides the above pre-processing steps, some pre-processing methods widely used in computer vision, such as image standardization, can be applied. We refer to each pixel of the jet image as one "feature" and to each jet image as one "sample". The pre-processing of the jet image can then proceed in a feature-wise (per pixel) or sample-wise (per image) manner.
In the feature-wise standardization, the jet images J(η, φ) of all the training samples are pre-processed in a sample-interdependent manner: from each feature the mean over all training samples is subtracted, and the result is divided by the corresponding standard deviation. In this way, all features are centered around zero and have variances of the same order, preventing a feature with larger variance from dominating the objective function over the others. The transformation is saved and later applied to the testing samples. We refer to such transformed jet images as "feature-wise standardized jet images" in the following discussion.
In the sample-wise standardization, the jet images J(η, φ) are pre-processed independently of each other. The pixels of each jet image are rescaled either to have zero mean and unit variance, or to a specific range such as [0, 1] (min-max normalization). In this work we use the jet image normalized by the jet p T as an example of a sample-wise pre-processing method.
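The two standardization modes can be illustrated on a toy batch of images. The variable names are ours; in practice the saved feature-wise statistics must be computed on the training set only and then applied unchanged to the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 33, 33))          # toy batch of jet images
flat = images.reshape(len(images), -1)       # one row per sample

# feature-wise: statistics per pixel, taken across the training set;
# (mu, sigma) are saved and later reused for the test samples
mu = flat.mean(axis=0)
sigma = flat.std(axis=0)
feature_wise = (flat - mu) / np.where(sigma > 0, sigma, 1.0)

# sample-wise: each image rescaled using only its own pixels, here
# normalized by its own pT sum, as done for the jet-pT normalization
sample_wise = flat / flat.sum(axis=1, keepdims=True)
```

After the feature-wise transform every pixel has zero mean and unit variance over the batch, while after the sample-wise transform every image sums to one.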
In this work, we will use the raw jet image, the pre-processed jet image with only translation, rotation and flipping (referred to simply as "jet image" in the following unless explicitly stated otherwise, and used in most of what follows), the feature-wise standardized jet image and the jet image normalized by the jet p T , respectively, as inputs to the neural network, in order to probe the impact of these pre-processing methods.

Figure 6. Pearson correlation coefficients between χ jh and jet observables.

A first look at correlations
In the bottom right panel of Fig. 5, we show the Pearson correlation coefficient between χ jh and each pixel of the jet image (not the normalized one). The Pearson correlation coefficient between samples x and y of population size n is defined as

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( √[Σ_{i=1}^{n} (x_i − x̄)²] √[Σ_{i=1}^{n} (y_i − ȳ)²] ).

The value of the coefficient varies in the range r ∈ [−1, 1]: a value of 1 means total positive linear correlation, 0 means no linear correlation, and −1 means full linear anti-correlation. Indeed, larger values of χ jh are characteristic of those jets that retain most of their energy in the hard structures of the main subjets. The anti-correlation between χ jh and the soft, large-angle particles in the jet cone illustrates that energy is taken away from the leading particle(s) and spread to large angles within the jet. In Fig. 6, the Pearson correlation coefficients between χ jh and the chosen set of observables, as well as the physically immeasurable quantity L, are shown in ascending order. As expected, χ jh is strongly anti-correlated with the length L traversed by the jet in the QGP. The energy loss ratio is also strongly anti-correlated with the jet multiplicity, while it is mildly anti-correlated with the jet mass M and the groomed substructure observables R g , M g and n SD , in decreasing order. Its correlation with z g is very slight.
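For concreteness, the coefficient defined above in a short numpy sketch (equivalently, np.corrcoef could be used); the function name is ours.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient:
    r = sum((x_i - xbar)(y_i - ybar))
        / (sqrt(sum (x_i - xbar)^2) * sqrt(sum (y_i - ybar)^2))."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# in the analysis of Fig. 5 (bottom right), x would be the chi value of
# each jet and y the pT in one fixed pixel across the jet sample;
# r close to -1 signals that energy flowing into that pixel accompanies
# larger energy loss (smaller chi)
```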
We observe that χ jh is slightly correlated with the (final, quenched) jet p T , as also demonstrated in Fig. 3 (right). Such correlation arises mainly from the fact that χ jh is a relative energy loss, so that at high p T the value of χ jh increases for a fixed absolute energy loss ∆E. Another reason is that for higher (final, quenched) p T it becomes unlikely to produce low values of χ jh : such a jet would have had to start with a very large p T , close to the kinematical limit, where the spectrum dies off.
The presence of the correlations briefly discussed here clearly hints at the possibility of extracting, jet by jet, the amount of energy loss suffered, χ jh , from measurable jet properties.

Prediction performance
We train and validate the neural network with the above setup. Before examining the performance of the network, we try to understand what the CNN has learned by opening and visualizing it. In Fig. 7, we show the 16 filters of the first convolutional layer of the CNN through their learned weights (left panel) and the corresponding difference in activation between the average quenched (0.25 < χ jh < 0.5) and unquenched (0.85 < χ jh < 1, see Fig. 5) jet images normalized by the jet p T (right panel). These 16 filters are quite different from one another, which indicates that they tend to extract different features. Some filters tend to be activated by the quenched jet images, others by the unquenched ones. Features such as the hardest and second hardest subjets, the distance between them, and the pattern of soft particles can be captured by these filters. The jet-by-jet internal structure of the soft particles is smeared out in these averaged jet images, so its activation is not directly visible here.

Fig. 8 shows the χ jh prediction performance of the CNN from the pre-processed jet image. The green column-normalized joint distribution represents the probability of the predicted χ jh within a given true χ jh bin. The red line with error bars quantifies the average and standard deviation of the predicted χ jh within a given true χ jh bin; the error bars decrease with χ jh . Overall, the CNN predicts χ jh successfully over the whole range. We have checked that, by applying the sample weights in the training and validation, the prediction performance depends only mildly on χ jh . Meanwhile, the prediction performance still clearly degrades with decreasing p T , which shows that the reweighting procedure cannot eliminate this trend and that there must be an intrinsic physical reason, as explained above. In Appendix B, we show the prediction performance against various jet observables in detail. In particular, the performance against χ jh and jet p T is shown in Figs. 22-23.
We also checked the individual performance for quark- and gluon-initiated jets, as assigned by a matching procedure analogous to that explained in Section 2.2. Even though there are small differences around extreme values of χ jh , we observe no notable bias with respect to the jet species in the overall performance.
In Tab. 1, we present the prediction performance of the CNN for the different jet images, in terms of the validation loss measuring the difference between true and predicted χ jh . The prediction performance from the pre-processed jet image is very close to that from the raw one, which means that the pre-processing is not obviously beneficial for this regression task and that the CNN learns the rotation-invariant quantity χ jh automatically. We find that the feature-wise standardized jet image and the normalized jet image introduced in Section 3.1 give only comparable or worse performance: the feature-wise standardization may slightly distort the internal structure of the jet image, and, as the above analysis shows, the jet p T is an important feature for this task. An important consistency test of our procedure consists in making sure that the CNN assigns values χ jh ≈ 1 to vacuum jets created in proton-proton collisions. Having trained the network using medium jets only, we indeed find that the energy loss ratio distribution for vacuum jets has an average value of χ jh = 0.98(3), i.e., 0.98 with a standard deviation of 0.03.

Sensitivity to soft and large-angle radiation
One of the most interesting outcomes of a machine learning task such as the one we performed is learning which features and correlations the algorithm deems most relevant to carry out a successful prediction. By feeding the network with different combinations of jet properties and observing the change in performance, in this Section we demonstrate our ability to discern the most relevant features of the jet image. Moreover, with such information, we can construct human-understandable observables that come close to the machine's level of performance.
As a first check, we use a groomed jet image as input to the CNN, i.e., the jet image after all the branches prior to the first one satisfying the SD condition are discarded. In Fig. 9, we show the average groomed jet images normalized by the jet p T , sliced in the same χ jh bins as in Fig. 5. One can see that some soft and large-angle particles are removed by the jet grooming procedure, but the correlation of the groomed jet image with χ jh survives to a large extent. We notice that the performance with the groomed jet image is reduced compared to the full jet image, as shown in Tab. 2. Even though grooming certainly reduces contamination from soft, less well understood processes such as hadronization, the lowered performance hints that it is precisely in the soft, large-angle particles that important imprints of the energy loss effects lie. It is worth mentioning that the SD parameters could be tuned to improve the performance.
In a cruder approach, we can also use as input to the CNN jet images from which soft particles (p T < 1 GeV or p T < 2 GeV) have been removed, and compare with the full jet image. From Tab. 2 one can see that the soft particles (e.g., those with p T < 2 GeV) contain considerable discriminating information, given the large associated loss in performance, consistent with the conclusions drawn from the study of the groomed jet image. Jet grooming and the crude removal of soft particles both amount to a hard attention mechanism, in which we focus attention on the remaining particles in order to understand the decision-making of the neural network; see [100] for a further application. This sensitivity is an interesting feature of the problem, but at the same time presents a challenge for the detailed modeling of jet quenching in the presence of a full heavy-ion background.
These observations lead us to construct a simple quantity that we will call the hard ratio χ h , defined as

χ h = Σ_{i: p T,i > 2 GeV} p T,i / Σ_{i} p T,i ,

that is, the fraction of the total jet transverse momentum carried by "hard" particles, where by hard we here mean particles with p T > 2 GeV. This hard ratio χ h is found to be strongly correlated with χ jh , as shown in the column-normalized joint distribution of χ jh and χ h in Fig. 10. Note that, even though the fluctuation of χ h within a given χ jh bin is comparable to that of the χ jh predicted by the network, the hard ratio χ h is not as discriminating as the network, since its non-linear correlation with χ jh does not give a direct correspondence. Especially for samples with large χ jh , the hard ratio χ h has poor discriminating power due to the small slope. We also train a fully-connected neural network (FCNN) to predict χ jh from physics-motivated high-level features [44], which include the jet shape (JS), the fragmentation function (FF), the jet p T , z g , n SD , R g , the jet mass M , the groomed jet mass M g , and the jet multiplicity. Here, the JS and FF are the projections of the jet image along the radial direction and along the shared momentum fraction z, respectively. This is not necessarily a mutually orthogonal set of observables, but it captures the physics information contained in the image, as we will see from the performance. We show the correlation matrix for this set of observables in Appendix A. In addition to the jet p T , which is particularly important for our study, one could instead study an extended set of N-subjettiness variables [101], energy-flow polynomials [102], or counting observables, e.g., [103]. For a more in-depth discussion of this point see, e.g., [54,104].
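The hard ratio defined above takes only a few lines to compute; this sketch uses our own function name and a toy jet.

```python
import numpy as np

def hard_ratio(pt, pt_min=2.0):
    """chi_h: fraction of the total jet pT carried by particles with
    pT > pt_min (2 GeV in the text); a simple, measurable proxy for
    the energy loss ratio chi_jh."""
    pt = np.asarray(pt, float)
    return pt[pt > pt_min].sum() / pt.sum()

# toy jet: a 90 GeV core, an 8 GeV fragment and 2 GeV of soft debris
chi_h = hard_ratio([90.0, 8.0, 0.5, 0.7, 0.8])  # -> 0.98
```

An unquenched-looking jet, with most of its pT in hard fragments, gives χ h close to 1; soft, large-angle activity from quenching pushes χ h down.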
From Tabs. 1 and 3, one can see that the jet shape gives performance comparable to the p T -normalized jet image and outperforms the jet FF. This is a fair comparison, since the jet shape does not contain the jet p T information either. It demonstrates that the energy distribution in the jet cone is more important than the energy sharing among the particles. It also indicates that the azimuthal distribution of particles in the jet cone contains little discriminating power, which can be attributed to the splitting and radiation probability being uniform in the azimuthal angle within the jet cone. Since the jet shape, with its 8 bin values, contains the main information on the energy loss, we try to use it to fit χ jh with a simple parametrization,

χ jh ≈ Σ_{i=1}^{8} α_i (p T i /p T )^{β_i} + γ ,

where p T i /p T is the fraction of the jet p T at radius r i and {α_i , β_i , γ} is a 17-parameter fit set.

Input (size)                              | Output | Network       | Loss
FF (10)                                   | χ jh   | FCNN          | 0.0058
Jet shape (8)                             | χ jh   | FCNN          | 0.0033
FF, jet shape (18)                        | χ jh   | FCNN          | 0.0032
FF, jet shape, features (25)              | χ jh   | FCNN          | 0.0028
Jet image & FF, jet shape, features (25)  | χ jh   | API: CNN&FCNN | 0.0028

Table 3. Predictive performance with different inputs and neural networks. The jet image is pre-processed by default. Features include: jet p T , z g , n SD , R g , M , M g , multiplicity.

Fig. 11 shows the fit performance, which, though somewhat worse than the network prediction, is already much better than the hard ratio χ h .
The combination of the jet shape, the FF and the other physics-motivated features, including jet p T , z g , n SD , R g , jet mass M , groomed jet mass M g and jet multiplicity, results in a performance comparable to the jet images. If we take the combination of these physics-motivated features and the jet image together as inputs to a more complex and flexible neural network model consisting of a CNN and an FCNN, built with the Keras Functional API, the performance does not improve with respect to each input alone, which indicates that this combination and the jet image share almost the same discriminating information. To quantitatively estimate and rank the discriminating power of each jet observable and of the jet image, even pixel by pixel, the layer-wise relevance propagation method could be applied in the future, as in Ref. [105].

Applications
Having proven our ability to directly access the degree of jet-by-jet modification due to energy loss within the QGP, expressed through the energy loss ratio χ jh , we can now start asking questions about the origins and fate of a reconstructed jet. Intuitively, a large energy loss, i.e., χ jh significantly smaller than 1, implies significant jet-medium interaction. This could be because the jet traversed a long distance in the hot QCD plasma or because the in-medium jet multiplicity was high, leading to a stronger quenching effect. While our current setup does not allow us to disentangle these two effects, we now introduce two novel applications we can explore. The first deals with devising new jet observables to enhance the quenching effects. The second is a first step toward using jets truly as tomographic probes of the QGP.

Sensitivity of observables to in-medium modification
When we study a jet observable in heavy-ion collisions we typically have to select jets within a certain momentum range, or at least above some minimum p T . In order to quantify the modification of the observable due to the presence of the medium, we usually construct a ratio between the medium distribution of the observable and the one obtained in pp collisions, for the same kinematical selection. The interpretation of the resulting ratio is, alas, not unique. It is fair to ask to what extent the modification that we observe is a consequence of the modification of a given jet due to energy loss. Due to the underlying steeply falling production spectrum, it is unlikely to measure a jet that has lost a lot of energy, as it would have had to start at a much higher energy, where the spectrum is suppressed as a negative power of p T . This means that imposing a p T cut on the inclusive jet distribution naturally selects those jets that have, on average, lost the least energy. This selection bias is an unavoidable fact of jet quenching physics that limits the interpretability of conventional observables.
A jet, reconstructed with a given radius R, is not characterised by its energy p T alone. One key characteristic is the identity of the jet's partonic ancestor, i.e., whether it is a quark or a gluon jet. The latter has a larger color charge, which leads both to a stronger interaction with the medium and to a wider fragmentation pattern. In addition, within a given energy bin, jets can consist of narrow, sparsely fragmented structures as well as wide, or more populated, ones. Jets with more fragments will tend to lose more energy [106,107], provided that the fragments are created in (and are resolved by) the medium [42], than those with fewer components. This is simply because each parton of the jet can in principle undergo elastic scattering and medium-induced radiation, so that the total amount of energy lost by the components building an extended object like a jet roughly scales with the number of partons resolved by the medium 4 .
The different way in which jets with varying widths are quenched is now a key ingredient in our understanding of di-jet asymmetry [106], which was long regarded as stemming solely from the difference in in-medium path length traversed by the leading and subleading jets. Also, the fact that wider jets lose more energy, combined with the steeply falling jet spectrum and the p T -dependent quark-gluon jet fraction, allows us to understand the narrowing and hardening of the core of the inclusive jet samples measured in heavy-ion collisions compared to proton-proton collisions. Even though each individual jet can get wider as a result of the interaction with the medium, the fact that the wider jets tend to lose more energy can leave us with a narrower final jet ensemble once we impose a p T cut [79,112,113]. Essentially, the steeply falling spectrum turns the medium into a very effective filter in which the jets we observe in a given p T range are those that lost the least energy. Therefore, losing less energy does not simply mean traversing less QGP: there is a selection bias in terms of the properties of the jet, such that narrower jets, or those with less resolved color charge, are more likely to pass the selection criteria and enter the distributions of any given observable.

4 Here, "resolved" means that two partons coming from a given splitting have had enough time to decohere, i.e., that they are separated enough to be resolved as two independent color currents interacting with the medium. Otherwise, we say that the medium cannot resolve the dipole, and it is quenched according to its combined charge only. The scales that govern the physics of resolution depend on the properties and extent of the medium and have been the subject of a great amount of study [108][109][110]. These physics have been incorporated in the hybrid model [111], but for simplicity we have chosen to omit them in the present study; their effects will be explored in future work.
In this Section, we employ our new technique for extracting χ jh to investigate the response of a set of selected observables to the amount of energy loss experienced by the jet. To do so, we divide the full jet population into two selections, the "quenched" class, with χ jh < χ cut jh , and the "unquenched" class, with χ jh > χ cut jh , and compare the resulting distributions of the two selections. Within our regression setup, we can choose χ cut jh freely, or even quantify the degree of quenching in finer detail using multiple classes. In our study we make the reasonable choice χ cut jh = 0.9. The modification of observables for a finer set of χ jh classes is discussed in more detail in Appendix C.
In order to understand the actual implications of performing a cut in χ jh , we show the distribution of χ jh in the left panel of Fig. 12. These results are obtained using the realistic jet spectrum for R = 0.4 anti-k T jets produced at √s = 5.02 ATeV, where the jet is required to have p jet T > 200 GeV. We compare the results for the true label χ jh with those for the predicted one, χ p jh . The vertical dashed line marks the location of the chosen χ jh = 0.9 cut. As argued in the discussion above, this distribution is biased towards jets that have lost little energy, as a result of the steeply falling jet spectrum. Therefore, the average values of χ jh above and below the cut will be fairly similar, as will be the modifications of their observables with respect to vacuum. This is a consequence of the fact that the jet selection, as done in experiments, is applied to the final, measured jet energy, both for jets created in heavy-ion collisions and in proton-proton collisions.
However, we can now perform the jet selection on the initial jet energy instead. Using our knowledge of the amount of energy loss χ jh that a measured jet with momentum p jet T has experienced, we estimate the initial energy as p jet T /χ jh , as shown in the right panel of Fig. 12. Hence, we can select jet samples according only to the energy they had before quenching, such that, in contrast to the previous choice, no implicit requirement to "pass the cut", i.e., to have lost relatively little energy, is introduced. More specifically, we choose jets whose final energy is at least p jet T > 100 GeV and whose initial jet p T was significantly higher, p jet T /χ jh > 200 GeV. This novel approach allows us for the first time to study jet observables in a way that was previously only possible by looking under the hood of the energy loss model.
We can see in the right panel of Fig. 12 that the energy loss distribution has much more support at lower values of χ jh , so that by performing a simple cut at χ jh = 0.9 we really are separating jets into two classes with very different amounts of medium modification. From now on, we will refer to the selection based on the final jet energy (left panel of Fig. 12) as Final Energy Selection (FES), while the selection based on the initial jet energy (right panel of Fig. 12) will be the Initial Energy Selection (IES).
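The two medium-jet selections can be sketched as boolean masks, using the cuts quoted in the text; the function and variable names are our illustration.

```python
import numpy as np

def select(pt, chi, scheme):
    """Selection masks for medium jets.
    FES: final (measured) pT > 200 GeV.
    IES: final pT > 100 GeV and estimated initial pT = pT / chi > 200 GeV."""
    pt, chi = np.asarray(pt, float), np.asarray(chi, float)
    if scheme == "FES":
        return pt > 200.0
    if scheme == "IES":
        return (pt > 100.0) & (pt / chi > 200.0)
    raise ValueError(scheme)

# toy sample: (final pT, chi) for four jets
pt  = np.array([250.0, 150.0, 150.0, 90.0])
chi = np.array([0.95, 0.70, 0.95, 0.60])
fes = select(pt, chi, "FES")  # only the 250 GeV jet passes
ies = select(pt, chi, "IES")  # also the strongly quenched 150 GeV jet
```

Note how IES recovers the jet that lost 30% of its energy, which FES discards: this is precisely the enrichment in quenched jets discussed above.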
In the following we will show results for medium over vacuum ratios of jet observables, defined in Section 2.4, in which we compare FES (left column of each figure) with IES (right column of each figure). We also confront the results obtained using our algorithm to extract χ p jh (bottom row of each figure) against the true value of χ jh as extracted from the Monte Carlo model (top row of each figure). All results using either the true or the predicted value of χ jh share the same qualitative features, and in many cases the agreement reaches quantitative level, as we show in the Figures below.
For FES and IES, jet selection is done with the following cuts:
• FES. Medium: p jet T > 200 GeV. Vacuum: p jet T > 200 GeV.
• IES. Medium: p jet T > 100 GeV and p jet T /χ jh > 200 GeV. Vacuum: p jet T > 200 GeV.
The reason why we have an extra lower cut of p jet T > 100 GeV on medium jets for IES is that lower-p T jets tend to be harder to reconstruct in experiments, due to their lower signal-to-background ratio, and that for the moment we have decided not to extend our machine learning analysis down to these lower momenta.
This way of presenting the sensitivity of our results to χ jh , based on the IES and FES setups, is akin to current phenomenology standards and emphasises the role of the selection bias and the difference in jet populations when treating inclusive samples, but it is only one among several possible choices. As mentioned before, we also provide results for all the observables, finely sliced in the (true label) χ jh for the FES setup, in Appendix C, which the reader can consult for convenience.

Groomed observables
We start by discussing groomed observables within the Soft Drop procedure, with parameters z cut = 0.1 and β = 0. In Fig. 13, we show the ratio between PbPb and pp for the number of SD splittings n SD . This is a good proxy for the jet multiplicity (although in the iterative setup it only counts the primary emissions, that is, the ones along the hardest branch of the clustering tree) and has been measured in heavy-ion collisions [34,35]. A reduction in the average n SD has been observed in experiments. This feature has been attributed to energy loss and is well reproduced by jet quenching models [61,107]. With FES, we observe that the results are consistent between the "quenched" and the "unquenched" classes, as expected from the discussion above about the energy loss distributions of Fig. 12; see the left column of plots in Fig. 13. Both classes contain jets that mainly lost little energy. Such jets tend to be of a special kind and, as we infer from the results in Fig. 13, they tend to have a smaller n SD than the average. Therefore, by looking only at the FES results, the actual effects of energy loss on this observable are obscured by the existence of the strong selection bias.
We now turn to the observable constructed from the jet samples based on IES, which we expect to be more enriched in quenched jets. Now, the results inclusive in χ jh no longer coincide with the "unquenched" class, revealing more of the genuine medium modification of this observable. Even though the "quenched" class still presents an overall shift towards fewer n SD , characteristic of the jets sitting close to the χ jh cut that lost relatively little energy, we observe that very quenched jets contribute to increasing the number of large n SD ∼ 8 (see the more differential selection in χ jh of Appendix C), due to the copious, widely spread soft particles originating from the excited wake. Even though there are not many very quenched jets within the "quenched" class, their effect at large n SD is notable in the medium over vacuum ratio, since in vacuum there are rarely jets with such large values of n SD . Finally, it is worth emphasizing an extremely important feature of our analysis: we also observe that the ratio between the "unquenched" class and vacuum still deviates from unity. This is a clear manifestation of the same selection bias we have discussed so far. The jet population that ultimately falls into the "unquenched" class tends to be narrower and less fragmented than the average jet population in vacuum. In order to obtain a ratio around unity for the "unquenched" class, one would need to restrict the denominator to a specific vacuum jet population as well, i.e., a population of characteristically narrow jets from which the in-medium jets belonging to the "unquenched" class originated. Such studies go beyond the regression task we have focused on in this work and are left for future studies.
An even more paradigmatic observable of the effect of the selection bias is the groomed angle R g , shown in Fig. 14. In vacuum, jets with a larger R g are jets that typically explore a larger phase space of in-cone emissions, leading to higher multiplicity. If the medium can resolve such internal structure, these wider jets will be more quenched than the average and will fail to pass the cut [107]. This results in a depletion of large-R g jets in PbPb compared to pp, as measured by experiments [34,35] and well reproduced by several Monte Carlo simulations [35,61,107]. While it is true that, for the conventional FES setup, the R g distribution of the jets passing the cut is not significantly modified [114], see the left column of Fig. 14, this is the case only because significantly quenched jets escape such a selection [40]. Indeed, as shown in the IES setup in Fig. 14 (right column), energy loss effects do modify the R g of a jet through the creation of a new, semi-hard branch situated at large angles, composed mostly of the thermal particles originating from the medium response to the passage of the jet [40].
The last of the groomed observables we analyze here is the groomed momentum sharing fraction z g , shown in Fig. 15. It has been measured in heavy-ion collisions by CMS [23] and ALICE [34,35], where an enhancement at small values of the z g distribution, for R g > 0.1, has been reported. Several theoretical calculations predicted an enhancement of the low-z g part of the distribution due to a modification of the partonic splitting functions in the medium [61,[115][116][117], although other model studies have highlighted the importance of jet-medium recoil effects in explaining the observed excess [118]. Even so, the thermal hadrons from the wake of the hybrid model that we employ in our studies are typically too soft to produce any significant enhancement at low z g in the FES setup [107]. Nevertheless, such results are consistent with measured data when one takes into account the contamination from the large fluctuating background, which precisely dominates the low-z g part of the distribution at large angles [34,119]. In contrast to the previously discussed observables, the fact that in the FES setup the "unquenched" class ratio between PbPb and pp sits at unity indicates that z g is not significantly affected by the selection bias, as it does not determine the jet phase space in the way that R g does. 5 Therefore, the "unquenched" class in the IES setup also presents a ratio around unity. However, sufficiently quenched jets now show a clear enhancement at low z g values, coming from the soft branches at large angles dominated by the contribution of medium response particles.

Ungroomed observables
In this Section we analyze two of the most widely studied substructure observables in jet quenching, namely the jet fragmentation function (JFF) and the jet shape (JS), applying the same methodology to unravel the effects of jet quenching as in the previous Section. From the left column of Fig. 16 we can infer that the modification of the jet fragmentation function for the "unquenched" class is an indication of the selection bias in this observable. Indeed, we observe an enhancement of the large-z contribution, that is, a bias towards jets that are less fragmented than average. Such hardening of the inclusive fragmentation function in the medium, also called jet collimation or core narrowing, has been measured by ATLAS [21,24] and is reproduced by several jet quenching models [71,[120][121][122][123]. Even though in the FES setup of Fig. 16 we observe a non-negligible modification for the "quenched" class, consisting of a moderate depletion of the energy of the hardest fragments in the jet, such modifications are greatly enhanced in the IES setup. In particular, in the latter selection we can see a clear enhancement of the contribution from the soft particles coming from the medium response, both because there is a larger number of such particles in the jets belonging to the "quenched" class of IES, and because their relative energy fraction z with respect to the jet energy is larger than for the higher-p T jets of FES.
Finally, very similar qualitative statements apply to the jet shape ratio shown in Fig. 17. Both the notable depletion of the energy from the partons at the core and the sizeable enhancement of low p T particles at the periphery are observed when going from the FES to the IES setup. However, for this observable, the separation between the "quenched" and "unquenched" class bands is wider throughout all bins in r than for the fragmentation functions across the bins in z, as seen from Fig. 16. This is consistent with our findings in Section 3.4 that the modification of the JS contains more information on χ jh than that of the JFF. The inclusive jet shape has been measured in heavy-ion collisions by CMS [15,20], ATLAS [30] and ALICE [33], and also for semi-inclusive boson-jet systems by CMS [28]. Interestingly, the modifications observed in the IES setup more closely resemble those of the boson-jet systems, which is due to the fact that in such analyses the triggered boson p T serves as a proxy for the initial energy of the jet (this also applies to the results for the JFF in Fig. 16, which have been measured in boson-jet systems by CMS [29] and ATLAS [32]). This set of jet shape measurements is well reproduced by a variety of jet quenching models [71,121-124] and analytic computations [125].
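The differential jet shape ρ(r), i.e., the fraction of the jet transverse momentum carried at radial distance r from the jet axis, can be sketched as below. The input format and function name are illustrative assumptions; a full analysis would compute the (η, φ) distance from four-momenta.

```python
import numpy as np

def jet_shape(jets, R=0.4, n_bins=8):
    """Differential jet shape rho(r), averaged over jets.

    jets: list of (jet_pt, [(pt, dr), ...]) with dr the constituent's
    distance to the jet axis in the (eta, phi) plane."""
    edges = np.linspace(0.0, R, n_bins + 1)
    rho = np.zeros(n_bins)
    for jet_pt, constituents in jets:
        pts = np.array([pt for pt, dr in constituents])
        drs = np.array([dr for pt, dr in constituents])
        # pT-weighted radial profile, normalized to the jet pT
        frac, _ = np.histogram(drs, bins=edges, weights=pts / jet_pt)
        rho += frac
    return rho / (len(jets) * np.diff(edges))
```

Note that this binning of the radial energy profile is precisely the lower-dimensional projection of the normalized jet image referred to in the text.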

Tomography
The second application that we explore in this paper involves the relation between jet energy loss and the traversed length L across the QGP, as defined in Eqs. (2.4) and (2.7). By selecting jets that have suffered different amounts of energy loss using our newly introduced technique to extract χ jh , we are in fact selecting jets with different in-medium path lengths L. This new capability opens the path to the extraction of QGP properties through jet quenching analyses, in other words, to tomographic applications, in ways that were not possible before. It is important to keep in mind that our machine learning setup does not learn to predict L from the data. It would indeed be desirable to extract the in-medium path length L jet-by-jet directly, but carrying out such a task with sufficient precision has proven considerably more challenging than the extraction of χ jh and will hopefully be tackled in future work. The application in this Section is therefore "hybrid" in the sense that we supplement the information extracted by the machine learning setup, i.e., χ jh , with information provided by the model that produced the data. The distributions of the in-medium length L are shown in Fig. 18, both for the true value of χ jh (left panel) and its predicted value χ p jh (right panel). We stress again that the value of L for each jet is not extracted by our algorithm but is taken directly from the Monte Carlo. One can clearly see that the average value of L grows as the energy loss increases, as expected. This means that we are selecting jets that were created at different points in the transverse plane. Indeed, we can corroborate this statement by looking at the creation point distribution in the impact parameter plane {x, y}, presented in Fig. 19. Again, the pair of values {x, y} is not extracted through our algorithm, but taken from the Monte Carlo instead.
We observe that jets that are barely quenched originated within a ring structure at the periphery of the collision. The production point shifts towards the centre of the initial geometry as energy loss increases. Note that for the most quenched class, 0.25 < χ jh < 0.6, the distribution is more spread out than for the next, less quenched class, 0.6 < χ jh < 0.75. This is because the very quenched jets have had to traverse the largest in-medium lengths, which can correspond to configurations in which the di-jet pair was created at the periphery and one of the jets flew inwards, towards the centre of the QGP. This observation points to a possible improvement of the current analysis by taking into account the orientation of a jet with respect to the event plane of the collision, which is left for future work.

Conclusions and Outlook
In this work we employed deep learning techniques to predict jet energy loss on a jet-by-jet basis, using jet images and various physics-motivated features in a systematic and comparable way, and achieved good performance. We have also assessed and interpreted the success of the network using a multi-faceted approach.
We have found that the soft particles within the jet cone carry very relevant information about the magnitude of energy loss. In particular, the total number of activated pixels of the jet image, or jet multiplicity, is dominated by the production of thermal particles due to quenching and is found to be highly correlated with the energy loss ratio. This observation motivates us to construct the "hard ratio", which we define as the relative contribution of hard particles (p T > 2 GeV) to the total energy of the jet. Even though it exhibits a strong correlation with the energy loss ratio, its performance is notably worse than that obtained with deep learning techniques. Furthermore, jet shapes and jet fragmentation functions, which are the lower-dimensional projections of the normalized jet image, have also been used as inputs to the neural network. The good results obtained with the jet shapes, comparable to those obtained via the normalized jet image, provide an alternative scheme and a candidate interpretation of the good performance given by the jet image.
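The "hard ratio" described above admits a one-line implementation; the 2 GeV cut is taken from the text, while the use of the scalar pT sum as the total jet energy is our simplifying assumption.

```python
def hard_ratio(constituent_pts, pt_cut=2.0):
    """Relative contribution of hard particles (pT > pt_cut, in GeV)
    to the total (scalar-summed) transverse momentum of the jet."""
    total = sum(constituent_pts)
    hard = sum(pt for pt in constituent_pts if pt > pt_cut)
    return hard / total
```

A strongly quenched jet, whose energy budget is diluted by many soft medium-response particles, yields a correspondingly smaller hard ratio.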
We corroborate these findings by constructing a universal 17-parameter fit that allows one to extract the energy loss ratio with great precision merely by using the 8 bins of a given individual jet shape. The ability to extract an estimate of the energy lost by jets produced in heavy-ion collisions, achieved here through the extraction of an energy loss ratio, opens the door to a plethora of interesting applications and studies. As a proof of concept of the suitability of deep learning techniques for jet quenching phenomenology we have focused on two illustrative applications. First, we have used our knowledge about the relative energy loss that a given jet has suffered in order to study its observable modifications based on its initial energy. This removes to a large extent the effect of the selection bias on jet observables and enhances the contribution of very quenched jets, allowing us to study jets that have actually suffered considerable energy loss and modification. We note that it would be very interesting to extract not only the initial energy of the jet, but also some measure of its initial width, or some other equally relevant initial feature. This would allow us to construct nuclear modification ratios in which the vacuum distributions in the denominator would not simply correspond to the inclusive proton-proton sample, but rather to the restricted set of jets from which the medium jets belonging to a specific quenching class come. In this way, selection bias effects that are still present within each quenching class could also be removed. This will be pursued in a forthcoming study.
Second, we have shown that using the extracted energy loss ratio we have a handle on which jets have traversed longer distances within the QGP, which constitutes an important step towards the exploitation of the tomographic power of hard probes. Extensions to the limited scope of the presented analysis should include the orientation of the jet with respect to the event plane of the collision, as well as more differential selections based on measures of the jet width such as, for example, R g . Indeed, given what we know about the dependence of energy loss on the width of a jet, and its relation to selection bias, as discussed in Section 4.1, it is easy to anticipate that jets with different (initial) widths, and the same value of the energy loss ratio χ jh , will in general have traversed different amounts of length. In Fig. 20 we show some results supporting this picture (using the true value of χ jh ). In this Figure we show the creation point distribution in the transverse plane, inclusive in χ jh , for jets with a measured (that is, final) R g smaller or larger than 0.1, with a measured p T > 100 GeV (here again, χ jh could be estimated by the neural network while the production points are inputs from the hybrid model). We see that narrower jets tend to pass the momentum cut even if they were produced deep inside the medium, unlike wider jets, which in comparison are pushed towards the surface. Similarly to what was discussed for Fig. 19, such ordering is reversed for the very quenched class shown in the bottom row of Fig. 20, since narrow jets that get this quenched tend to be produced towards the periphery and propagate inwards.
Finally, we bring up two potential shortcomings of our approach. Both are related to the sensitivity to soft radiation at the jet periphery. On the one hand, sensitivity to such modes leads to an expected drop of accuracy once jets are embedded into a realistic heavy-ion environment, due to the presence of a large fluctuating background. Improving the performance under these conditions is highly desirable in order to extend our studies to the analysis of real experimental data. On the other hand, the modeling of such emissions is currently not under firm theoretical control, which could imply a dependency on the particular model. Let us comment briefly on these two points below. In order to check the robustness of our results in a realistic scenario, we embedded our jets into a thermal heavy-ion background and performed background subtraction using constituent subtraction. To this end, we use the JetToyHI tools available in [127]. The signal particles of the reconstructed jet constitute what we here call the embedded jet image. As expected, performance is decreased due to the noise associated with the presence of the large, fluctuating background, consisting of particles whose p T is of the order of that of the soft particles produced through the quenching mechanisms. Improving the performance under the conditions in which jets are actually measured in experiments should be one of the main goals for the near future. However, even with the reduced performance (the obtained validation loss is 0.0072), the analysis of many of the features addressed in this paper remains possible. We postpone a more thorough study in this direction to future work.
We also deem it necessary to test the robustness of our procedure to extract the energy loss ratio by applying it to other jet quenching models. Monte Carlo models that rely on a scale separation between vacuum and medium physics, which results in a marked hierarchy in the formation times of the corresponding emissions, are particularly well suited to this endeavour. Even though these models may differ in their modelling assumptions regarding the form of the medium-induced radiation kernels, the inclusion of coherence effects, the hydrodynamization rate or the values of certain medium parameters, to the extent that such models yield qualitatively similar predictions for jet observables one should expect our network to be capable of finding enough common generalizable features. Furthermore, given the generic, dominant effect of the selection bias on the jet observables of all Monte Carlos, classifying jets into energy loss classes (preferably using the IES setup) will give us a clearer insight into the observable differences among the different models.

A Correlations between jet observables
Figure 21 shows the Pearson correlation coefficient matrix between jet observables. The correlations among R g , M, M g and n SD are all positive; in particular, the one between R g and M g is very strong. Secondly, the correlations between z g and the other jet observables are generally very slight. Lastly, as mentioned in the main text, the travel length L is anti-correlated with χ jh , which means that a longer L will in general lead to more energy loss. However, its correlations with the other jet observables are not always of opposite sign, which implies that the travel length is not the unique factor governing the energy loss ratio.
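A correlation matrix of this kind can be built directly with NumPy; the dictionary-based interface below is our own illustrative wrapper.

```python
import numpy as np

def correlation_matrix(features):
    """Pearson correlation matrix for a dict of per-jet observables,
    e.g. {'chi': [...], 'Rg': [...], 'nSD': [...]}.

    Returns (names, matrix) with matrix[i, j] = corr(names[i], names[j])."""
    names = list(features)
    data = np.array([features[n] for n in names])  # shape (n_obs, n_jets)
    return names, np.corrcoef(data)
```

Off-diagonal entries close to +1 or -1 then flag strongly (anti-)correlated pairs such as R g and M g , or L and χ jh .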

B Prediction Performance versus jet observables
In Figs. 22-30, we show the prediction performance by comparing the averages of the true and predicted χ jh , with error bars, and the prediction error against various features. The error bars in the left panels are the standard deviations of the true and predicted χ jh from the right panels, while the prediction error is the root mean square of the difference between the true and predicted χ jh in the given bin of the selected feature. In general, the average of the predicted χ jh agrees well with the true one across the various features. One can see that the prediction error of χ jh has a relatively flat dependence on the true χ jh , which is attributed to the use of sample weights in the training. For the low χ jh samples, the deviation of the prediction is somewhat larger because of the poorer statistics. The average of χ jh increases with jet p T , and the standard deviations and prediction error of χ jh all decrease with jet p T even though the sample weights are applied in the training. This means that, on average, high p T jets are more likely to be less quenched and their χ jh is more easily predicted, due to the reasons discussed in the main text.
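The binned prediction error described above can be sketched as follows; the function name and interface are our own, but the quantity computed is the root mean square of the difference between true and predicted χ jh in each feature bin.

```python
import numpy as np

def binned_prediction_error(feature, chi_true, chi_pred, bins):
    """RMS of (chi_pred - chi_true) in bins of a chosen jet feature."""
    feature = np.asarray(feature)
    diff2 = (np.asarray(chi_pred) - np.asarray(chi_true)) ** 2
    idx = np.digitize(feature, bins) - 1  # bin index of each jet
    rms = np.full(len(bins) - 1, np.nan)  # NaN marks empty bins
    for b in range(len(bins) - 1):
        mask = idx == b
        if mask.any():
            rms[b] = np.sqrt(diff2[mask].mean())
    return rms
```

Plotting this quantity against the true χ jh , jet p T , L, or the substructure observables reproduces the curves shown in the right panels of Figs. 22-30.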
The average of χ jh decreases significantly with the jet travel length L and with n SD , which means that a longer travel length L leads to more jet energy loss, and that a larger number of Soft Drop splittings n SD is associated with larger energy loss. In contrast, the momentum sharing fraction between subjets, z g , has no clear impact on the average of χ jh . The physical interpretation of these observations is straightforward. Moreover, the average of χ jh first decreases and then increases with R g , jet mass M and groomed jet mass M g . The prediction error increases with length L and n SD but has a very slight dependence on z g and R g , while we observe that it first increases and then decreases with jet mass M and groomed jet mass M g .

C Jet observables sliced in χ jh
Fig. 31 shows normalized histograms of medium jet observables, including n SD (left), R g (middle) and z g (right), inclusive in χ jh and sliced in different χ jh bins, together with the ratio over the vacuum one. Fig. 32 shows the jet FF (left) as well as the jet shape (right), inclusive in χ jh and sliced in different χ jh bins. All plots here are based on the conventional FES setup with p jet T > 100 GeV for both vacuum and medium jets. In these Figures one can see with greater resolution the extent to which these jet observables are sensitive to χ jh . In particular, both the jet FF and the jet shape display a clear evolution with χ jh , which can be taken as an indication of their potential to predict χ jh .

Figure 31. Histograms of jet n SD , R g and z g inclusive and sliced in χ jh and their ratio over vacuum.