Learning to Isolate Muons

Distinguishing between prompt muons produced in heavy boson decay and muons produced in association with heavy-flavor jet production is an important task in analysis of collider physics data. We explore whether there is information available in calorimeter deposits that is not captured by the standard approach of isolation cones. We find that convolutional networks and particle-flow networks accessing the calorimeter cells surpass the performance of isolation cones, suggesting that the radial energy distribution and the angular structure of the calorimeter deposits surrounding the muon contain unused discrimination power. We assemble a small set of high-level observables which summarize the calorimeter information and close the performance gap with networks which analyze the calorimeter cells directly. These observables are theoretically well-defined and can be studied with collider data.


INTRODUCTION
Searches for new physics and precision tests of the Standard Model at hadron colliders have long relied on leptonic decays of heavy bosons, due to the relatively low background rates and excellent momentum resolution compared to hadronic final states. In the case of muons, the primary source of background to prompt muons (those from W, Z or other bosons) is production within a heavy-flavor jet. This non-prompt background is largest at lower values of muon transverse momentum, which has become important in searches for supersymmetry [1-3] as well as low-mass resonances [4].
The current state-of-the-art strategy for distinguishing prompt and non-prompt muons in experimental searches involves techniques which integrate information from multiple detector systems [5,6]. Critical to these strategies is the concept of isolation, which is sensitive to the presence of an associated jet that produces many tracks and calorimeter deposits. While the entire detector is worth studying [7], here we focus on the nature of the information available in the calorimeter. There, the traditional approach is to use a robust and simple method, measuring the isolation $I_\mu(R_0) = \sum_{i,\, R_i < R_0} p_T^{\mathrm{cell}\, i} / p_T^{\mathrm{muon}}$ within a cone $R = \sqrt{\Delta\phi^2 + \Delta\eta^2} < R_0$ surrounding the muon [8]. Typically a single cone is used, with values of $R_0$ in the 0.1-0.45 range. This approach relies on identifying a typical characteristic of the signal: low calorimeter activity in the vicinity of the muon.
The traditional strategy, however, focuses on the simple nature of the signal and may overlook the rich set of characteristics offered by the background object, which can provide handles for additional rejection power. Related work, which approaches similar object classification tasks as a background jet rejection problem, has shown significant improvement in background discrimination when applied to photons [9,10], pions [11] or electrons [12]. Other studies have shown that muons which fail the traditional isolation requirement can contain power to reveal new physics [13].
At the same time, there have been significant advances in machine learning techniques and their applications in physics [14,15], specifically in the context of jet classification tasks, which take a fuller view of the object by directly analyzing the low-level calorimeter energy deposits, representing them either as a type of image [16,17] or as a list [18].
It seems likely, therefore, that these machine learning strategies may identify the presence of significant additional calorimetric rejection power in the context of prompt muon identification. In this paper, we apply machine learning tools similar to those developed for jet calorimeter analysis to the task of distinguishing muons due to heavy boson decay from those produced within a heavy-flavor jet, analyze the nature of the information being used, and assemble a set of interpretable calorimeter features which capture that additional classification power.

APPROACH AND DATASET
The observable $I_\mu(R_0)$ is a powerful discriminator which reduces a large amount of information to a single high-level scalar. However, it is possible that it fails to capture the fullness of the calorimeter information available to distinguish prompt muons from those which are produced within a jet. To probe whether information has been lost, we compare the performance of deep neural networks which access the full calorimeter information to shallow networks which use one or more isolation cones.
Neural network decisions are notoriously difficult to reverse-engineer [19-23], especially when the dimensionality of the data is large, as is the case for networks which directly use the low-level calorimeter cells. Understanding the nature of the decisions is particularly vital when the training is done with simulated samples, as it leads to valid concerns about the application of such complex strategies to collider data.
In this study, our goal is not to develop deep networks for use in collider data. Instead, we apply these deep networks as a probe, to measure a loose upper bound on the possible classification performance, and provide insight into whether information has been lost in the reduction of the calorimeter cells to isolation cones.
Where information has been lost, we attempt to capture it, not by applying the deep network, but by assembling a small set of new high-level (HL) observables that bridge the performance gap and reproduce the classification decisions of the calorimeter cell networks [24]. These high-level observables are more compact, physically interpretable, can be validated in data, and allow for the straightforward assessment and propagation of systematic uncertainties.

Data generation
Samples of simulated prompt muons were generated via the process $pp \to Z \to \mu^+ \mu^-$ with a Z mass of 20 GeV. Non-prompt muons were generated via the process $pp \to b\bar{b}$. Both samples are generated at a center-of-mass energy $\sqrt{s} = 13$ TeV. Collisions and heavy boson decays are simulated with MadGraph5 v2.6.5 [25], showered and hadronized with Pythia v8.235 [26], and the detector response is simulated with Delphes v3.4.1 [27] using the standard ATLAS card and ROOT version 6.08/00 [28]. The classification of these objects is sensitive to the presence of additional proton interactions, referred to as pile-up events. We overlay such interactions within the simulation with an average number of interactions per event of µ = 50, as an estimate of LHC Run 2 experimental data.
Muons in the range $p_T \in [10, 15]$ GeV with |η| < 2.53 were considered; see Fig. 1. To avoid inducing biases from artifacts of the generation process, signal and background events are weighted such that the distributions in $p_T$ and η are uniform, using 32 bins in each dimension. Only events where a muon is identified as a track in the muon spectrometer are used. In total, 499,970 events were used, where 249,991 were signal and 249,979 were background. Both the signal and background datasets are randomly split as: 83% training, 8.5% validation, and 8.5% testing sets.
Calorimeter deposits can be represented as images where each pixel value represents the $E_T$ deposited by a particle [16]. Images are formed by considering cells in the calorimeter within a cone of radius up to ∆R = 0.45 surrounding the muon location after propagating to the radius of the calorimeter.
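The kinematic flattening described in this section can be illustrated with a short sketch. It assumes inverse-occupancy weights in a 32×32 histogram of (p_T, η), applied separately to the signal and background samples; the function name `flattening_weights` and the normalization are illustrative assumptions, not taken from the analysis code.

```python
from collections import Counter

def flattening_weights(pt, eta, pt_range=(10.0, 15.0), eta_range=(-2.53, 2.53), n_bins=32):
    """Return per-muon weights that make the binned (pT, eta) distribution uniform.

    Each muon is weighted by 1 / (occupancy of its 2D bin), so every populated
    bin carries the same total weight. Assumed to be applied per class.
    """
    def bin_index(value, lo, hi):
        # Clamp so that a value equal to the upper edge falls in the last bin.
        idx = int((value - lo) / (hi - lo) * n_bins)
        return min(max(idx, 0), n_bins - 1)

    bins = [(bin_index(p, *pt_range), bin_index(e, *eta_range)) for p, e in zip(pt, eta)]
    counts = Counter(bins)
    weights = [1.0 / counts[b] for b in bins]
    # Normalize so that the weights sum to the number of muons.
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]
```

With these weights, a bin holding three muons and a bin holding one muon contribute equally to any weighted loss or histogram.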
We use a 32×32 grid, which approximately corresponds with the calorimeter granularity of ATLAS and CMS. Heat maps of the calorimeter energy deposits in η-φ space for both signal prompt muons and background non-prompt muons are shown in Fig. 2. The signal calorimeter deposits are uniform and can be attributed to pile-up, whereas the background deposits appear largely radially symmetric with a dense core from the jet.
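As a rough illustration of how such a muon image could be built, the sketch below bins calorimeter cells into a 32×32 grid centered on the muon. The cell format `(et, eta, phi)`, the square grid spanning ±0.45 in each direction, and the function name `make_muon_image` are assumptions made for this example.

```python
import math

def make_muon_image(cells, muon_eta, muon_phi, half_width=0.45, n_pix=32):
    """Bin calorimeter cells into an n_pix x n_pix image centered on the muon.

    cells: iterable of (et, eta, phi). Cells at Delta R >= half_width are dropped.
    """
    image = [[0.0] * n_pix for _ in range(n_pix)]
    pix = 2.0 * half_width / n_pix
    for et, eta, phi in cells:
        deta = eta - muon_eta
        # Wrap the azimuthal difference into [-pi, pi).
        dphi = (phi - muon_phi + math.pi) % (2.0 * math.pi) - math.pi
        if math.hypot(deta, dphi) >= half_width:
            continue
        ix = min(int((deta + half_width) / pix), n_pix - 1)
        iy = min(int((dphi + half_width) / pix), n_pix - 1)
        image[ix][iy] += et
    return image
```

The azimuthal wrap matters for muons near φ = ±π, where a nearby cell can have a raw φ difference close to 2π.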
We calculate the standard muon isolation observable $I_\mu(R_0)$ for a set of cones with $0.025 \le R_0 \le 0.45$ in 18 equally spaced steps.
Crucially, these isolation observables and all other calorimeter observables are calculated directly from the pixels of the muon images, ensuring that they contain a strict subset of the information available. This allows for direct and revealing comparisons of the performance between networks trained with the images and those trained with $I_\mu$. Note that pixelization of the detector may incur some loss of information relative to the underlying segmentation of the calorimeter.
However, this work focuses on examining the relative power of different techniques, rather than identifying the best performance under the most realistic scenario.
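The calculation of isolation cones from image pixels can be sketched as follows. This assumes the usual relative definition of calorimeter isolation (summed pixel E_T inside the cone divided by the muon p_T), that each pixel is assigned the angular position of its center, and that the image spans ±0.45 around the muon; the names `isolation_from_image` and `CONE_SIZES` are illustrative.

```python
import math

# 18 equally spaced cone sizes from 0.025 to 0.45.
CONE_SIZES = [0.025 * k for k in range(1, 19)]

def isolation_from_image(image, muon_pt, r0, half_width=0.45):
    """Sum pixel E_T within Delta R < r0 of the image center, relative to muon pT."""
    n_pix = len(image)
    pix = 2.0 * half_width / n_pix
    total = 0.0
    for ix, row in enumerate(image):
        deta = -half_width + (ix + 0.5) * pix   # pixel-center offset in eta
        for iy, et in enumerate(row):
            dphi = -half_width + (iy + 0.5) * pix   # pixel-center offset in phi
            if math.hypot(deta, dphi) < r0:
                total += et
    return total / muon_pt

def isolation_cones(image, muon_pt):
    """Evaluate the isolation for every cone size in CONE_SIZES."""
    return [isolation_from_image(image, muon_pt, r0) for r0 in CONE_SIZES]
```

Because every cone is a sum over a subset of pixels, the resulting vector is by construction a strict reduction of the image, and it is non-decreasing in the cone size for non-negative deposits.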

NETWORKS AND PERFORMANCE
We apply several strategies to the task of classifying prompt and non-prompt muons, using both low-level calorimeter information and higher-level isolation quantities. We evaluate the performance of each approach by comparing the integral of the ROC (Receiver Operating Characteristic) curve, known as the AUC (Area Under the Curve). The uncertainty for the AUC is calculated by training 100 randomly initialized models with the same hyperparameters on different bootstraps of the data. In this case, we seek to determine the statistical uncertainty due to the stochastic training method, rather than any systematic uncertainty due to the calorimeter resolution.
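The AUC and its training-induced spread can be sketched as below. The AUC is computed with the standard rank-sum (Mann-Whitney) identity. The helper `auc_uncertainty` takes a hypothetical callback `train_and_score(indices, model_seed)` that is assumed to train one model on the given bootstrap indices and return its test AUC; that callback and all names here are illustrative placeholders, not the analysis code.

```python
import random
import statistics

def auc(scores, labels):
    """AUC as the probability that a signal event (label 1) outscores a background event (label 0)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores and give them their average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = 0.5 * (i + j) + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_sig * (n_sig + 1) / 2.0) / (n_sig * n_bkg)

def auc_uncertainty(train_and_score, n_events, n_models=100, seed=0):
    """Mean and standard deviation of the AUC over models trained on bootstraps."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_models):
        indices = [rng.randrange(n_events) for _ in range(n_events)]  # bootstrap resample
        aucs.append(train_and_score(indices, rng.randrange(2**31)))
    return statistics.mean(aucs), statistics.stdev(aucs)
```

The spread returned by `auc_uncertainty` reflects both the random initialization and the resampling of the training data, in line with the statistical uncertainty described above.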
For the high-level quantities, the standard approach of using a single isolation cone yields an AUC of 0.787 for the optimal cone size, $R_0 = 0.425$. We hypothesized that additional cones would provide useful information about the radial energy distribution. Including a second cone with a distinct $R_0$ value as input to a small neural network (see Appendix A) slightly improves performance, with an AUC of 0.793. To estimate the full information available in the cones, we perform a greedy search through all 18 cones; we find that a set of 10 cones yields another small boost in classification power up to an AUC of 0.803, as shown in Fig. 3. Performance was fairly insensitive to the specific choices of cone sizes, and does not grow significantly beyond 10 cones. Feed-forward dense networks are trained to use the information in one or more isolation cones (see the Appendix for details on network architectures and training).
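A greedy search of this kind can be sketched as a forward selection loop. The sketch assumes a user-supplied `score(subset)` callable, for example the validation AUC of a small network trained on that subset of cones; the stopping rule based on `min_gain` and the function name `greedy_select` are assumptions for illustration.

```python
def greedy_select(candidates, score, max_features=None, min_gain=0.0):
    """Forward selection: repeatedly add the candidate that most improves score(subset).

    Stops when no candidate improves the score by more than min_gain,
    or when max_features have been selected.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(trials, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break
        selected.append(top)
        remaining.remove(top)
        best_score = top_score
    return selected, best_score
```

Each step requires one training per remaining candidate, so a search over 18 cones needs at most 18 + 17 + ... evaluations rather than one per subset.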
We next examine whether additional information is available by applying strategies which access the calorimeter information at the lowest level and highest dimensionality. Convolutional networks (CNN) are applied to the muon images [15-17]. As an alternative, we apply particle-flow networks (PFN) [18], which are mathematically structured as sums over inputs and thus are invariant to permutations of the inputs.
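The permutation invariance of a sum-structured network can be seen in a toy sketch. It evaluates F(Σ_i Φ(cell_i)) with small, randomly initialized (untrained) dense layers; the layer sizes, the `(pt, eta, phi)` cell format, and all function names are assumptions for illustration and do not reproduce the architecture used in this study.

```python
import math
import random

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def make_dense(n_in, n_out, rng, activation):
    """Build a bias-free dense layer with random (untrained) weights."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    def layer(x):
        return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return layer

def pfn_forward(cells, per_cell, backend, latent_dim):
    """Evaluate F(sum_i Phi(cell_i)); summing makes the result order-independent."""
    latent = [0.0] * latent_dim
    for cell in cells:
        latent = [a + b for a, b in zip(latent, per_cell(cell))]
    return backend(latent)[0]

rng = random.Random(1)
LATENT_DIM = 8
per_cell = make_dense(3, LATENT_DIM, rng, relu)    # Phi: (pt, eta, phi) -> latent vector
backend = make_dense(LATENT_DIM, 1, rng, sigmoid)  # F: latent vector -> probability
```

Reordering the list of cells changes only the order of the additions, so the output is unchanged up to floating-point rounding.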
The muon image CNN achieves a significantly higher performance than the isolation-only networks, with an AUC of 0.841, and the particle-flow network reaches 0.857; see Fig. 4 and Table I. This immediately suggests that there is significant additional information available to distinguish between the prompt and non-prompt muons beyond what is summarized in the isolation cones. A more restricted version of the PFN, an Energy-Flow Network [18] (EFN), which enforces infrared and collinear (IRC) safety, achieves nearly the same performance, 0.849. This suggests that most of the additional information beyond the isolation cones is IRC-safe.
These results support the conventional wisdom that a significant fraction of the information relevant for classification is captured by a single, simple cone. They also indicate that there is additional information in the radial distribution of energy, which can be captured by using multiple cones. However, even many cones fail to match the performance of the networks which use the calorimeter cell information directly, suggesting that there is additional non-radial information relevant to the classification task not captured in the isolation cones. This is likely due to a difference between the muon axis, which is the center of the isolation cones, and the jet axis.

ANALYSIS
The networks which use the calorimeter cells directly have the most powerful performance, but our aim is not simply to optimize classification performance in this particular simulated sample. Instead, we seek to understand the nature of the learned strategy in order to validate it and translate it into simpler, more easily interpretable high-level features which can be studied in other datasets, real or simulated. In addition, this understanding can reveal how well the strategy is likely to generalize to other kinds of jets that are not represented by this background sample, such as charm jets.
The CNN and PFN results indicate that the radially symmetric isolation cones are failing to utilize some information which is relevant to the classification task. In this section, we search for additional high-level observables which capture this information.

Search Strategy
Interpreting the decisions of a deep network with a high-dimensional input vector is notoriously difficult. Instead, we attempt to translate its performance into a smaller set of interpretable observables [24]. This allows us to understand the nature of the information being used as well as to represent it more compactly.
One might imagine exploring a set of physically motivated quantities, such as the relative $p_T$ between the jet and the muon or the energy-weighted average distance between the jet and calorimeter cells. These particular quantities were considered and found not to contribute significant power in addition to the isolation cones.
Instead, we use a systematic approach and explore a formally complete set of observables. As the background non-prompt muons are due to jet production, we search within a set of observables originally intended for analysis of jets: the Energy Flow Polynomials (EFPs) [29], a formally infinite set of parameterized engineered functions, inspired by previous work on energy correlation functions [30], which sum over the contents of the cells scaled by relative angular distances. An EFP for a jet with M constituents which considers N correlators with angular connections (k, l) in a multigraph G is written as:
$$\mathrm{EFP}_G = \sum_{i_1=1}^{M} \cdots \sum_{i_N=1}^{M} z_{i_1} \cdots z_{i_N} \prod_{(k,l) \in G} \theta_{i_k i_l}, \qquad z_i = \left( \frac{p_{Ti}}{\sum_j p_{Tj}} \right)^{\kappa}, \qquad \theta_{ij} = \left( \Delta\eta_{ij}^2 + \Delta\phi_{ij}^2 \right)^{\beta/2}.$$
Here, $p_{Ti}$ is the transverse momentum of cell i, and $\Delta\eta_{ij}$ ($\Delta\phi_{ij}$) is the pseudorapidity (azimuth) difference between cells i and j. These parametric sums correspond to the set of all non-isomorphic multigraphs where: each node ⇒ $\sum_i z_i$, each k-fold edge ⇒ $(\theta_{ij})^k$.
As the EFPs are normalized, they capture only the relative information about the energy deposition. For this reason, in each network that includes EFP observables, we include as an additional input the sum of $p_T$ over all cells, to indicate the overall scale of the energy deposition. The original IRC-safe EFPs require κ = 1. To explore a broader space of observables, we also consider examples with κ ≠ 1.
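To make the graph-to-observable correspondence concrete, the following sketch evaluates an EFP by brute force from a list of cells. It assumes momentum fractions z_i = (p_Ti / Σ_j p_Tj)^κ and angular factors θ_ij = (Δη_ij² + Δφ_ij²)^(β/2); the function name `efp` and the edge-list graph encoding are illustrative. The cost grows as M^N, so this is only suitable for small examples, and cells with zero p_T should be removed beforehand when κ ≤ 0.

```python
import math
from itertools import product

def efp(cells, edges, kappa=1.0, beta=1.0):
    """Brute-force Energy Flow Polynomial for one multigraph.

    cells: list of (pt, eta, phi) with pt > 0.
    edges: list of (k, l) node-index pairs; repeat a pair for a multi-edge.
           An empty list denotes the single-node graph.
    """
    n_nodes = 1 + max((max(e) for e in edges), default=0)
    pt_sum = sum(pt for pt, _, _ in cells)
    z = [(pt / pt_sum) ** kappa for pt, _, _ in cells]

    def theta(i, j):
        deta = cells[i][1] - cells[j][1]
        dphi = (cells[i][2] - cells[j][2] + math.pi) % (2.0 * math.pi) - math.pi
        return (deta * deta + dphi * dphi) ** (beta / 2.0)

    total = 0.0
    # One summation index per graph node, each running over all cells.
    for idx in product(range(len(cells)), repeat=n_nodes):
        term = 1.0
        for node in idx:
            term *= z[node]
        for k, l in edges:
            term *= theta(idx[k], idx[l])
        total += term
    return total
```

For example, `efp(cells, [(0, 1)])` is the two-point correlator Σ z_a z_b θ_ab, and `efp(cells, [(0, 1), (1, 2), (2, 0)])` is the triangle Σ z_a z_b z_c θ_ab θ_bc θ_ca, which vanishes unless at least three distinct cells are populated.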
In principle, the space spanned by the EFPs is complete, such that any jet observable can be described by one or more EFPs of some degree. One might consider simply searching this space for all possible combinations of EFPs for a set which maximizes performance for this task. Such a search is computationally prohibitive; instead, we follow the black-box guided algorithm of Ref. [24], which iteratively assembles a set of EFPs that mimic the decisions of another guiding network (the PFN in our case) by isolating the portion of the input space where the guiding network disagrees with the isolation network, and finding EFPs which mimic the guiding network's decisions in that subspace.
Here, the agreement between networks f(x) and g(x) is evaluated over pairs (x, x') by comparing their relative classification decisions, expressed mathematically as:
$$\mathrm{DO}[f, g](x, x') = \Theta\big( (f(x) - f(x'))\,(g(x) - g(x')) \big),$$
where Θ is the Heaviside step function, and referred to as decision ordering (DO). DO = 0 corresponds to inverted decisions over all input pairs and DO = 1 corresponds to the same decision ordering. As prescribed in Ref. [24], we scan the space of EFPs to find the observable that has the highest average decision ordering (ADO) with the guiding network when averaged over disordered pairs. The selected EFP is then incorporated into the new network of HL features, HLN$_{n+1}$, and the process is repeated until the ADO plateaus.
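The decision-ordering comparison can be sketched as follows, under the assumptions that each pair consists of one signal and one background event, that DO is the step function of the product of score differences, and that exact ties count as disagreement. The function names and the score-list interface are illustrative.

```python
def decision_ordering(f_sig, f_bkg, g_sig, g_bkg):
    """DO for one signal/background pair: 1 if f and g rank the pair the same way."""
    return 1.0 if (f_sig - f_bkg) * (g_sig - g_bkg) > 0 else 0.0

def average_decision_ordering(f_sig, f_bkg, g_sig, g_bkg):
    """ADO over all signal/background pairs.

    f_sig[i] and g_sig[i] are the two classifiers' outputs on signal event i;
    f_bkg[j] and g_bkg[j] are their outputs on background event j.
    """
    total = 0.0
    for fs, gs in zip(f_sig, g_sig):
        for fb, gb in zip(f_bkg, g_bkg):
            total += decision_ordering(fs, fb, gs, gb)
    return total / (len(f_sig) * len(f_bkg))

def disordered_pairs(f_sig, f_bkg, g_sig, g_bkg):
    """Index pairs (i, j) on which f and g order signal i and background j differently."""
    return [(i, j)
            for i in range(len(f_sig))
            for j in range(len(f_bkg))
            if decision_ordering(f_sig[i], f_bkg[j], g_sig[i], g_bkg[j]) == 0.0]
```

In a guided search, `disordered_pairs` would be evaluated between the guiding network and the current high-level network, and each candidate EFP would then be scored by its ADO with the guiding network restricted to those pairs.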

IRC Safe Observables
As the elements of the EFP space are not orthogonal, there are potentially many combinations of EFP observables which capture the relevant information. As simpler EFPs may be more conducive to theoretical interpretation, we begin our search in a restricted subset of the EFP space. Specifically, we consider those which are IRC safe (κ = 1), have a simple angular weighting (β ∈ {1, 2}), and have n ≤ 3 nodes with at most three edges between nodes. We also include $\sum p_T$, where the summation is over all calorimeter cells in the image, to set the scale accompanying the normalized EFPs. The first EFP observable identified is a simple three-point correlator, $\sum_{a,b,c} z_a z_b z_c\, \theta_{ab} \theta_{bc} \theta_{ca}$, which, when combined with the isolation cones and $\sum p_T$, yields an AUC of 0.838 and an ADO with the CNN of 0.891, a significant boost relative to just using the radial information of the isolation cones. The subsequent scans produce variants of this observable: $\sum_{a,b} z_a z_b\, \theta_{ab}$ with additional edges corresponding to higher powers of the angular information. Their power may come from their sensitivity to the collimated radiation pattern of the jet. Together with the isolation cones, these observables reach an AUC of 0.842 and an ADO with the PFN of 0.888; see Table I. This set of observables partially closes the performance gap with the best calorimeter cell networks, indicating that angular information is relevant to the muon isolation classification task, but fails to fully match their performance. Distributions of these EFPs for signal and background are shown in Fig. 5. Further scans in this limited space do not yield a significant boost in AUC or ADO values. The strong result of the IRC-safe EFN indicates that it is possible to capture nearly all of the classification power using IRC-safe graphs, likely requiring graphs with complexity beyond what we have considered.
A scan guided by the CNN rather than the PFN yields very similar results, with identical choices for the first three EFPs.

IRC-unsafe Observables
To understand the nature of the remaining information used by the PFN but not captured by the isolation cones and the IRC-safe observables, we expand the search space to include observables which are not IRC safe (κ ∈ {−1, 0, 1/4, 1/2, 1, 2}), with alternative angular powers (β ∈ {1/4, 1/2, 1, 2, 3, 4}) and with up to n = 7 nodes and d = 7 edges. A scan of these observables finds a set of 5 which, when combined with the isolation cones and $\sum p_T$, reach an AUC of 0.857. Figure 6 shows the EFP graphs as well as their distributions for prompt and non-prompt muons. They include single-point graphs, with no angular powers, as well as two-point correlators with large angular power sensitive to high-angle effects, and more complex graphs with multiple nodes. We note that due to the overlapping nature of the large space of EFPs, there are several sets of EFPs which achieve similar performance. Again, a similar scan guided by the CNN rather than the

DISCUSSION
The performance of the networks which use the low-level calorimeter cells indicates that information exists in these cells which is not captured by the isolation cones; see Table I. A guided search through the space of IRC-safe EFPs closes most of the gap between these networks, giving us some insight as to the nature of the information. A broader search is able to complete the bridge, yielding the same performance as the low-level network, but employing IRC-unsafe EFPs. The multi-point correlators may be sensitive to the width of the jet, due to the momentum of the constituents relative to the jet axis, as a result of b- and c-quark decays.
A comparison of the network complexity for the various approaches is shown in Table I. The set of high-level features (isolation cones and EFP graphs) matches the PFN performance with 10 times fewer parameters, supporting the notion that the high-level features are effectively summarizing the relevant low-level information.

CONCLUSIONS
We have applied deep networks to low-level calorimeter deposits surrounding prompt and non-prompt muons in order to estimate the amount of classification power available and to probe whether the standard methods are fully capturing the relevant information.
The performance of the calorimeter cell networks significantly exceeds the benchmark approach, a single isolation cone. The use of several isolation cones provides some improvement, suggesting that there is additional useful information in the full radial energy distribution. However, a substantial gap remains, hinting that there is non-radial structure in the calorimeter cells which provides useful information for classification. We map the strategy of the calorimeter cell networks into a set of energy flow polynomials, finding four IRC-safe, simple three-point correlators which capture a significant amount of the missing information. As they are simple functions of the energy deposition, they can be physically interpreted, and the fidelity of their modeling can be studied in control regions in collider data. Any boost in the efficiency to identify prompt muons is extremely valuable to searches at the LHC, especially those with multiple leptons, where event-level efficiencies depend sensitively on object-level efficiencies.
Additional, more complex EFPs provide a further modest boost in performance, closing the gap with the PFN. The strong performance of the IRC-safe EFN suggests that most of the additional information beyond the isolation cones is IRC-safe.
More broadly, the existence of a gap between the performance of state-of-the-art high-level features and networks using lower-level calorimeter information represents an opportunity to gather additional power in the battle to suppress lepton backgrounds. Rather than employing black-box deep networks directly, we have demonstrated the power of using them to identify the relevant observables from a large list of physically interpretable options. This allows the physicist to understand the nature of the information being used and to assess its systematic uncertainty. Here we have focused on two-dimensional projections of the calorimeter response, but longitudinal information expressed in three dimensions may offer additional power in future work. While these studies were performed with simulated samples, similar studies can be performed using unsupervised methods [31,32] on samples of collider data, which we leave to future studies.

ACKNOWLEDGEMENTS
We would like to thank Michael Fenton, Dan Guest and Jesse Thaler for providing valuable feedback and insightful comments and Yuzo Kanomata for computing support. We also wish to acknowledge a hardware grant from NVIDIA. This material is based upon work supported by the National Science Foundation under grant number 1633631. DW is supported by the DOE Office of Science. The work of JC and PB is in part supported by grants NSF 1839429 and NSF NRT 1633631 to PB.
vs background. The model had dropout [39,40] with value 0.2388 on the fully connected layers, an initial learning rate of 0.0003, and a batch size of 128.
The Particle Flow Network (PFN) is trained using the energyflow package [41]. Input features are taken from the muon image pixels and preprocessed by subtracting the mean and dividing by the variance. The PFN uses 3 dense layers in the per-particle frontend module and 3 dense layers in the backend module. Each layer uses 100 nodes, ReLU activation and the glorot_normal initializer. The final output layer uses a sigmoidal logistic activation function to predict the probability of signal or background. The network is trained with the Adam optimizer, a learning rate of 0.0001, and a batch size of 128.

D. Isolation Cone and EFP Networks
The isolation inputs and EFPs are preprocessed by subtracting the mean and dividing by the variance. We trained neural networks with two to eight fully connected hidden layers, depending on the hyperparameter value, and a final layer with a sigmoidal logistic activation function to predict the probability of signal or background.
For the minimal set of isolation inputs, the best model we found had 2 fully connected layers with 197 rectified linear hidden units [38], a learning rate of 0.0003, and a dropout rate of 0.0547.

ADO comparison
In Fig. 7, the ADO between the various networks is shown.