Pileup Mitigation with Machine Learning (PUMML)

Pileup involves the contamination of the energy distribution arising from the primary collision of interest (leading vertex) by radiation from soft collisions (pileup). We develop a new technique for removing this contamination using machine learning and convolutional neural networks. The network takes as input the energy distribution of charged leading vertex particles, charged pileup particles, and all neutral particles and outputs the energy distribution of particles coming from leading vertex alone. The PUMML algorithm performs remarkably well at eliminating pileup distortion on a wide range of simple and complex jet observables. We test the robustness of the algorithm in a number of ways and discuss how the network can be trained directly on data.


Introduction
The Large Hadron Collider (LHC) is operated at very high instantaneous luminosities to achieve the large statistics required to search for exotic Standard Model (SM) or beyond the SM processes as well as for precision SM measurements. At a hadron collider, protons are grouped together in bunches; as the luminosity increases for a fixed bunch spacing, the number of protons within each bunch that collide inelastically increases as well. Most of these inelastic collisions are soft, with the protons dissolving into mostly low-energy pions that disperse throughout the detector. A typical collision of this sort at the LHC will contribute about 0.6 GeV/rad 2 of energy [1,2]. Occasionally, one pair of protons within a bunch crossing collides head-on, producing hard (high-energy) radiation of interest. At high luminosity, this hard collision, or leading vertex (LV), is always accompanied by soft proton-proton collisions called pileup. The data collected thus far by ATLAS and CMS have approximately 20 pileup collisions per bunch crossing on average ( NPU ∼ 20); the data in Run 3 are expected to contain NPU ∼ 80; and the HL-LHC in Runs 4-5 will have NPU ∼ 200. Mitigating the impact of this extra energy on physical observables is one of the biggest challenges for data analysis at the LHC.
Using precision measurements, the charged particles coming from the pileup interactions can mostly be traced to collision points (primary vertices) different from that of the leading vertex. Indeed, due the to the excellent vertex resolution at ATLAS and CMS [3][4][5] the charged particle tracks from pileup can almost completely be identified and removed. 1 This is the simplest pileup removal technique, called charged-hadron subtraction. The challenge with pileup removal is therefore how to distinguish neutral radiation associated with the hard collision from neutral pileup radiation. 2 Since radiation from pileup is fairly uniform 3 , it can be removed on average, for example, using the jet areas technique [9]. The jet areas technique focuses on correcting the overall energy of collimated sprays of particles known as jets. Indeed, both the ATLAS and CMS experiments apply jet areas or similar techniques to calibrate the energy of their jets [1,2,[10][11][12][13]. Unfortunately, for many measurements, such as those involving jet substructure or the full radiation patterns within the jet, removing the radiation on average is not enough.
Rather than calibrating only the energy or net 4-momentum of a jet, it is possible to correct the constituents of the jet. By removing the pileup contamination from each constituent, it should be possible to reconstruct more subtle jet observables. We can coarsely classify constituent pileup mitigation strategies into several categories: constituent preprocessing, jet/event grooming, subjet corrections, and constituent corrections. Grooming refers to algorithms that remove objects and corrections describe scale factors applied to individual objects. Both ATLAS and CMS perform preprocessing to all of their constituents before jet clustering. For ATLAS, pileup-dependent noise thresholds in topoclustering [14] suppresses low energy calorimeter deposits that are characteristic of pileup. In CMS, charged-hadron subtraction removes all of the pileup particle-flow candidates [15]. Jet grooming techniques are not necessarily designed to exclusively mitigate pileup but since they remove constituents or subjets in a jet (or event) that are soft and/or at wide angles to the jet axis, pileup particles are preferentially removed [6,[16][17][18][19][20][21]. Explicitly tagging and removing pileup subjets often performs comparably to algorithms without explicit pileup subjet removal [6]. A popular event-level grooming algorithm called SoftKiller [21] removes radiation below some cutoff on transverse momentum, p cut T chosen on an event-by-event basis so that half of a set of pileup-only patches are radiation free.
While grooming algorithms remove constituents and subjets, there are also techniques that try to reconstruct the exact energy distribution from the primary collision. One of the first such methods introduced was Jet Cleansing [22]. Cleansing works at the subjet level, clustering and declustering jets to correct each subjet separately based on its local energy information. Furthermore, Cleansing exploits the fact that the relative size of pileup fluctuations decreases as NPU → ∞ so that the neutral pileup-energy content of subjets can be estimated from the charged pileup-energy content. A series of related techniques operate on the constituents themselves [23][24][25]. One such technique called PUPPI also uses local charged track information but works at the particle level rather than subjet level. PUPPI 2 Charged-hadron subtraction follows a particle-flow technique that removes calorimeter energy from pileup tracks. Due to the calorimeter energy resolution, there will be a residual contribution from charged-hadron pileup. This contribution is ignored but could in principle be added to the neutral pileup contribution.
3 This work will not explicitly discuss identification of real high energy jets resulting from pileup collisions.
The ATLAS and CMS pileup jet identification techniques are documented in Ref. [6,7] and [8], respectively. computes a scale factor for each particle, using a local estimate inspired by the jets-withoutjets paradigm [26]. In this paper, we will be comparing our method to PUPPI and SoftKiller.
In this paper, we present a new approach to pileup removal based on machine learning. The basic idea is to view the energy distribution of particles as the intensity of pixels in an image [27]. Convolutional neural networks applied to jet images [28] have found widespread applications in both classification [28][29][30][31][32] and generation [33,34]. Previous jet-images applications have included boosted W -boson tagging [28][29][30], boosted top quark identification [31], and quark/gluon jet discrimination [32]. Most of these previous applications were all classification tasks: extracting a single binary classifier (quark or gluon, W jet or background jet, etc.) from a highly-correlated multidimensional input. The application to pileup removal is a more complicated regression task, as the output (a cleaned-up image) should be of similar dimensionality to the input. PUMML is among the first applications of modern machine learning tools to regression problems in high energy physics.
To apply the convolutional neural network paradigm to cleaning an image itself, we exploit the finer angular resolution of the tracking detectors relative to the calorimeters of ATLAS and CMS. Building on the use of multichannel inputs in [32], we give as input to our network three-channel jet images: one channel for the charged LV particles, one channel for the charged pileup particles, and one channel, at slightly lower resolution, for the total neutral particles. We then ask the network to reconstruct the unknown image for LV neutral particles. Thus our inputs are like those of Jet Cleansing but binned into a regular grid (as images) rather than single numbers for each subjet [22]. Further, the architecture is designed to be local (as with Cleansing or PUPPI), with the correction of a pixel only using information in a region around it. The details of our network architecture are described in Section 2. Section 3 documents its performance in comparison to other state-of-the-art techniques. The remainder of the paper contains some robustness checks and a discussion in Section 6 of the challenges and opportunities for this approach.

PUMML algorithm
The goal of the PUMML algorithm is to reconstruct the neutral leading vertex radiation from the charged leading vertex, charged pileup, and total neutral information. Since neutral particles do not have tracking information available, the challenge is to determine what fraction of the total neutral energy in each direction came from the leading vertex and what fraction came from pileup. To assist this discrimination, we take as inputs into our network the energy distribution of charged particles, separated into leading vertex and pileup contributions, in addition to the total neutral energy distribution 4 . A natural way to combine these observables is using the multichannel images approach introduced in [32] based on color-image recognition technology.
We apply this machine learning technique to R = 0.4 anti-k t jets. The jet image inputs are square grids in pseudorapidity-azimuth (η, φ) space of size 0.9×0.9 centered on the charged leading vertex transverse momentum (p T )-weighted centroid of the jet. One could combine all layers to determine the jet axis, but in practice the axis determined from the charged leading vertex captures dominates because of its superior angular resolution and pileup robustness. To simulate the detector resolutions of charged and neutral calorimeters, charged images are discretized into ∆η × ∆φ = 0.025 × 0.025 pixels and neutral images are discretized into ∆η × ∆φ = 0.1 × 0.1 pixels 5 . We use the following three input channels: Only charged particles with p T > 500 MeV were included in the green or blue channels. Charged particles not passing this charged reconstruction cut were treated as if they were neutral particles. Otherwise, the separation into channels is assumed perfect. No image normalization or standardization was applied to the jet images, allowing the network to make use of the overall transverse momentum scale in each pixel. The different resolutions for charged and neutral particles initially present a challenge, since standard architectures assume identical resolution for each color channel. To avoid this issue, we perform a direct upsampling of each neutral pixel to 4 × 4 pixels of size ∆η × ∆φ = 0.025 × 0.025 and divide each pixel value by 16 such that the total momentum in the image is unchanged.
In summary, the following processing was applied to produce the pileup images: 1. Center : Center the jet image by translating in (η, φ) so that the total charged leading vertex p T -weighted centroid pixel is at (η, φ) = (0, 0). This operation corresponds to rotating and boosting along the beam direction to center the jet.
3. Upsample: Upsample each neutral pixel to sixteen ∆η × ∆φ = 0.025 × 0.025 pixels, keeping the total transverse momentum in the image unchanged. Leading vertex neutral Inputs to NN 10 filters ×2 Figure 1: An illustration of the PUMML framework. The input is a three-channel image: blue/purple represents charged radiation from the leading vertex, green is charged pileup radiation, and yellow/orange/red is the total neutral radiation. The resolution of the charged images is higher than for the neutral one. These images are fed into a convolutional layer with several filters whose value at each pixel is a function of a patch around that pixel location in the input images. The output is an image combining the pixels of each filter to one output pixel.
The convolutional neural net architecture used in this study took as input 36 × 36 pixel, three-channel pileup images. Two convolutional layers, each with 10 filters of size 6 × 6 with 2 × 2 strides, were used after zero-padding the input images and first convolutional layer with a 2-pixel buffer on all sides. The output of the second layer has size 9 × 9 × 10, with the 9 × 9 part corresponding to the size of the target output and the 10 corresponding to the number of filters in the second layer. In order to project down to a 9 × 9 × 1 output, a third convolution layer with filter size 1 × 1 is used. This last 1 × 1 convolutional layer is a standard scheme for dimensionality reduction. A rectified linear unit (ReLU) activation function was applied at each stage. A schematic of the framework and architecture is shown in Fig. 1.
All neural network implementation and training was performed with the python deep learning libraries Keras [38] and Theano [39]. The dataset consisted of 56k pileup images, with a 90%/10% train/test split. He-uniform initialization [40] was used to initialize the model weights. The neural network was trained using the Adam [41] algorithm with a batch size of 50 over 25 epochs with a learning rate of 0.001. The choice of loss function implicitly determines a preference for accuracy on harder pixels or softer pixels. To that end, the loss function used to train PUMML was a modified per-pixel logarithmic squared loss: wherep is a hyperparameter that controls the choice between favoring all p T equally (p → ∞) or favoring soft pixels (p → 0). After mild optimization, a value ofp = 10 GeV was chosen, though the performance of the model as measured by correlations between reconstructed and true observables is relatively robust to this choice. PUMML was found to give good performance even with a standard loss function such as the mean squared error, which favors all p T equally. The PUMML architecture is local in that the rescaling of a neutral pixel is a function solely of the information in a patch in (η, φ)-space around that pixel. The size of this patch can be controlled by tuning the filter sizes and number of layers in the architecture. Further, due to weight-sharing in convolutional layers, the same function is applied for all pixels. Building this locality and translation invariance into the architecture ensures that the algorithm learns a universal pileup mitigation technique, while carrying the benefit of drastically reducing the number of model parameters. Indeed, the PUMML architecture used in this study has only 4,711 parameters, which is small on the scale of deep learning architectures, but serves to highlight the effectiveness of using modern machine learning techniques (such as convolutional layers) in high energy physics without necessarily using large or deep networks.
While we considered jets and jet images in this study, the PUMML architecture using convolutional nets readily generalizes to event-level applications. The locality of the algorithm implies that the trained model can be applied to any desired region of the event using only the surrounding pixels. To train the model on the event level, either the existing PUMML architecture could be generalized to larger inputs and outputs or the event could be sliced into smaller images and the model trained as in the present study. The parameters of the PUMML architecture are the convolutional filter sizes, the number of filters per layer, and the number of convolutional layers, which may be optimized for a specific application. Here, we have presented an architecture optimized for simplicity and performance for jet-level pileup subtraction. PUMML is designed to be applicable at both jet-and event-level.

Performance
To test the PUMML algorithm, we consider qq light-quark-initiated jets coming from the decay of a scalar with mass m φ = 500 GeV. Events were generated using Pythia 8.183 [42] with the default tune for pp collisions at √ s = 13 TeV. Pileup was generated by overlaying soft QCD processes onto each event. Final state particles except muons and neutrinos were kept. The events were clustered with FastJet 3.1.3 [43] using the anti-k t algorithm [44] with a jet radius of R = 0.4. A parton-level p T cut of 95 GeV was applied and up to two leading jets with p T > 100 GeV and η ∈ [−2.5, 2.5] were selected from each event. All particles were taken to be massless.
Samples were generated with the number of pileup vertices ranging from 0 to 180. Since the model must be trained to fix its parameters, the learned model depends on the pileup distribution used for training. For our pileup simulations, we trained on a Poisson distribution of NPUs with mean NPU = 140. For robustness studies, we also tried training with NPU= 140 for each event or NPU= 20 for each event. The average jet image inputs for this sample are shown in Fig. 2. For comparison, we show the performance of two powerful and widely used constituent-based pileup mitigation methods: PUPPI [23] and SoftKiller [21]. In both cases, default parameter values were used: R 0 = 0.3, R min = 0.02, w cut = 0.1, p cut T (NPU) = 0.1 + 0.007 × NPU (PUPPI), grid size = 0.4 (SoftKiller). Variations in the PUPPI parameters did not yield a large difference in performance. Both PUPPI and SoftKiller were implemented at the particle level and then discretized for comparison with PUMML. We show the action of the various pileup mitigation methods on a random selection of events in Fig. 3. On these examples, PUMML more effectively removes moderately soft energy deposits that are retained by PUPPI and SoftKiller.
To evaluate the performance of different pileup mitigation techniques, we compute several observables and compare the true values to the corrected values of the observables. To facilitate a comparison with PUMML, which outputs corrected neutral calorimeter cells rather than lists of particles, a detector discretization is applied to the true and reconstructed events. Our comparisons focus on the following six jet observables: • Jet Mass: Invariant mass of the leading jet.
• Dijet Mass: Invariant mass of the two leading jets.
• Jet Transverse Momentum: The total transverse momentum of the jet. . Different pixelizations are used for charged and neutral images to reflect the differences in calorimeter resolution. The charged and total neutral images comprise the three-channel input to the neural network, which is trained to predict the neutral leading vertex image.
• Neutral Image Activity, N 95 [45]: The number of neutral calorimeter cells which account for 95% of the total neutral transverse momentum.
• Energy Correlation Functions, ECF  : Depictions of three randomly chosen leading jets. Blue/purple represents charged radiation from the leading vertex, green is charged pileup radiation, and yellow/orange/red is the neutral radiation. Shown from left to right are the true neutral leading vertex particles, the event with pileup and charged leading vertex information, followed by the neutral leading vertex particles predicted by PUMML, PUPPI, and SoftKiller. From examining these events, it appears that PUMML has learned an effective pileup mitigation strategy.
in Fig. 5. To numerically explore the event-by-event effectiveness, we can look at the Pearson linear correlation coefficient between the true and corrected values or the interquartile range (IQR) of the percent errors. Table 1 summarizes the event-by-event correlation coefficients of the distributions shown in Fig. 4. Table 2 summarizes the IQRs of the distributions shown in Fig. 5. PUMML outperforms the other pileup mitigation techniques on both of these metrics, with improvements for jet substructure observables such as the jet mass and the energy correlation functions.

Robustness
It is important to verify that PUMML learns a pileup mitigation function which is not overly sensitive to the NPU distribution of its training sample. Robustness to the NPU on which it is trained would indicate that PUMML is learning a universal subtraction strategy.     and then tested on samples with different NPUs. Fig. 6 shows the jet mass correlation coefficients as a function of the test sample NPU. PUMML learns a strategy that is surprisingly performant outside of the NPU range on which it was trained. Further, we see that by this measure of performance, PUMML consistently outperforms both PUPPI and SoftKiller. A related robustness test is to probe how the performance of PUMML depends on the p T spectrum of the training sample. To explore this, we generated two large training samples (50k events): one with a scalar mass of 200 GeV and one with a scalar mass of 2 TeV; we did not impose any parton-level p T cuts on these samples. After training these two networks, we tested them on a set of samples generated from scalars with intermediate masses, from 300 GeV to 900 GeV. As can be seen in Fig. 7, the performance of PUMML is very robust to the p T distribution of the jets in the training sample: the networks trained on the 200 GeV resonance and the 2 TeV resonance have identical performance. The figure also shows that the performance of PUMML is less sensitive to of the p T of the testing sample than either PUPPI or Soft-Killer. This robustness test speaks to the PUMML algorithm's ability to learn universal aspects of pileup mitigation.
A number of modifications of PUMML were also tried. Locally connected layers were tried instead of convolutional layers and were found to perform worse due to a large increase in the number of parameters of the model, while losing the translation invariance that makes  PUMML is surprisingly performant well outside the NPU range on which it was trained and consistently outperforms PUPPI and SoftKiller. Note that PUMML trained on the lower NPU sample better reconstructs the jet mass in the low pileup regime. PUMML powerful. We tried training without various combinations of the input channels; the model was found to perform moderately worse without either of the charged channels but suffered severe degradation without the total neutral channel. We tried using simpler models with only one layer or fewer filters per layer. Remarkably, even with only a single layer and a single 4 × 4 filter (a model that has just 49 parameters), PUMML performed only moderately worse than the version presented in this study, which was allowed to be more complicated in order to achieve even better performance.  : Correlation coefficients between reconstructed and true jet masses plotted as a function of the mass of the scalar resonance with NPU=140. A spread in scalar resonances is generated in order to produce a range in jet transverse momenta. In order to assess the impact of the p T distribution used for training, one version of PUMML was trained with a scalar mass of 200 GeV (black) and one was trained with a mass of 2 TeV (gray). The two PUMML curves closely match one another.

What is PUMML learning?
While it is generally very difficult to determine what a network is learning, one possible probe is to examine the weights of the filter layers in the convolutional network. For our full network, these weights are complicated and the subtractor that the network learns is difficult to probe analytically. Instead, we trained a simplified PUMML network with a single 12 × 12 pixel filter, which spans 3 × 3 neutral pixels, with no bias term. The different channels of this filter are shown in Fig. 8. The neutral filter clearly selects the relevant neutral pixel for subtraction, while the charged pileup filter is approximately uniform (with the value dependent on the specific choice of loss function and activation function), and the charged leading vertex filter does not significantly contribute.
The filter values motivate the following parameterization of what PUMML is learning: This is reassuring as the learned subtractor is thereby robust in the NPU → 0 limit despite begin trained on NPU = 140.
Eq. (5.1) is remarkably similar to the physically-motivated formula used in Jet Cleansing [22]. Cleansing is built on the observation that since pileup is the incoherent sum of many separate scattering events, its variance is smaller than the variance of the radiation from the leading-vertex. Thus, it is better to estimate p N,PU charged reconstruction cut and β = 1.18 with the cut. These values are consistent with those predicted by Eq. (5.4) of 0.62 and 1.26, respectively.
On the other hand, if we take a mean squared error loss function: then the minimum occurs at: This still depends only on the pileup properties, as with Linear Cleansing, but also depends on correlations between neutral and charged radiation. For example, training the 12 × 12 PUMML filter without a ReLU or bias term using a mean squared error loss function, we find β = 0.56 with no charged reconstruction cut and β = 0.97 with the cut. These numbers are in general agreement (within 10 − 20%) with a direct evaluation of the right-hand side of Eq. (5.6). In the limit that neutral and charged pileup radiation are constant, Eq. (5.6) reduces to Eq. (5.4).
Whether the loss function of Eq. (5.4) or Eq. (5.6) (or something else entirely) is better is not simple to establish. The inclusion of the ReLU activation function further complicates the analysis since the model is equally penalized for all non-positive predictions. We find with the single 12 × 12 filter, using the loss function of Eq. (2.1) and including a ReLU and bias term, PUMML achieves a jet mass correlation coefficient of 90.4%. This is competitive with the values listed in Table. 1, as we might expect since Linear Cleansing has comparable performance to PUPPI and SoftKiller. The full network improves on Linear Cleansing by exploiting additional correlations that are hard to disentangle by looking at the filters.

Conclusions
In this paper, we have introduced the first application of machine learning to the critically important problem of pileup mitigation at hadron colliders. We have phrased the problem of pileup mitigation in the language of a machine learning regression problem. The method we introduced, PUMML, takes as input the transverse momentum distribution of charged leading-vertex, charged pileup, and all neutral particles, and outputs the corrected leading vertex neutral energy distribution. We demonstrated that PUMML works at least as well as, and often better than, the competing algorithms PUPPI and SoftKiller in their default implementations. It will be exciting to see these algorithms compared with a full detector simulation, where it will be possible to test the sensitivity to important experimental effects such as resolutions and inefficiencies.
There are several extensions and additional applications of the PUMML framework beyond the scope of this study. As mentioned in Section 2, PUMML can very naturally be extended from jet images to entire events. Applying this event-level PUMML to the problem of missing transverse energy would be a natural next step. While the filter sizes can be the same for the event and jet images, the network training will likely require modification. Furthermore, the inhomogeneity of the detector response with |η| will require attention. Another potentially useful modification to PUMML would be to train to predict the neutral pileup p T rather than the neutral leading vertex p T in order to increase out-of-sample robustness of the learned pileup mitigation algorithm. Additionally, using larger-R jets may be of interest, thereby necessitating a resizing of the local patch or other PUMML parameters, all of which is easily achieved.
An important consideration when using machine learning for particle physics applications is how the method can be used with data and whether or not the systematic uncertainties are under control. Unlike a purely physically-motivated algorithm, such as PUPPI or SoftKiller, machine learning runs the risk of being a "black-box" which can be difficult to understand. Nevertheless, machine learning is powerful, scaleable, and capable of complementing physical insight to solve complicated or otherwise intractable problems.
To prevent the model from learning simulation artifacts, it is preferable to train on actual data rather than simulation. In many machine learning applications in collider physics, obtaining truth-level training samples in data is a substantial challenge. To overcome this challenge in classification tasks, [47] introduces an approach to train from impure samples using class proportion information. For PUMML and pileup mitigation more broadly, a more direct method to train on data is possible. To simulate pileup, we overlay soft QCD events on top of a hard scattering process, both generated with Pythia. Experimentally, there are large samples of minimum bias and zero-bias (i.e. randomly triggered) data. There are also samples of relatively pileup-free events from low luminosity runs. Thus we can construct high-pileup samples using purely data. This kind of data overlay approach, which has already been used by experimental groups in other contexts [48,49], could be perfect for training PUMML with data. Therefore, an implementation of ML-based pileup mitigation in an actual experimental setting could avoid mis-modeling artifacts during training, thus adding more robustness and power to this new tool.