Machine learning templates for QCD factorization in the search for physics beyond the standard model

High-multiplicity all-hadronic final states are an important, but difficult final state for searching for physics beyond the Standard Model. A powerful search method is to look for large jets with accidental substructure due to multiple hard partons falling within a single jet. One way for estimating the background in this search is to exploit an approximate factorization in quantum chromodynamics whereby the jet mass distribution is determined only by its kinematic properties. Traditionally, this approach has been executed using histograms constructed in a background-rich region. We propose a new approach based on Generative Adversarial Networks (GANs). These neural network approaches are naturally unbinned and can be readily conditioned on multiple jet properties. In addition to using vanilla GANs for this purpose, a modification to the traditional WGAN approach has been investigated where weight clipping is replaced by drawing weights from a naturally compact set (in this case, the circle). Both the vanilla and modified WGAN approaches significantly outperform the histogram method, especially when modeling the dependence on features not used in the histogram construction. These results can be useful for enhancing the sensitivity of LHC searches to high-multiplicity final states involving many quarks and gluons and serve as a useful benchmark where GANs may have immediate benefit to the HEP community.


Introduction
Even though collimated sprays of particles (jets) produced from high energy quarks and gluons are ubiquitous at the Large Hadron Collider (LHC), analyzing their substructure has proven to be a powerful tool in the search for physics beyond the Standard Model (SM) [1,2].Many theories of physics beyond the SM predict new particles with cascade decays that can result in large multiplicity final states.When many quarks and gluons are produced in these cascades, multiple large-radius jets with non-trivial substructure can be created [3][4][5].As a result, one powerful method for searching for new particles in the all-hadronic channel is to look for events with a large j∈J m j , where m j is the jet mass and J is a set of jets in an event.The key challenge for such an analysis is to estimate the SM background, as high multiplicity multi-jet final states are difficult to accurately predict with current simulation tools.
Based on the approximate factorization of quantum chromodynamic (QCD) jet production at the LHC [6], the authors of Ref. [7] proposed an innovative background estimation technique.The idea of the procedure is to estimate the conditional probability p(m j |jet kinematics) with an event selection suppressed in signal and then to convolve it with the jet kinematic spectrum in the signal region (from data).A comparison between the predicted m j and observed m j is then sensitive to the presence of new particles.The ATLAS collaboration has successfully applied this method in both Run 1 and Run 2 to set strong limits on potential gluino and squark production [8,9].
The background estimation procedure described above has two major limitations 1 .First, p(m j |jet kinematics) is represented as a histogram and each bin is uncorrelated so many events are needed for a precise determination.Second, there are physics and detector effects which change the distribution of the jet mass between the region it is constructed ('trained') and the region where it is applied ('tested').For example, the quark/gluon composition of the background may be different between the two regions.One way to mitigate this source of method bias is to condition on more features of the jet when constructing the conditional probability.To reduce the impact of changes in quark/gluon composition, one could add the number of charged-particle tracks inside the jet.Gluon jets tend to have more particles than quark jets due to their larger color factor.Since the templates p(m j |jet kinematics) are binned, it is not simple to condition on more features as one needs many more bins and thus larger samples of events for training.
This paper proposes a solution to both of these limitations using modern machine learning.Deep neural networks are becoming popular tools for classification and regression tasks in high energy physics (HEP) data analysis, but there is a growing machine learning literature on neural network-based generative models as well.Training a generative model can be viewed as a regression task that maps noise to structure, mimicking the Jacobian from a pre-defined probability distribution to a target probability distribution.One of the most well-studied paradigms for such models is the Generative Adversarial Network (GAN) [10] (details in Sec. 2).GANs have also been studied in HEP and show great promise for accelerating simulations [11][12][13][14][15][16][17][18][19][20][21] and may also be useful for other tasks such as sampling from the space of effective field theory models [22].In the context of QCD factorization studied in this paper, the GAN will learn the probability distribution of the jet mass given the jet kinematics and any other useful information.
This paper is organized as follows.Section 2 introduces GANs and how they can be used to exploit QCD factorization.The application of GANs to the phase space relevant to the Supersymmetry (SUSY) search from Ref. [7][8][9] is presented in Sec. 3. The paper ends with conclusions and outlook in Sec. 4.

Overview of GAN setup
The goal of this section is to introduce an approach to learn the conditional distribution of the jet mass given various kinematic features.Neural networks are chosen due to their flexibility and the resulting algorithms are naturally unbinned.There are multiple neural network-based approaches to generative modeling such as Variational Autoencoders (VAE) [23,24], Mixture Density Networks (MDN) [25], and Generative Adversarial Networks (GAN) [10].GANs are selected because they can readily model asymmetric distributions and accommodate conditional features.
Generative Adversarial Network training uses a pair of neural networks: one to map noise into structure (generator) and one to classify (discriminator) physics-based examples from the generator examples.The generator is structured as a densely connected feed-forward neural network that takes as input both jet kinematic features and noise and outputs a jet mass.The generated masses are then input to the discriminator network where they are compared against masses from a physics-based generator that match the kinematic quantities.The two networks 'compete' until the discriminator network is as bad as possible, which means that the generator is proficient at modeling the conditional jet mass distribution.This minimax structure uses the loss function described in Ref. [10], constructed to minimize the Jensen-Shannon divergence between the distribution of the real data (in this case, a physics-based simulator) and the distribution of the generated data.A schematic of this GAN is shown in Fig. 1.
The GAN networks for the jet mass are relatively small compared to others in the literature, which are mostly used for modeling image data.For the vanilla GAN implementation, both the Generator and Discriminator networks are composed of three hidden layers between input (kinetic variables and noise/mass for generator/discriminator respectively) and output (generated mass/likelihood for generator/discriminator respectively).The first layer has 160 neurons, the second has 80 neurons, and the last hidden layer has 40 neurons.Additionally, the generator network is made more robust by adding 50% dropout [26].Network weights were chosen using the Adam optimizer [27] with early stopping.These settings were chosen after a modest hyper-parameter scan.
All of the neural networks are built using Tensorflow [28] on Nvidia GeForce 1080 Ti GPUs.Since the jet mass distribution is skewed toward the high mass tail, the input noise distribution was varied according to a skew-normal distribution [29]: where α is a hyper-parameter and φ and Φ are the probability density and cumulative distribution of the standard normal distribution, respectively.The best value of the skew parameter was identified to be α = 5 (labeled skew in the results plots).This will be compared to the standard no-skew case (α = 0; labeled noskew).

Modified WGAN
As an alternative to the vanilla GAN described in Sec.2.1, a popular variant called the Wasserstein-GAN (WGAN) [30] was also studied (for another HEP application, see Ref. [17]).The WGAN differs from the vanilla GAN in that it minimizes the Earth-Mover distance (also known as the Wasserstein Distance): where γ ∈ Π(P real , P generated ) is a joint distribution with marginal distributions P real , P generated respectively, and (x, y) ∼ γ means that the random variable (x, y) is drawn from the distribution γ.This 'softer' metric was introduced as a way to combat the vanishing gradients that often occured when training regular GANs [30,31].In the algorithm suggested by the WGAN paper to minimise the Earth-Mover distance, the discriminator is a function f that is taken from a space of trial functions that are all K-Lipshitz for some K.To enforce such conditions, the functions f are constructed as feed-forward neural networks with weights w that are clipped (after every update) to a compact space [−α, α] for some fixed α which enforces K-lipshitz for some K.The generator network is a feed-forward network that takes as input particles with their mass replaced by noise, and generates mass according to a learnt distribution.These fake particles are bundled with real particles and passed to the discriminator, which learns to discriminate between the real and fake distributions.

Particles
An exact implementation of the WGAN approach resulted in weights that would often aggregate around the specific limit values α chosen, leading to vanishing gradients and generated mass distributions that did not match the physical mass distributions.The challenges surrounding the clipping operation to enforce the Lipshitz condition are discussed in the original WGAN paper, and explored in variations of WGAN where the clipping is replaced by gradient penalty [32] or asymmetric clipping [33].
In this paper, a modification to weight clipping is introduced that changes how the weights act on the outputs of neurons to enforce the Lipshitz condition.A schematic for this modification to the WGAN is shown in Fig. 2. The original weight-clipping operation enforces a Lipshitz function because the composition of Lipshitz functions is still a Lipshitz function; in particular, there is the weak assumption that all the activation functions used in the neural networks are themselves Lipshitz (true for the most popular activation functions including sigmoid, tanh, and ReLU).Then, the fact that there is a single constant K for which all the functions f that we are considering are K-Lipshitz is due to the weight-clipping -this restricts the function space to be compact.
Another natural way to enforce the Lipshitz constraint that eliminates boundary effects is to draw weights from a compact space with no boundaries.One way to achieve this is to draw the weights and biases from the unit circle 2 : where x m represents the output of the m-th neuron in the k-th layer of the neural network and ψ is the activation function.The bias is replaced by b i = e iφn and the weights are replaced by w nm = e iθ (k) nm .

Overview of RPV-SUSY and the Template Method
Supersymmetry (SUSY) [35][36][37][38][39][40] is a well-studied extension of the Standard Model in which there is a new symmetry relating fermions and bosons.In models of SUSY that conserve R-parity [41][42][43][44][45] -an additional symmetry that requires SUSY particles to be produced in pairs -collider signatures often feature large missing transverse momentum (MET) carried by the lightest supersymmetric particle (LSP) which must be neutral under the electromagnetic 2 In principle, one could generalize this idea to modify the WGAN where all weights and biases are drawn from a generic compact Lie Group.The outputs of the neurons exist in a vector space with the Lie Group acting by a chosen representation.In this framework, a normal linear neural network uses the Lie Group R with canonical action on itself, and the modified WGAN shown in Eq. 2.2 uses U (1) ≈ S 1 .The point of this construction is that the compactness of the weight-space assures the convergence of the WGAN and is well-formed because Lie Group actions can be differentiated.Such applications of Lie Groups to Machine Learning have been explored elsewhere in the literature, for example in applications to 3D-classification problems [34].

Figure 2.
A schematic diagram to illstrate the modified WGAN, where all the weights and biases are angles.The compactness of the unit circle then guarantees that the trained functions are all K-lipschitz for sufficiently large K.The inputs to the layer are x 1 , ..., x n and there is a hidden layer with two nodes y 1 and y 2 (in general, there can be many more than two) followed by a single output layer z 1 .
The equation shows the form of the activation that maps x 1 , ..., x n to y 1 .
and strong forces.However, R-parity is not present in all SUSY models and collider-based limits relying on MET are typically insensitive to such models.One R-parity violating (RPV) coupling (often denoted λ ) results in gluino/neutralino decay into three quarks, as illustrated in Fig. 3.When this coupling is present, traditional searches for SUSY are largely ineffective because there can be little MET and no charged leptons.Despite lacking standard handles for separating potential SUSY events from SM background processes, RPV signatures like those in Fig. 3, all-hadronic SUSY searches have been able to set strong limits on gluino and squark production in models with a large λ (see references within Ref [8,9] for other constraints on such models).One approach to search for all-hadronic SUSY decays like those in Fig. 3 is to exploit the high-multiplicity of hard, well-separated partons in the final state.In the recent ATLAS analysis [9], the main kinematic observable used in the search for λ RPV decays is the total jet mass M J Σ defined as the sum of the masses of the four leading large R = 1.0 jets.Multiple jets with a large mass are generated from well-separated high-energy partons that happen to be clustered within a single jet.The SM multijet background is estimated by a data-driven method [7], whereby mass templates are constructed from control regions containing a low purity of potential signal.These templates are histograms that model the dependence of QCD mass on jet kinematic properties (p T and η).If the jets in the signal region are only due to QCD, then the mass distribution can be estimated by convolving the jet kinematics with the mass templates.This estimate is compared with the actual mass distribution and deviations would be an indication of BSM physics.q q χ0 χ0 g g q q q q q q q q q q Figure 3. Schematic Feynman-like diagram for RPV SUSY.
The goal of this study is to demonstrate that GANs may be a useful alternative method to simple histogram templates for learning the dependence of the jet mass on jet properties. Region Table 1.Phasespace requirements for the different regions considered..

Simulation Setup
To demonstrate the potential of GAN-based templates, pp → jets at √ s = 13 TeV are generated with Pythia 8.223 [46] using the fast detector simulation Delphes 3.4.0[47] with the detector card delphes_card_ATLAS.tcl.Following the ATLAS analysis strategy [9], events are clustered [48] into jets using the anti-k t algorithm [49] with radius parameter R = 1.These jets are groomed according to the trimming procedure [50] where subjets with radius R = 0.2 are created from the large-radius jet constituents and removed if their transverse momentum is below 5% of the parent jet's p T .The remaining large-radius jets are only considered if p T > 200 GeV and |η| < 2.0 and are divided into control, validation and signal regions according to Table 1.The control region is used to construct the templates (train the GAN) and the signal region is where they are applied 3 .The validation region is expected to be sufficiently devoid of potential signals that it can be used to study the fidelity of the templates.Even though the Pythia is only leading order in the strong coupling constant, Ref. [9] used the same setup and found a good agreement with data so this setup is sensible for testing new methods.
As a baseline, mass templates (histograms) are constructed following the ATLAS study: jets in the control region divided into 4 |η| bins defined uniformly between 0 and 2, and 15 p T bins defined uniformly in log 10 (p T ).This results in 60 mass histograms in total.No b-tagging is applied, though future extensions to include flavor tagging information are possible.Using the mass histograms, jets in the validation and signal regions can be dressed with random masses given their p T and η.

Machine Learning Results
After the selections described in the previous section, there were 1.1 million jets in the control region and 30k in the validation region.The validation region is used to test the efficacy of the neural network training, filling the role of the 'test set' and by construction is independent from the training set.The jets in the control region are split 50%-50% for the purpose of training the neural network and 'validating' the network to enforce the early stopping condition.
The accuracy of the generated mass templates was quantified 4 using the separation power metric [51,52]: where P 1 , P 2 are probability distributions over a space X and Eq.3.1 is normalized to be between 0 and 1.For the RPV SUSY case, X R p T × R η × R N × R m , one real line for each of the jet properties p T , η, constituent track multiplicity 5 (N ), and m.The P i are the real and generated distributions: where G(m|p T , η, N ) is the learned mass distribution by the GAN for a given p T , η, N value, and 3 explicitly encodes the QCD factorization of the jet kinematics and the mass given those kinematic properties.Neither the GAN or physics-based simulator provide P directly; instead, only examples are drawn from the distributions.Empirical distributions are constructed from samples and the separation power is approximated by first binning the jets into eight regions of equal statistics in the kinematic variables p T , η, N , by splitting the events into two collections based on the jet p T , then splitting each of these collections evenly in η, and then in N .In each of these eight bins, the jet mass distribution is used to calculate the (binned) separation power for each dataset and then all eight sets 6 are combined.
Figure 4 shows the separation power for various generative models as a function of the number of epochs used to train.GAN models that were initialized with a high separation error (above 0.6) training much slower due to vanishing gradients, so only those GANs with an initialized value below 0.6 are considered for Fig. 4. Furthermore, to reduce the impact of fluctuations in the initialization, the average value over ten random initializations are used.By construction, the default template method does not involve neural networks and thus is constant.As desired, all of the GAN approaches converge to a separation power that is 4 There is no unique way to monitor the GAN performance during training.For a typical GAN trained with images, this is qualitatively different than classifier training because the entire multi-dimensional probability is being modeled, not just the likelihood ratio.The 1D case here is not as extreme, but still different than classification or regression monitoring.One can use the full GAN loss, the discriminator loss, or any divergence that gives a 'distance' between probability distributions.This particular divergence is popular in HEP and is therefore used as a diagnostic here.
5 Due to their robustness to pileup and excellent angular resolution, charged-particle tracks are associated to jets and used as proxy for the number of particles inside the jet. 6The estimated separation power, for our dataset size, is not sensitive to increasing bin number beyond 8. smaller than the template method, as they have access to more and unbinned information.For both the vanilla and the WGAN, using a skew-normal distribution for the noise accelerates the training time.The Vanilla GAN also converges faster than the WGAN, though all GAN approaches have a similar final separation power.
The mass distributions in the validation region are presented in Fig. 5, in bins of jet p T , η, and N .The average jet mass scales approximately as α s × p T × R and the width of the distribution also grows with p T .Given the jet p T , the jet mass should be approximately independent of η, aside from detector effects.Gluon jets have a large jet mass than quark jets and also a higher constituent multiplicity so there is a positive correlation between N and mass.
Overall, the level of agreement between the GAN and the real mass distributions is better than for the template method and the real distributions.This is particularly true for N , where the real mass is shifted to lower values for low N (more quark-like) and to higher values for high N (more gluon-like).The GAN is typically well within 50% of the real distribution, while the template method can be much more than a factor of 2 off of the real distribution for low ) mass distributions compared with distributions from the template method and the vanilla GAN in bins of jet p T (top row), η (middle row), and N (bottom row).The uncertainty in the ratio was calculated as the 1-sigma error assuming poisson distributions of events in each bin.The error shown in the plots is the calculated statistical error.The corresponding plot in the control region is qualitatively similar, but converges quicker.and high N .This is particularly important if the quark/gluon composition is different between the control, validation, and signal regions, either by chance or because some quark-tagging is applied to suppress the QCD background in such high multiplicity final states [53].
The modeling of the jet mass distribution in the validation region is used to determine systematic uncertainties on the templates.Figure 6 explicitly constructs the systematic uncertainty as the sum in quadrature of the non-closure from the validation region and the control region statistical uncertainty.These uncertainties are computed for the j∈J m j distribution (sum of the masses of the top four jets in the event), which is the main observable used in the RPV SUSY search [8,9].In blue we have the deviation from the exact ratio 1 for the template mass distributions and the GAN mass distributions.The GAN outperforms the template method in both the low mass and high mass limits; also note that when placing a cut on N track , the performance of the GAN becomes even more pronounced.This is particularly encouraging because we expect such RPV-SUSY signals to be quark jet dominated, with a lower multiplicity on average than a gluon jet dominated background.For a 50% quark jet efficiency requirement on all four jets (N track < 26), the uncertainty for the GAN approach is about 20% in the high i∈J m j tail while it is well over 100% for the template appraoch.
An important part of any background estimation technique is the associated systematic uncertainty.One of the main sources of uncertainty here is the limited size of the training set in the control region.The authors of Ref. [7] suggested a bootstrapping technique to estimate the uncertainty by rerunning the template procedure on bootstrapped datasets.In principle, one could do the same procedure for the GAN training, with one GAN per bootstrap dataset.More sophisticated methods include modeling GAN weights and biases as nuisance parameters to be profiled by the data with prior distributions.
A challenge for assessing previous GAN applications in HEP is that they have been designed to model high-dimensional feature spaces that are difficult to visualize and study [11][12][13][14][15][16][17][18][19][20][21].The mass distribution example presented here provides a concrete testing ground to study GAN approaches where quantitative agreement can be studied and achieved using existing techniques.While this study used only leading-order simulations of jet production, the methods are applicable more generally and can be applied to collision data from the LHC experiments.

Conclusions
Generative Adversarial Networks have been proposed as an alternative to histogram-based mass templates for the background estimation in LHC searches for RPV SUSY.These methods rely on the approximate QCD factorization whereby a jets type and kinematic properties are sufficient for determining distribution of the jet mass.The neural network approaches are naturally unbinned and can be readily conditioned on multiple jet properties.In addition to using vanilla GANs for this purpose, a modification to the traditional WGAN approach has been investigated where weight clipping is replaced with drawing weights from a naturally compact set (in this case, the circle).Both the vanilla and modified WGAN approaches were able to outperform the histogram method, especially when modeling the dependence on features not used in the histogram construction.When training such generative models for physical applications, the usual limitations of the method apply such as the potential for overfitting, sensitivity to hyperparameters, and vanishing gradients slowing training downthough methods of circumnavigating these limitations have been studied in the last few years, such as using 'softer metrics' such as the Wasserstein metric.These results can be useful for enhancing the sensitivity of LHC searches to high-multiplicity final states involving many quarks and gluons and serve as a useful benchmark where GANs may have immediate benefit to the HEP community.

Figure 1 .
Figure 1.Flowchart describing how GANs are used to learn templates (shown here mass templates for the RPV-SUSY search) given kinematic variables.The generator network is a feed-forward network that takes as input particles with their mass replaced by noise, and generates mass according to a learnt distribution.These fake particles are bundled with real particles and passed to the discriminator, which learns to discriminate between the real and fake distributions.

Figure 4 .
Figure 4.Estimated separation power between the generated jet-kinematic distribution and the real jet-kinematic distribution for various GAN architectures.The template error corresponds to the baseline approach with no neural networks and is thus independent of the number of training epochs.

Figure 5 .
Figure5.The physics-based ('real') mass distributions compared with distributions from the template method and the vanilla GAN in bins of jet p T (top row), η (middle row), and N (bottom row).The uncertainty in the ratio was calculated as the 1-sigma error assuming poisson distributions of events in each bin.The error shown in the plots is the calculated statistical error.The corresponding plot in the control region is qualitatively similar, but converges quicker.