DijetGAN: A Generative-Adversarial Network Approach for the Simulation of QCD Dijet Events at the LHC

A Generative-Adversarial Network (GAN) based on convolutional neural networks is used to simulate the production of pairs of jets at the LHC. The GAN is trained on events generated using MadGraph5 + Pythia8, and Delphes3 fast detector simulation. We demonstrate that a number of kinematic distributions both at Monte Carlo truth level and after the detector simulation can be reproduced by the generator network with a very good level of agreement. The code can be checked out or forked from the publicly accessible online repository https://gitlab.cern.ch/disipio/DiJetGAN .


Introduction
In the forthcoming years, experiments at the Large Hadron Collider (LHC) are expected to cope with a deluge of data. At the same time, the strategy to produce reliable and statistically large samples of simulated proton-proton (pp) inelastic collisions will be facing both technological limitations and new opportunities.
In the context of Machine Learning (ML) and specifically Deep Neural Networks (DNNs), an unsupervised learning technique called Generative-Adversarial Networks (GANs) has been proposed recently [1] to approximate a probability density function (pdf) from a set of unlabelled training examples. In practice, GANs have been widely used to generate photorealistic portraits [2], music [3], and recently also to simulate the response of a calorimeter to the passage of particles [4][5][6]. Other deep learning methodologies such as variational autoencoders (VAE) [7] have been applied to reproduce kinematic distributions learned from examples taken from Monte Carlo simulations, but failed in practice when a GAN was used [8]. In this work, we successfully trained a GAN to reproduce kinematic distributions, improving on such previous attempts, thanks to a careful consideration of the implicit symmetries of the physics process under study and the employment of convolutional layers [9][10][11].
In a GAN, a generative network G transforms a vector of random numbers (input noise) z ∼ p z into a sample carrying some physical meaning, which in this case are the fourmomenta of the two jets. In practical applications, p z is usually a N -dimensional uniform distribution in the range [0, 1] N . Subsequently, a discriminative network D estimates the probability that a given sample comes either from the training data or the generator. The two samples are distributed with probability density functions p data and p f ake respectively, with p data fixed and usually estimated using a Monte Carlo method. The Nash equilibrium (min-max game) is reached when D is unable to distinguish fake examples from real data, hence the generator has been trained to be a good approximator of the data pdf, i.e. p f ake ∼ p data . The procedure is equivalent to the minimization of the loss function L(G, D) : For a fixed G, the optimal discriminator D G is given by Under this condition, the min-max game can be reformulated as the minimization of the Jensen-Shannon divergence JS, which can be expressed in terms of the Kullback-Leibler divergence KL, i.e. a measure of how one probability distribution is different from a second: The optimal generator G is such that p f ake = p data , hence the minimum of the loss function is achieved, i.e. D G (x) = 1/2, JS(p f ake ||p data ) = 0 and L(G , D ) = −2 log 2.
In this article, we introduce the architecture of a GAN that aims to simulate the production of pairs of particles in pp interactions at the LHC. In particular, this approach is applied to dijet production, which is a ubiquitous source of background to Standard Model precision measurements [12-14] and searches for physics beyond the Standard Model [15][16][17][18].
Currently, both the ATLAS [20] and CMS [21] experiments of the LHC at CERN deploy Monte Carlo (MC) event generators such as MadGraph5 [22], POWHEG-BOX [23] and MC@NLO [24] to simulate the hard-scattering (HS) process, Pythia8 [25] and Her-wig7 [26] for the parton-shower (PS) process, and a GEANT4 [28] simulation of the actual detector for the response of the experimental apparatus. Best estimates suggest that the simulation of a single event takes already several minutes [29], with O(10 9 ) events to be generated for each simulation campaign, leading to a huge computational footprint both in terms of CPU usage and disk space. The situation is particularly critical for processes that require the matching of fixed-order next-to-leading matrix element calculations to parton shower ("multijet merging"), as implemented for example in the FxFx [30], MePSNLO [31] and UNLoPS [32] methods. Currently, such advanced MC generators cannot be used as the standard generators by the LHC collaborations due to the large time required to generate the billions of events normally required. Hence the experiments need to rely on simpler, but less accurate, generators. Providing a solution to extend the simulated events to the requirements of the LHC experiments will significantly enhance a wide range of measurements. Similarly, detailed simulation based on GEANT4will not be an affordable solution due to the large time required to simulate an event [29]. Both experiments are already using fast simulation and are developing new tools exploiting ML and other advanced statistical techniques. We will demonstrate that the same GAN used for reproducing the generator output can be also used to reproduce a simulation of a detector response with a significant time gain with respect to full simulation.

Physics of QCD dijet events
At hadron colliders such as the LHC, pp collision result in the interaction of quarks and gluons (collectively referred to as partons). According to calculations based on the SM, these parton-scattering processes via strong interactions described by quantum chromodynamics (QCD) result in the overwhelming majority of the cases in two outgoing partons which carry a net color charge and evolve from high to low virtuality producing parton showers, which eventually hadronize into collimated highly-energetic clusters of particles called jets. Hence, 2→2 parton scattering processes with a pair of jets in the final state are called dijet events. The relationship between the clusters and the original partons is revealed by the execution of a clustering algorithm [27]. One can think of a jet approximately as a cone of radius R whose axis correspond to the direction of flight of the initial parton. The size of the radius can be controlled by setting a distance parameter in the clustering algorithm.
Usually, jets emerge from pp collisions with high transverse momentum (p T ) and large angle with respect to the incoming partons. The jet mass, defined as the norm of the four-momentum sum of constituents inside a jet, is only loosely related to the mass of the originating parton, and comes mostly from the dynamics of strong interactions. Programs such as Pythia8 [25] and Herwig7 [26] implement such calculations with beyond the leading-logarithm (LL) accuracy in what are called a Parton Shower algorithms. The jet mass also plays a key role in the identification of Lorentz-boosted hadronically decaying massive particles such as top quarks [12, 17], vector (W and Z) and Higgs bosons [19]. Finally, QCD predicts a characteristically smooth and monotonically decreasing distribution for the dijet invariant mass (m jj ), and small production angle. Instead, many theories beyond the SM predict the presence of additional massive particles decaying to dijet pairs, whose hypothetical presence would distort both the dijet invariant mass and production angle distributions in model-dependent ways [15,16]. It is therefore extremely important to be able to reproduce these distributions in a simulation, with particular emphasis on the region of the phase space with m jj > 1 TeV, where signs of physics beyond the Standard Model may become evident.
In the following sections, the agreement between MC calculations and the output of the GAN is evaluated by comparing the individual jets' and dijet system's transverse momentum, pseudo-rapidity 1 (η) and mass distributions. The χ 2 between the MC and the GAN distributions is used as figure of merit.

Monte Carlo Sample
A sample of 2 million dijet events has been generated using MadGraph5 and Pythia8. The response of the detector was simulated by a Delphes3 [33] fast simulation, using settings that resemble the ATLAS detector. An average of 25 additional soft-QCD pp collisions (pile-up) were overlaid to reproduce more realistic data-taking conditions.
Electrons, muons, jets and missing transverse energy are reconstructed by Delphes3 algorithms before and after the detector simulation. These two levels of reconstruction are referred to in the following sections as particle-and reco-level respectively. At particlelevel, only stable final-state particles, i.e. particles that are not decayed further by the generator, and unstable particles 2 that are to be decayed later by the detector simulation, are considered. Jets were reconstructed using the anti-k T algorithm [34] as implemented in FastJet [35], with a distance parameter R = 1.0.
To increase the number of events with both jets with p T > 250 GeV, a cut on the scalar sum of the transverse momenta of the outgoing partons H T > 500 GeV was applied. Approximately 1,5 million events passed this selection at particle level, and about 800,000 at reco level. These events were used to train the network in the subsequent steps.

Network Architecture
The overall architecture of the network, summarized in Fig. 1, is composed of two main blocks: a generator (G) and a discriminator (D), based on convolutional layers. All layers have LeakyReLU activation functions [36] except the last layers that have either tanh or sigmoid for the generator and the discriminator respectively. The generator transforms a vector of 128 random numbers drawn from a uniform distribution in the [0, 1] range into a vector of 7 elements representing the p T , η and mass of the leading jet, and the p T , η, φ and mass of the second-leading jet. The network is implemented and trained using Keras v2.2.4 [37] with Tensorflow v1.12 [38] back-end. Input features are scaled in the range [−1, 1] and pre-and post-processed using the scikit-learn [39] and Pandas [40] libraries. 1 Pseudorapidity is a commonly spatial coordinate describing the angle of a particle relative to the beam axis defined as η = 1 2 ln E+p L E−p L , where E is the energy and pL is the longitudinal component of the momentum. It is related to the other components of the momentum via the relationship |p| = pT cosh η. 2 Particles with a mean lifetime τ > 300 ps The loss function of the generator is mean squared error, while that of the discriminator is the binary cross-entropy. The optimizer is in both cases Adam [41] with learning rate lr = 10 −5 , β 1 = 0.5 and β 2 = 0.9. The parameters described above are those that provide the best results among many values and configurations tested. Having reached a satisfactory performance, no further parameter optimisation was carried out.

Training
For the purpose of the training, all MC events were rotated so that the azimuthal φ angle of the leading jet is always zero. A significant performance improvement was achieved by exploiting the intrinsic φ symmetry in di-jet events; the φ of the leading jet is set to zero while the φ of the other jet is set to the absolute value of the difference in φ between the two generated jets. This transformation is reversed when events are generated. In order to further deploy the symmetries of dijet kinematics, every event is used twice: first in its original configuration, and then with the sign of the pseudorapidity of each jet reversed (η-flip). During event generation, the η of the jets is randomly flipped to remove any nonphysical effects that could be introduced by the GAN. The network was then trained for 100,000 iterations with mini-batches of 32 events each, drawn from the original distribution and from the noise-generated fakes. It took about one hour to complete the training on a GPU NVIDIA Quadro P6000. For each iteration, we first trained the discriminator to distinguish between real and fake events. Then, the discriminator weights are fixed and the generator is trained. Figs. 2 and 3 show the discriminator loss and accuracy, and the generator loss as a function of the training epoch at particle and reco level respectively. The stationary state between the generator and the discriminator, also known as Nash equilibrium, is reached after a few thousands epochs. However, the agreement between the MC-and GAN-generated distributions improves in terms of χ 2 as the number of iterations increases, and levels out after about 60,000 epochs. This is true for both particle and reco level, and can be easily understood as a consequence of the stabilization of the generator losses at around the same epoch.

Event Generation and Final Results
During the training, the weights of the generator model are saved into a file every 5000 epochs and used subsequently to generate an arbitrary number of events. It takes about 80 seconds to generate 1 million events on a GPU NVIDIA Quadro P6000. After the generation, events are filtered by applying the same kinematic cuts we applied to the real MC events, i.e. both jets with p T > 250 GeV, ordered by decreasing p T . Approximately 90% fulfill these requirements and are used to fill the histograms.
Figs. 4 and 5 show the comparison of the two leading jets and dijet system kinematics at particle-and reco-level respectively, as they appear at the iteration that yields the best agreement in terms of overall χ 2 over degrees of freedom. Overall, the level of agreement is satisfactory over a large range of the kinematic regime.  We further investigated the agreement in regions of the phase space with low crosssection, in particular where the dijet invariant mass is in the multi-TeV regime. This kinematic region is of particularly interest for searches of physics beyond the SM. A very common approach is to fit the MC sample with the following four-parameters (4p) analytic function: where x = m jj / √ s and p 0 , p 1 , p 2 , p 3 are the free parameters of the fit. Such function is motivated by the structure of parton distribution functions and has been widely used by Tevatron and LHC experiments [46].
We trained the GAN using a sub-sample corresponding to about 40,000 events with m jj >1.5 TeV. Then, we used the trained model to generate 10 million events, i.e. a sample that is much larger than the one created using MadGraph5+Pythia8.
As shown in Fig.6, limiting the fit in the region between 3 and 10 TeV, the 4p analytic function can predict the shape of the MC distribution with a χ 2 /NDF = 1.04. In the same kinematic region, the sample generated with the GAN shows and agreement of χ 2 /NDF = 1.29. However, one has to take in mind that the 4p fit does not allow the user to generate event, but only to make an estimate of the background due to multijet production in that particular kinematic region and only for that specific observable. Therefore, events produced with our GAN can significantly expand the techniques used by analysis teams in determining their background.

Possible extensions of the method
The baseline architecture described in Sec. 4 can be modified for more advanced purposes that go beyond the scope of this initial work, but are relevant for practical usage in collider experiments. Both extensions were implemented and produced results compatible with those presented above. The corresponding code is available in the repository.

Arbitrary number of input variables
In future application of the method to processes whose final states involve more than two particles, it would be desirable to have a more generic handling of the input variables. A common way to achieve this is to apply a Principal Component Analysis (PCA) to the input vector, a procedure that is often referred to as whitening in Deep Learning [42]. The purpose is to reduce the dimensionality of the input vector, while at the same time retaining most of the information needed to extract useful features. PCA can be generalized to a non-linear transformation by using an unsupervised Deep Learning technique called autoencoder [43]. A deep network (encoder) transforms the input in the visible representation x vis ∈ R N to a corresponding latent representation x lat ∈ R M with lower dimensionality, i.e. M < N . Subsequently, the latent vector is fed into another network (decoder) that transforms back to the visible space. The complete chain of transformation is thus R N → R M → R N . For the purpose of this application, the generator and discriminator networks are trained using events transformed into the latent space representation (R M ) using the encoder network. After the GAN is trained, the generator is used to create events as described above. However, before filling the histograms, the generated events are transformed back to the physical representation (R N ) using the decoder network.

Conditioning on external variable
It is a common problem in searches for physics beyond the SM and precision measurements that regions of the phase space with large invariant mass or transverse momentum (typically in the TeV regime) have very low cross-sections and are hence under-represented in the simulated samples, yielding a large statistical uncertainty that may limit the sensitivity. Two classical solutions are either to generate very large samples and discard uninteresting events, or to bias the event generation by over-sampling certain regions of the phase-space. The first, brute-force approach, usually requires an unusually large CPU and disk space usage, while the second method requires events to be weighted, which can yield large fluctuations and uncertainties introduced by the event selection. A third possibility is to create so-called sliced samples, i.e. the event generation is split into a number of sub-samples in which the value of a certain variable that controls the event kinematics (such as the invariant mass or the transverse momentum of the dijet system) is limited to a certain range. The sub-samples are then added together to obtain an inclusive sample with comparable statistical uncertainty all over the range of certain variables of interest (e.g. the transverse momentum of the leading jet). In this regard, Generative-Adversarial Networks offer a similar possibility to bias the event generation without introducing event weights by the means of the so-called variable conditioning [44]. This is achieved by adding an auxiliary parameter to the input that controls the way noise is transformed into physical events. Examples of such conditioning parameter, as in the case of sliced MC samples, are the invariant mass or the transverse momentum of the dijet system, or the average number of pile-up interactions per bunch crossing µ . The number of conditioning parameters does not have to be limited to just one: for example, as in the case of the phenomenological supersymmetric standard model (pMSSM) [45], the generator network may be conditioned to create events for any given pair of (mg, mχ0) masses. This could be further extended to include more supersymmetric parameters.

Conclusions and Outlook
The Generative-Adversarial Network presented in this paper provides a novel and attractive solution to reduce the usage of CPU and possibly disk space to generate and simulate events at the LHC experiments. While still in its infancy, this method provides a unique opportunity to improve the quality of the MC used by the LHC collaborations as they will be able to use generators that are currently too time consuming to use. In the future, it should be possible to generalize this approach to more complicated processes such as top-quark pair or vector boson production in association with jets; the best MC predictions of these processes are also limited by high CPU requirements. Our results comparing simulated events show that our GAN can reproduce simulated events with high accuracy. This proof-of-concept shows the potential of these tools to provide an efficient solution to the large number of simulated events required by the ambitious physics programme of the LHC experiments. Future work will also focus on more advanced methods to further stabilize the training and avoid model collapse, while still being able to fit the relevant kinematic distributions in regions of the phase-space with low cross-sections.
In this region of the phase space, the jet mass is expected to have a peak around the top mass, which is set to 172.5 GeV in the MC simulation. Such configuration is known as "fully contained" top quarks. However, in some cases the b-quarks are produced at an angle such that only the W boson is actually found within ∆R <1.0 from the jet axis. In this case, a secondary peak appears around the W boson mass, set to 80.4 GeV in the MC simulation. The total jet mass distribution is then bi-modal. Apart from these differences, overall the events at particle level look very similar to QCD dijet ones, hence our GAN should be able to deal with this alternative physics process. It is worth stressing the fact that the information about the top quark and W boson masses are not fed into the GAN, but have to be inferred by the network during the training. As can be seen from Fig. 7, the agreement between the MadGraph5+Pythia8MC and the GAN output is in fact satisfactory. Figure 7. Comparison of kinematic observables for the all-hadronic tt production with respect to particle-level (MadGraph5+Pythia8) Monte Carlo simulation. The gray area represents the MC prediction, and the black line indicates the GAN output.
[12] Measurements of tt differential cross-sections of highly boosted top quarks decaying to all-hadronic final states in pp collisions at √ s = 13 TeV using the ATLAS detector, The ATLAS Collaboration, Phys. Rev. D 98, 012003 (2018)