Deep learning as a parton shower

We make the connection between certain deep learning architectures and the renormalisation group explicit in the context of QCD by using a deep learning network to construct a toy parton shower model. The model aims to describe proton-proton collisions at the Large Hadron Collider. A convolutional autoencoder learns a set of kernels that efficiently encode the behaviour of fully showered QCD collision events. The network is structured recursively so as to ensure self-similarity, and the number of trained network parameters is low. Randomness is introduced via a novel custom masking layer, which also preserves existing parton splittings by using layer-skipping connections. By applying a shower merging procedure, the network can be evaluated on unshowered events produced by a matrix element calculation. The trained network behaves as a parton shower that qualitatively reproduces jet-based observables.


Introduction
The renormalisation group provides a set of rules that describe how a system evolves under a re-scaling transformation. This is expressed in parton shower models by the repeated evaluation of a splitting kernel over an ordered hierarchy of scales, which results in the self-similarity that is characteristic of renormalisable models. We have previously exploited this behaviour by using wavelet decomposition [1] to extract features from radiation patterns in proton-proton collision events at both small and large angles.
A single layer in a convolutional neural network (CNN) is very similar to a single-level wavelet decomposition. Indeed, with appropriate network parameters, the CNN is a wavelet decomposition. Since the wavelet basis can be used to reveal the angular evolution of a parton shower, this raises the intriguing possibility that a CNN could be structured in such a way that it encodes, and behaves as, a parton shower model. That the hierarchical structure of deep learning architectures is formally connected to the behaviour of the renormalisation group is an area of active interest, see e.g. [2][3][4]; in this paper we will make the connection between these ideas obvious by constructing a toy parton shower model using a deep learning neural network whose design has been inspired by wavelet decomposition.
An autoencoding neural network takes an input of high dimensionality, compresses it to a bottleneck of a small number of network nodes, then reinflates the compressed values to recover the input data as the target. In so doing, the behaviour of the input data is encoded in the network parameters. The compression stage of a convolutional autoencoder uses a series of convolutional layers interspersed with pooling layers to repeatedly reduce the dimensionality of the input. Having compressed the data at the bottleneck, the reinflation half of the autoencoder again uses (different) convolutional layers interspersed with up-scaling. Figure 1 shows the general form of a convolutional autoencoder, where the use of gluon lines is intended to make explicit the similarity between the re-inflation stage and an iterative parton shower.

Figure 1. Comparison between the structure of a convolutional autoencoder and a parton shower. A pattern is input on the left and the value at each network node is determined by the weighted sum of the connected nodes.
The scaling behaviour of a parton shower means that the (de-)convolutional kernels used in each layer of the autoencoding CNN should be related to those used in all the other layers. In practice, this means that the same convolutional kernels are used in each layer, which ensures self-similarity over the different angular scales that the network layers represent. The re-use of the same kernel in multiple layers also means there are a relatively small number of independent network parameters, even if the network is deep. This multi-scale coarse-graining approach means that behaviour that the network learns at one angular scale is applied to all angular scales down to the cut-off.
Building a deep learning network that can approximate the behaviour of QCD is a useful exercise for several reasons: such a network can help provide insights into why neural networks (sometimes!) work so well for analysis tasks; the network can extract features and observables directly from data, which can be used to confront existing shower models; the trained autoencoder will not fit data that is different from that it is trained on, hence could be used to identify signal that differs from the QCD background; and the toy model trained directly on data can provide a useful comparison to existing methods for tuning Monte Carlo models.

The layout of the autoencoding CNN is shown in Figure 2. A set of F 2-dimensional convolutional layers (Conv2D) is used in the compression half of the autoencoder, and F transposed 2-dimensional convolution layers (Conv2DTranspose) are used in the reinflation half. The Conv2D layers are defined with kernels of size k × k and pad the input so that the output from the layer has the same size as the input. The Conv2DTranspose layers are also defined with kernels of size k × k, but use a stride size of k together with padding of the input to ensure that the output from the layer has dimensions kN × kN, where the dimensionality of the input to the layer is N × N. Event data is provided to the autoencoding CNN model in the form of pixel arrays that represent the emissions of energy from each proton-proton collision.
The input image is passed through the F Conv2D layers, which results in a stack of F output images, each of which is the same size as the original input image. A max-pooling layer is used across these F images, so that each pixel site output by the max-pooling layer is the maximum value of the corresponding pixel sites in the F input images. The output of the max-pooling across the F convolutions is thus a single image the same size as the initial input image. This single image is then downsampled by a further spatial max-pooling layer. The spatial max-pooling uses a pool size of k, meaning that an initial image of size kN × kN is downsampled to an N × N image. The combined effect of the filter max-pooling followed by spatial max-pooling is to combine a k × k region of pixels into a single pixel, using the filter that best matches the shape of the input for that region.
This N × N image is once again passed to the same set of F Conv2D layers, followed by the max-pooling layers, to further reduce the dimensionality of the image. The sequence of convolution followed by max-pooling is repeated until the output image size is k × k. Note that if such a k × k image were again to be passed through the Conv2D and max-pooling layers, the result would be a single pixel.
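As a concrete illustration, the compression stage described above can be sketched in plain numpy. This is a simplified stand-in for the Keras layers, not the trained model: the kernel values and the 'same'-padding convention used here are purely illustrative.

```python
import numpy as np

def conv2d_same(image, kernel):
    """2-D 'same' convolution of a square image with a k x k kernel."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    n = image.shape[0]
    out = np.zeros_like(image, dtype=float)
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def compress_once(image, kernels, k):
    """One compression level: F convolutions, max across the filter stack,
    then k x k spatial max-pooling (kN x kN -> N x N)."""
    stack = np.stack([conv2d_same(image, ker) for ker in kernels])
    merged = stack.max(axis=0)                            # filter max-pooling
    n = image.shape[0] // k
    return merged.reshape(n, k, n, k).max(axis=(1, 3))    # spatial max-pooling

def compress(image, kernels, k):
    """Repeat convolution + max-pooling until the image is k x k (the bottleneck)."""
    while image.shape[0] > k:
        image = compress_once(image, kernels, k)
    return image
```

For an 8 × 8 input with k = 2 this yields a 4 × 4 image after one level and the 2 × 2 bottleneck after two.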
The fully compressed k × k image is then passed to the set of F Conv2DTranspose layers, which results in a stack of F k² × k² images. The stack of upscaled images is converted to a single image by using a custom layer that we have named FilterMask. The FilterMask layer uses the corresponding k² × k² → k × k downsampling in the compression stage of the autoencoder to decide which of the pixels in the stack of images output by the Conv2DTranspose layers should be used. The FilterMask is a stack of F k² × k² images in which each pixel is either zero or one. Pixels are one if the corresponding pixel in the Conv2D output image is the maximum in that stack of pixels. All other pixels in the FilterMask are zero. The output stack of images from the Conv2DTranspose is multiplied by the stack of FilterMask images, so that at most one pixel in the stack is non-zero. This set of masked images is then converted to a single image by summing all of the pixels in a stack. The FilterMask thus transfers information about which Conv2D filter was active in the compression stage to the reinflation stage, meaning that the Conv2DTranspose filter kernels are dependent on the upstream Conv2D kernels. This also means that there is a mechanism by which a splitting present in the input event can be preserved through the network, while admitting a degree of randomness. The FilterMask keeps a counter of the number of times each Conv2D filter was active during training, and converts these rates into a set of probabilities that have unit sum. In the case that all Conv2D filters produce identical (therefore zero) output in the same pixel, the FilterMask randomly picks a single filter to activate according to its recorded probability. This means that if a single active pixel is passed into the CNN, it will be propagated to the k × k bottleneck as a single pixel but then reinflated using randomly activated Conv2DTranspose upscaling filters. It is this feature of the FilterMask (preserving input splittings when they exist, while producing random splittings when none are present) that allows the autoencoding CNN to behave as a parton shower.
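The FilterMask logic can be sketched as follows. This is a simplified numpy rendering of the custom Keras layer described above; the function names `filter_mask` and `merge_with_mask` are hypothetical.

```python
import numpy as np

def filter_mask(conv_stack, probs, rng):
    """One-hot mask over the filter axis of an (F, N, N) Conv2D output stack.
    Where every filter output is zero, a filter is instead activated at
    random according to its learned activation probability."""
    F = conv_stack.shape[0]
    winners = conv_stack.argmax(axis=0)        # winning filter per pixel site
    dead = conv_stack.max(axis=0) == 0.0       # sites where all filters are zero
    winners[dead] = rng.choice(F, size=int(dead.sum()), p=probs)
    return (winners[None] == np.arange(F)[:, None, None]).astype(float)

def merge_with_mask(deconv_stack, mask):
    """Multiply the Conv2DTranspose stack by the mask (at most one non-zero
    pixel per stack site survives) and sum the stack into a single image."""
    return (deconv_stack * mask).sum(axis=0)
```

In the real layer the activation probabilities are accumulated during training; here they are passed in directly.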
Having inflated the k × k image to k² × k² and applied the FilterMask derived from the k² × k² → k × k compression stage, the output k² × k² image is once again passed to the stack of Conv2DTranspose filters, which output a stack of k³ × k³ images. A FilterMask is once again used to merge the stack, but this time the mask is taken from the k³ × k³ → k² × k² compression stage. Different FilterMask layers, with different learned filter probabilities, are thus used at the different compression and reinflation levels. This allows the filter activation probabilities to evolve with angular scale.
The process of applying the stack of Conv2DTranspose filters, followed by merging with the corresponding FilterMask from the compression stage, is repeated until the image is the same size as the original input image.
The CNN is implemented in python using Keras [5] with the TensorFlow [6] backend and is evaluated using a pair of Nvidia graphics processing units (GPUs).

Loss Function
The loss function of a neural network is the objective that should be minimised in order to best describe the input data. A common simple loss function for a convolutional autoencoder is to take the mean squared error (MSE) between the input and output image, summing over all the pixels. Minimising the MSE means that the output of the network is as similar as possible to the input.
However, there are two problems with such a loss function. The first is that the MSE is very susceptible to aliasing effects, in which an output emission falls in a pixel neighbouring a similar input emission; the MSE penalises the network equally whether it produces an emission near to a target emission or very far away. The second problem is that the input event images are sparse; there are 4096 pixels in a 64 × 64 grid, but a single event may contain only O(10–100) emissions. Using a naive MSE loss means that the CNN will mainly learn about empty pixels, and will be biased towards producing no output activity.
These two problems are solved by modifying the naive MSE. Both the input target and the CNN output are blurred by using a set of truncated Gaussian kernels. An MSE-type loss is calculated for each Gaussian kernel, and a weighted sum of the losses is performed as in equation 2.1,

L = \sum_i w_i \, M\left(T^i_{\gamma\delta}, O^i_{\gamma\delta}\right),    (2.1)

where T_{\alpha\beta} is the target event image that is input to the CNN and O_{\alpha\beta} is the output image of the CNN. G^i_{\gamma\delta} are a set of truncated Gaussian kernels, and T^i_{\gamma\delta} and O^i_{\gamma\delta} are versions of the input and output, respectively, that have been blurred by G^i_{\gamma\delta}. The loss function, L, is the weighted sum over an MSE-like function, M(T^i_{\gamma\delta}, O^i_{\gamma\delta}), using weights w_i. The truncated Gaussian kernels are exemplified by equation 2.2,

G = \begin{pmatrix} 0.0947 & 0.118 & 0.0947 \\ 0.118 & 0.148 & 0.118 \\ 0.0947 & 0.118 & 0.0947 \end{pmatrix}.    (2.2)

The MSE-like function has two contributions, one from pixels in which there is activity in the target image, and one from pixels in which there is no target activity,

M\left(T^i, O^i\right) = \sum_{\alpha\beta} M_{\alpha\beta} \left(T^i_{\alpha\beta} - O^i_{\alpha\beta}\right)^2 + \left[\sum_{\alpha\beta} \left(1 - M_{\alpha\beta}\right) O^i_{\alpha\beta}\right]^2,    (2.3)

where M_{\alpha\beta} is a mask image whose pixels have value 1 if the corresponding pixel in T^i_{\alpha\beta} is non-zero, and are zero otherwise. The effect of equation 2.3 is to treat all pixels that do not have any target activity in them as a single pixel.
The weights, w_i, give some control over the degree to which the loss penalises the network for producing activity at a large angle to a target emission. Increasing weight w_3, which applies to the G^3 Gaussian kernel, causes the loss function to reduce the penalty for producing emissions at a wide angle to the target emission. Nevertheless, regardless of the weights chosen, the loss is always minimised by producing output in exactly the same pixel as the target.
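A minimal numpy sketch of the blurred, masked loss may help to make the construction concrete. The treatment of the empty-pixel contribution (summing all empty-pixel activity before squaring) is our reading of the text; with sigma ≈ 1.5 the 3 × 3 kernel below reproduces the truncated-Gaussian values quoted for the loss.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Truncated (finite-support) Gaussian kernel, normalised to unit sum."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def blur(image, kernel):
    """'Same' convolution with zero padding."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    n = image.shape[0]
    return np.array([[np.sum(padded[i:i + k, j:j + k] * kernel)
                      for j in range(n)] for i in range(n)])

def masked_mse(target, output):
    """Active target pixels contribute individually; all empty pixels are
    treated together as if they were a single pixel (assumed form)."""
    active = target != 0
    loss_active = np.sum((target[active] - output[active]) ** 2)
    loss_empty = np.sum(output[~active]) ** 2
    return loss_active + loss_empty

def blurred_loss(target, output, kernels, weights):
    """Weighted sum of the masked MSE over the set of Gaussian blurs."""
    return sum(w * masked_mse(blur(target, g), blur(output, g))
               for g, w in zip(kernels, weights))
```

With this construction an emission one pixel away from its target is penalised less than one far away, because the blurred images overlap.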

Regularisation of Network Kernels
During training of the network, it may be possible for the convolutional kernel weights to diverge or become infinitesimal. In order to prevent this, dropout layers that randomly mask the output of the convolutional layers would typically be used to regularise the kernel weights. However, dropout cannot be used here because the max-pooling across convolutional filters is non-linear. Dropout works well when the network approximates a linear summation of neurons, because it allows many subsets of the available network structures to be explored. However, in this model the convolutional kernels interact with each other, so it is not possible to drop a single network node without radically altering the network behaviour.
In lieu of dropout, a regularisation penalty is applied to prevent the learned convolutional kernel values from diverging. The kernels are similar to shower splittings, and so a regularisation term is added that penalises kernels that deviate from energy conservation. The kernel penalty term for each Conv2D filter with kernel weights C is given by equation 2.4,

P = \left(1 - \sum_{ab} C_{ab}\right)^2.    (2.4)


Network Hyper-parameters

The choice of the size of the filter bank, F, is initially inspired by the desire for rotational invariance in the k2 model. With k = 2 there are four possible rotations of the kernel, plus a parity transformation, meaning that eight filters can cover all possible transformations of one kernel. One additional filter is added in order to allow for a non-splitting. For the larger k3 kernel, it is found that training a corresponding filter bank size of F = 17 is prohibitive, so the number of filters is reduced to F = 7. There remains sufficient flexibility in the model that it is still capable of encoding the shower. Future improvements to the training procedure, either in the loss function or the optimisation algorithm, may permit a larger filter bank.
The other model hyper-parameters are decided by training models with different sets of parameter values using a small sample of the training events and inspecting the evolution of the training and validation losses.

Monte Carlo Event Samples and Selection
Simulated samples of proton-proton events are needed in order to train and to test the CNN models. Sherpa 2.2.4 [7][8][9][10][11] is used to generate a sample of 8.5 million QCD proton-proton collision events with up to four outgoing parton legs in the matrix element calculation. The default Sherpa tune is used, with the NNPDF 3.0 PDF set [12] and a shower merging scale of 20 GeV. The beam energy is 6.5 TeV per proton. Hadronisation and multi-parton interactions (MPI) are turned off because they have different scaling characteristics to a pure parton shower model, and would therefore require a more complicated deep learning model than is studied here. A generator-level event selection is made that requires at least two R = 0.4 jets with p_T greater than 25 GeV in the matrix element calculation.
A subsequent event selection is made on the post-showering final-state particles that requires at least two R = 0.4 anti-k_t jets [13] with p_T greater than 40 GeV and rapidity satisfying |y| < (π − 0.4) in each event. The anti-k_t jet algorithm is run via the FastJet [14] library. This criterion selects approximately 0.5 million events from the initial 8.5 million generated events.
The CNN model requires pixel arrays as inputs. Each of the selected Sherpa events is converted to an N × N pixel-array image in rapidity-azimuth (y−φ) by identifying the pixel in the y−φ plane into which each particle is emitted and adding the particle p_T to the pixel value. Pixels have a value of zero if no particles are emitted into them. The pixel array covers the rapidity range −π ≤ y < π and the azimuthal range 0 ≤ φ < 2π. Each pixel array is normalised by dividing by the total sum of pixel values in the array and multiplying by N × N so that the average pixel value is one. Pixel arrays of size 64 × 64 and 81 × 81 are produced for use with models k2 and k3, respectively.
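The conversion of an event to a normalised pixel array can be sketched as below (the function name is hypothetical; particles are taken as (y, φ, p_T) tuples):

```python
import numpy as np

def event_to_pixels(particles, N):
    """Convert an iterable of (rapidity y, azimuth phi, pT) tuples into an
    N x N pixel array over -pi <= y < pi, 0 <= phi < 2*pi, normalised so
    that the average pixel value is one."""
    img = np.zeros((N, N))
    for y, phi, pt in particles:
        iy = int((y + np.pi) / (2 * np.pi) * N)
        iphi = int(phi / (2 * np.pi) * N)
        if 0 <= iy < N and 0 <= iphi < N:
            img[iy, iphi] += pt        # sum pT of co-located particles
    total = img.sum()
    if total > 0:
        img *= (N * N) / total         # average pixel value becomes one
    return img
```

A single particle at mid-rapidity then produces an array whose entire weight sits in one pixel, with mean pixel value one by construction.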
In order to test the trained CNN shower models, unshowered partons produced by a matrix element calculation are needed. An additional sample of 8.5 million matrix element (ME) Sherpa events generated without any parton shower is used for this purpose. These events have the same generator-level process and selection of two R = 0.4 jets with p_T greater than 25 GeV, but lack any post-shower selection. The full sample of 8.5 million ME events is converted to N × N pixel arrays, but is not normalised.

Training on Showered Events
The fully showered event pixel arrays are divided into a training sample, which contains 90% of the 0.5 million images, and a validation sample, which contains the remaining 10%.
Models are trained for several hundred epochs using the learning rates given in Table 1, with the Nadam optimiser [15,16]. During training, the model typically remains for some number of epochs in a moderate loss state, before falling into a lower loss state. After spending a small number of epochs in the low loss state, the model then undergoes a rapid increase in both training and validation loss and reaches a high loss state, before falling once again to a (different) moderate loss state. This evolution of both training and validation loss is shown in Figure 3. Note the lack of divergence between the training and validation loss, indicating that the rapid increase is not due to overfitting the training data. The lack of divergence is seen regardless of the choice of model hyper-parameters (section 2.4). This lack of over-fitting is probably due to the high dimensionality of the input data together with the relatively high statistical power of the input and the rather small number of learned parameters.
The probabilities stored in the FilterMask layers are the rates with which each filter is active, and the Shannon entropy of those probabilities is a measure of the complexity of the model state. If the Shannon entropy is high, the filters are all active with similar rates, whereas if the entropy is low then a small number of the filters dominate the description of the collision events. Figure 3 b) shows the evolution of the Shannon entropy of the model during the same training period as Figure 3 a). The entropy is normalised so that each FilterMask layer encodes at most one bit of information, and for k = 2 there are five FilterMask layers, so there is a maximum of five bits stored in the model. At the start of training, the model is in a high entropy state because the filters are active with quasi-random probabilities. When, during training, the loss falls to a minimum, the entropy also declines to moderate values, showing that the model is in a more ordered state. When the loss subsequently rises rapidly, the model undergoes a transition to a higher entropy state again. When trained for a very large number of epochs (not shown here), the entropy of the model gradually evolves to a very low entropy state, while the loss remains on a plateau. This long-term decline in entropy accompanied by a near-constant loss indicates that there are a large number of model states that are equally good at describing the collision events, and the model tends towards a state in which the model behaviour is dominated by a small number of the filters in the filter bank. This behaviour of the model complexity could in future potentially be used in more effective network training algorithms.
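The normalised Shannon entropy described above can be computed as, for example:

```python
import numpy as np

def filter_entropy_bits(probs):
    """Shannon entropy of one FilterMask's activation probabilities,
    normalised by log2(F) so the layer encodes at most one bit."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                         # 0 * log(0) contributes nothing
    h = -np.sum(p * np.log2(p))
    return h / np.log2(len(probs))

def model_entropy_bits(layer_probs):
    """Total over all FilterMask layers: at most one bit per layer."""
    return sum(filter_entropy_bits(p) for p in layer_probs)
```

With uniform activation rates each layer contributes its maximum of one bit; if a single filter dominates, the contribution falls towards zero.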

Comparison With Class IV Cellular Automata
Parton showers and the CNN implemented here are similar in both conception and behaviour to Cellular Automata (CA). Cellular Automata evolve a system from an initial state using a set of rules that describe how the current state should change under a discrete step. Parton showers employ a set of splitting rules to evolve the state of the shower between scales. Similarly, the CNN uses rules described by the Conv2D or Conv2DTranspose layers to step between angular scales. The change in the CNN during training shares some interesting aspects with the change in behaviour of CA as they move through their available "rule space." Cellular automata have been divided into four classifications [17][18][19]:

• Class I: The initial state evolves to a fixed pattern, and is not interesting for the present study.

• Class II: The evolution from the initial state is dominated by well-ordered periodic structures that are largely independent of the initial state.

• Class III: The system evolves chaotically and produces random patterns that are independent of the initial state.

• Class IV: The system evolves to produce complex states that are neither completely chaotic, nor completely ordered. The evolution is dependent on the initial state, and the rules interact in a non-linear way to produce complex behaviour.

One key indicator of the difference between classes II, III and IV is the entropy of the CA site activity. Class II typically produces low entropy states, while Class III produces high entropy states. As the rules of the CA are altered, it can undergo a (potentially rapid) phase transition between the classes.
Prior to training, the CNN behaves very much like a Class III Cellular Automaton. The network weights are initialised to random values and produce a large number of random emissions, uncorrelated with the input to the network.
After over-training for many hundreds of epochs, the network behaves like a Class II Cellular Automaton. The Shannon entropy of the FilterMask layers declines, indicating that the model is highly ordered and a small number of kernels dominate the evolution of the input through the network. The evolution of the CNN from chaotic to highly ordered behaviour is illustrated by some radiation patterns from model k2 in Figure 4. As the rules of the CA are updated so that a transition from class II to class III is made, there can (depending on the specifics of the CA) be a Goldilocks region in which the CA is class IV and is capable of describing complex phenomena. Similarly, as the CNN kernels are updated during training, it becomes capable of describing the complex behaviour of a parton shower close to the transition from chaos to periodicity. The goal for training a CNN capable of behaving as a parton shower should thus be to maximise the time spent exploring the kernel parameters in the transition region.

Merging CNN with Matrix Element Calculations
The trained network is evaluated on the pixel arrays produced from Sherpa matrix element events that have not previously been showered. The effect of the FilterMask layer in this case is to randomly activate filters according to the rate with which they were active during training. The output of the evaluation on un-showered events is an approximation to fully showered events. The output of the CNN is a pixel array of the same size as the input. These output arrays are converted back into lists of particle-level collision events by creating a single particle for each pixel that has a p_T value above 100 MeV. The particle p_T is the same as the pixel p_T, and the particle is emitted into a random location within the pixel.
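The conversion back to particles can be sketched as below; the 100 MeV threshold corresponds to pt_cut = 0.1 in GeV, and the function name is hypothetical.

```python
import numpy as np

def pixels_to_particles(img, pt_cut=0.1, rng=None):
    """Convert an N x N output array back to a particle list. Each pixel
    with pT above pt_cut (GeV) becomes one particle emitted at a random
    position inside that pixel. Returns a list of (y, phi, pT) tuples."""
    if rng is None:
        rng = np.random.default_rng()
    N = img.shape[0]
    dy = dphi = 2 * np.pi / N            # pixel size in y and phi
    particles = []
    for iy, iphi in zip(*np.nonzero(img > pt_cut)):
        y = -np.pi + (iy + rng.random()) * dy
        phi = (iphi + rng.random()) * dphi
        particles.append((y, phi, img[iy, iphi]))
    return particles
```

Pixels below the threshold are discarded, so the particle multiplicity of the reconstructed event is the number of above-threshold pixels.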
Care must be taken when merging parton showers with perturbative matrix element calculations in order to prevent the double-counting of emissions from both the shower and the ME into the same region of phase space. The Sherpa matrix elements used here assume a k_T-ordered parton shower, where k_T is the transverse momentum of an emitted parton relative to the emitter. Although the CNN does not currently explicitly define an ordering parameter, it most closely resembles an angular ordering. This mis-match between the ordering used in the ME and the CNN implies that a veto should be applied to the CNN to prevent emission of shower partons into phase-space regions that should be covered by the ME [20]. This shower merging is implemented as a new layer in the CNN that is added after each Conv2DTranspose level. The shower merging is only used during evaluation of the CNN on pixel arrays produced from matrix element events; it is not used during training on fully showered events.
Each application of the Conv2DTranspose filters corresponds to the generation of potential new parton emissions, which is why the shower merging veto is applied after each new application of the FilterMask and Conv2DTranspose layers. The shower veto scale evolves with depth in the CNN. The network bottleneck in the middle of the autoencoder corresponds to the widest-angle emissions, and is analogous to the shower starting scale. The shower veto scale at the bottleneck is therefore that provided by Sherpa's matrix element, in this case 20 GeV. For each step in network level away from the bottleneck, the veto scale is divided by a factor of k, the size of the convolutional kernel. Thus the later layers of the re-inflation stage, which correspond to smaller angles, use a smaller veto scale.
At each level of the re-inflation, the veto procedure is applied as follows:

• The output of the merged Conv2DTranspose bank is sub-divided into windows the same size as the k × k convolutional kernel.
• Any pixel below the veto scale for that convolutional level is left unchanged.
• The number of pixels, N_S, in each window that are above the veto scale is calculated.
• The number of pixels, N_ME, in each window of the corresponding layer on the compression side of the CNN is calculated.
• If N_ME ≥ N_S then all pixels in the output window are accepted, because no new emissions above the merging scale have been added in that region.
• If N_ME < N_S then the pixels above the merging scale in that window are replaced by those in the corresponding window of the image from the compression side of the CNN. These pixels correspond to the state of the shower prior to splitting.
• Replacement hard pixels are adjusted to account for the emission of any soft pixels below the merging scale within the same window.
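The window-level veto above can be sketched in numpy for a single k × k window. The final step (adjusting the replacement hard pixels for soft emissions) is stated only loosely in the text, so the proportional reduction used here is an assumption.

```python
import numpy as np

def veto_window(out_win, me_win, veto_scale):
    """Apply the merging veto to one k x k window (assumed reading of the
    procedure). Pixels below the veto scale are kept; if the shower has
    added hard pixels beyond those in the ME window, the hard pixels are
    replaced by the pre-splitting ME pixels, reduced to account for the
    soft (below-scale) emissions already placed in the window."""
    hard_out = out_win > veto_scale
    hard_me = me_win > veto_scale
    if hard_me.sum() >= hard_out.sum():
        return out_win                      # no new hard emissions: accept
    merged = out_win.copy()
    merged[hard_out] = 0.0                  # drop the vetoed hard pixels
    soft_pt = merged.sum()                  # soft activity kept in the window
    replacement = me_win.copy()
    replacement[~hard_me] = 0.0             # pre-splitting hard pixels
    if replacement.sum() > 0:               # adjust for the soft emissions
        replacement *= max(replacement.sum() - soft_pt, 0.0) / replacement.sum()
    return merged + replacement
```

In the model this operation runs over all windows at once as matrix manipulations inside a Keras layer; the per-window version is shown for clarity.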
This merging procedure can be carried out entirely via matrix manipulation operations, and is implemented as a Keras layer and executed on the GPU.

Jet Distributions Predicted by the CNN
The Rivet analysis framework [21] is used to compare events showered with Sherpa to events showered with the CNN, as well as ME-level events that have not been showered. A sample of eight million events produced with Herwig 7.1.4 [22][23][24][25][26] is also used as an example of an angular-ordered shower for comparison with Sherpa's k_T-ordered shower. For these events, the default Herwig tune is used with a beam energy of 6500 GeV per proton. The QCD 2 → 2 process is used and, for a direct comparison with the Sherpa samples, both hadronisation and MPI are turned off.
Analyses are performed using two different jet algorithms from the FastJet package: the anti-k_t algorithm with a radius parameter of 0.4, and the k_t algorithm [27] with a radius of 0.6. Jets must have a rapidity that satisfies |y| < 2.5 and transverse momentum that satisfies p_T > 40 GeV.
Some example emission patterns generated by the CNN models k2 and k3 are shown in Figure 5 for events that satisfy the anti-k_t R = 0.4 jet selection. The ME partons that are used to seed the CNN are shown as grey crosses. The CNN splits these partons into a shower of lower energy partons in the region of the initial parton. The CNN is also able to generate some wider-angle activity; for example, in the middle panel, model k2 has generated a jet around {y, φ} = {−π/2, π/4}.
The distribution of the number of jets in each event that satisfy the jet requirements is shown in Figure 6. The matrix element can (rarely) generate at most four partons, so events that contain more than four jets have had those jets generated by the parton shower. Model k2 produces somewhat fewer high-multiplicity events than the target Sherpa model, while model k3 produces slightly too many high-multiplicity events. It is encouraging that the two CNN models bracket the target data, which suggests that a CNN could indeed be made to describe the jet multiplicity after further adjustment and improvement.
The jet width, ρ, defined as the p_T-weighted average over jet constituents of the angular separation ΔR(j, p_i) between constituent i and the jet axis j, is a test of the shape of the radiation pattern emitted around a jet, and is shown in Figure 7 for all jets that satisfy the selection criteria. The simple CNN models do a surprisingly good job of recreating the jet shapes of the true parton shower, especially for the large-width jets. There is a deficiency in small-width jets compared to Sherpa, and an over-abundance of zero-width jets. This suggests some kind of dead-cone effect, which could be an artefact of the approximate merging procedure, or some other effect of using an angular-ordered-type shower. By way of comparison, Herwig's angular-ordered shower also displays a similar dip in the number of low-width jets and shows the range of expected differences between an angular-ordered shower and a k_T-ordered shower. The CNN models have no information about parton mass, and also have a cut-off at small angle due to the finite pixel size, both of which may affect the small-width jets to some extent.
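As a reference implementation of the observable, the jet width can be computed as below. The footnote defining ρ is garbled in the source, so the normalisation by the summed constituent p_T is an assumption (it matches the common "girth" definition).

```python
import math

def jet_width(constituents, jet_axis):
    """rho = sum_i pT_i * dR(jet, i) / sum_i pT_i (assumed normalisation).
    constituents: list of (y, phi, pT) tuples; jet_axis: (y, phi)."""
    yj, phij = jet_axis
    num = den = 0.0
    for y, phi, pt in constituents:
        dphi = abs(phi - phij)
        if dphi > math.pi:                 # wrap azimuth into [0, pi]
            dphi = 2 * math.pi - dphi
        dr = math.hypot(y - yj, dphi)      # angular separation dR
        num += pt * dr
        den += pt
    return num / den if den > 0 else 0.0
```

A jet whose entire p_T sits on the axis has ρ = 0, which is the "zero-width" population discussed above.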
Jet masses arise from the finite width of the jet, and jet mass distributions also serve as a test of the radiation emitted around a jet. The distributions of jet masses from all selected jets are shown in Figure 8. Both the k2 and k3 CNN models have generated smooth mass distributions from the input ME partons, with gradients close to those of the target Sherpa model in the tails. However, the peaks of the mass distributions do not match the target. This is not surprising because the CNN models do not contain any information about mass and do not trace the parton masses through the network; jet masses arise only from the angular width of the jets. Furthermore, the existence of massive b and c quarks can be seen in the Sherpa mass distribution as the small spikes at around 4.5 and 1.7 GeV, respectively. Since the CNN does not include any mass term for the partons (or pixels), it cannot reproduce these spikes. Again, the Herwig shower is shown as an example of the differences that can be expected between angular-ordered and k_T-ordered showers, in particular in the low mass region. Finally, the transverse momentum (p_T) distributions of all jets that satisfy the selection criteria are shown in Figure 9.

Features learned by the CNN
Each FilterMask layer corresponds to a different angular scale, with deeper layers (those closer to the bottleneck) representing larger angles. The probabilities stored in the FilterMask layers are the activation rates of the different filters. Figure 10 shows the evolution of the filter activation rates with the angular scale. The filters are labelled feature 1 to F, where F is the number of filters in the model; feature 1 is chosen as the filter with the largest activation rate at small angles, while feature F is the filter with the smallest activation rate at small angles. The activation probabilities exhibit some interesting behaviour that is suggestive of interactions between the different filters. For example, model k2 appears to show bands of filters having similar activation rates, while model k3 also shows apparent correlations in the activation rates at different scales.
The features that each filter in the model encodes are revealed by fixing the model weights so that only a single pair of Conv2D and Conv2DTranspose filters is active. A randomly generated pixel array is input to this sub-model and then updated using the output of that model. The update is repeated twenty times, until a stable pixel array is converged upon. This iterative procedure is itself repeated using one hundred different random starting arrays, and the average is taken of the results.
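The iterative feature-extraction procedure can be sketched generically; here `step` is a hypothetical stand-in for one pass of the single-filter sub-model.

```python
import numpy as np

def stable_pattern(step, n, n_iter=20, n_seeds=100, seed=0):
    """Average fixed-point pattern of a single-filter sub-model.
    `step` maps an n x n array to an n x n array (a stand-in for passing
    the array through one Conv2D/Conv2DTranspose filter pair)."""
    rng = np.random.default_rng(seed)
    acc = np.zeros((n, n))
    for _ in range(n_seeds):
        img = rng.random((n, n))      # random starting pixel array
        for _ in range(n_iter):
            img = step(img)           # iterate towards a stable array
        acc += img
    return acc / n_seeds              # average over the random starts
```

Averaging over the random starts washes out seed-dependent detail and leaves the pattern the filter pair itself encodes.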
The features encoded by the nine filters in model k2 are shown in figure 11, and the seven features encoded by the filters of model k3 are shown in figure 12. All the features exhibit self-similarity, which is a result of the use of the same filter at multiple angular scales. The features encoded in model k3 are more complex than those in model k2 due to the larger kernel size of the former.

Concluding Remarks
We have demonstrated that it is possible to encapsulate many of the features of a QCD parton shower in an autoencoding convolutional neural network, and that the number of trainable network parameters needed to do so is not large. The network design is inspired by self-similarity and wavelet decomposition, which significantly reduces the number of network parameters. The CNN learns features from jet events without needing any human supervision to classify the events or objects within them. Convolutional neural networks have often been used as a tool for event or jet classification [28][29][30], but here we have shown that they may also be used to learn features from QCD directly, with minimal assumptions about jet behaviour. Indeed, jets themselves are but a social construct, and using concepts like quark/gluon tagging as a target for neural networks imposes human (mis-)concepts and biases onto the data. The basis set of features learned during the course of training a jet tagger is in many ways more useful than the tagger itself. Here we have attempted to ask the more general question "what features may be learned from QCD?" Convolutional networks using images have been criticised as a tool for QCD for using a large number of learned network weights; for example, [31] instead uses a recursive neural network (RNN) directly on particle four-vectors. We have shown that a large number of learned CNN parameters is not necessary to describe QCD, and that self-similarity can be encoded even without a recursive network. The RNN and the autoencoding CNN use similar ideas to capture the behaviour of QCD, though with slightly different aims. The CNN has the advantage that it does not need a target jet algorithm model (indeed it does not need jets at all), while the RNN has the advantage of operating directly on particle four-vectors.
This method also bears some comparison with shower deconstruction [32]. In shower deconstruction, a simplified parton shower model is used to estimate the probability that a given parton configuration originates from either signal or background. The probability is estimated by evaluating the history of the splitting terms that lead to the final parton configuration. Similarly, the compression stage of the autoencoder matches a series of learned kernels to the input parton configuration, taking the best match at each stage via the max-pooling operation. In this way, the autoencoder tests, in parallel, a very large number of possible shower histories. As with shower deconstruction, an autoencoder trained on QCD background data could be used to discriminate between background and signal, because a non-QCD signal will not match the learned shower behaviour.
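The discrimination idea can be sketched with a reconstruction-error anomaly score. Here `autoencode` is a hypothetical stand-in for the trained network (for illustration, a smoothing filter, so a smooth "background-like" image reconstructs well while an input that does not match the learned behaviour reconstructs poorly):

```python
import numpy as np

rng = np.random.default_rng(1)

def autoencode(x):
    """Hypothetical stand-in for the trained autoencoder: a periodic
    smoothing filter that reconstructs smooth inputs accurately."""
    return (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
              + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0

def anomaly_score(x):
    """Mean squared reconstruction error: large when the input does not
    match the behaviour the network has learned."""
    return float(np.mean((autoencode(x) - x) ** 2))

smooth = np.outer(np.hanning(16), np.hanning(16))  # background-like image
noise = rng.random((16, 16))                       # mismatched input
print(anomaly_score(smooth) < anomaly_score(noise))  # -> True
```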
Though far from perfect, the simplified models used here are able to capture qualitatively the behaviour of the Sherpa target on which they are trained. The models could equally have been trained directly on appropriate jet data from the Large Hadron Collider, with the proviso that some important non-perturbative effects (hadronisation and MPI) were intentionally left out here. This raises the interesting possibility that, with further improvement, particularly by adding a mass term, deep neural networks can provide a new way of learning about QCD that allows for data-driven background models that do not use any assumptions beyond the basic network structure.

Figure 2. Overview of the autoencoding CNN. An input event image enters the network on the left and is repeatedly processed by a bank of F Conv2D filters. Having been completely compressed, the event is reinflated using Conv2DTranspose layers together with a special FilterMask layer.

2.4 Model Parameters

Having provided the general network structure, two sets of hyper-parameters define two concrete implementations of the CNN model. Model k2 uses a kernel size of k = 2, and model k3 uses a kernel size of k = 3. The max-pooling and up-scaling used by the model require that the input pixel array dimensions, N × N, obey the rule N = k^n, where n is an integer. Model k2 is defined using input pixel arrays of size 64 × 64, and model k3 uses inputs of size 81 × 81. The kernel and input array size together define the number of levels of convolution that the model performs; model k2 is a narrower but deeper model with more levels of decomposition, while model k3 uses a wider kernel but fewer levels of decomposition. The model hyper-parameters and other details are given in table 1.
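The constraint N = k^n fixes the number of decomposition levels; a quick check of the two configurations (a minimal sketch):

```python
import math

def n_levels(N, k):
    """Number of convolution/max-pool levels for an N x N input with
    kernel (and pooling) size k, requiring N = k**n exactly."""
    n = round(math.log(N) / math.log(k))
    if k ** n != N:
        raise ValueError(f"input size {N} is not a power of kernel size {k}")
    return n

print(n_levels(64, 2))  # model k2: 64 = 2**6 -> 6 levels
print(n_levels(81, 3))  # model k3: 81 = 3**4 -> 4 levels
```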

Figure 3. Evolution of the model during training. a), left, shows how the loss function evaluated on the training and validation event samples changes during training, while b), right, shows the evolution of the Shannon entropy of the FilterMask layer.

Figure 4. Example emission patterns produced by an untrained (chaotic) CNN (a) and an overtrained (periodic) CNN (b). The input to the CNN was the same in both cases.

Figure 4 a) is the output of the untrained model k2 when two random partons are used to seed the CNN. The output in figure 4 a) is chaotic: there is no pattern, and the output is uncorrelated with the input.

Figure 4 b) shows the output from the same model k2 using the same input when the model is over-trained by several hundred epochs. The model output has become sparse, with well-ordered structures that are repeated over different angular scales.

Figure 5. Three example emission patterns generated by the k2 (left column) and k3 (right column) CNN models. The coloured circles indicate the locations of emitted particles, and the sizes and colours of the circles indicate the particle p_T. The grey Xs in each plot are the matrix element emissions that are input to the CNN.

Figure 10. Evolution of the feature activation rates with angular scale for model k2 (left) and model k3 (right).

Table 1. Model hyper-parameters, where λ is a multiplier that controls the strength of the regularisation and the summation is over all of the kernel weights in the Conv2D layer. No regularisation is applied to the

Number of jets per event using the anti-k_t R=0.4 jet algorithm (left) and the k_t R=0.6 algorithm (right).
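The regularisation term described above is a standard L1 penalty on the kernel weights; a minimal numpy sketch (the λ value and kernel below are invented for illustration):

```python
import numpy as np

def l1_penalty(kernel_weights, lam=1e-4):
    """L1 regularisation term, lam * sum(|w|) over the kernel weights.
    Added to the loss, it drives unneeded weights towards zero."""
    return lam * float(np.sum(np.abs(kernel_weights)))

weights = np.array([[0.5, -0.25], [0.0, 1.25]])  # an example 2 x 2 kernel
print(l1_penalty(weights))  # -> 0.0002 (lam * 2.0)
```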