Introduction

Precisely measuring nature’s fundamental parameters and discovering new elementary particles in modern high energy physics is only made possible by our deep mathematical understanding of the Standard Model and our ability to reliably simulate interactions of these particles with complex detectors. While essential for our scientific progress, the production of these simulations is increasingly costly. This cost is already a potential bottleneck at the LHC, and the problem will be exacerbated by higher luminosity, larger amounts of pile-up and more complex and granular detectors at the high-luminosity LHC and planned future colliders. A promising way to accelerate the simulation is offered by generative machine learning models and was pioneered in Ref. [1]. The present work focuses on simulating a very high-resolution calorimeter prototype with greater fidelity of physically relevant distributions, paving the road for practical applications.

Advanced machine learning methods, based on deep neural networks, are rapidly transforming and improving the way to explore the fundamental interactions of nature in particle physics—see for example Ref. [2] for a recent overview of neural network architectures developed to identify hadronically decaying top quarks. However, we are only beginning to explore the potential benefits from unsupervised techniques designed to model the underlying high-dimensional density distribution of data. This allows, e.g., anomaly detection algorithms to identify signals from new physics theories without making specific model assumptions [3,4,5,6,7,8,9,10,11,12]. Furthermore, once the phase space density is encoded in a neural network, it can be sampled from very efficiently. This makes synthetic models of particle interactions many orders of magnitude faster than classical approaches, where for example for a particle showering in a calorimeter many secondary shower particles have to be created and individually tracked through the material of the detector according to the underlying physics processes.

Calorimeters are a crucial part of experiments in high energy physics, where the incident primary particles create showers of secondary particles in dense materials that are used to measure the energy. In sandwich calorimeters, layers of dense materials are interleaved with sensitive layers recording energy depositions from secondary shower particles, mostly from ionization. The details of the shower development via creation of secondary particles, as well as their energy loss, are typically simulated with great accuracy using the Geant4 [13] toolkit.

The crucial role of calorimeter simulation as a time-consuming bottleneck in the simulation chain at the LHC is well established. For example, the ATLAS experiment uses more than half of its total CPU time on the LHC Computing Grid for Monte Carlo simulation, which in turn is entirely dominated by the calorimeter simulation [14].

While generative neural network techniques promise enormous speed-ups for simulating the calorimeter response, it is of extreme importance that all relevant physical shower properties are reproduced accurately in great detail. This is particularly challenging for highly granular calorimeters, with a much higher spatial resolution, foreseen for most future colliders. Such concepts, as developed for the International Linear Collider (ILC), are also being used to upgrade detectors at the LHC for upcoming data-taking periods. One prominent example is the calorimeter endcap upgrade of the CMS experiment [15] with about 6 million readout channels. These factors make the timely development of precise simulation tools for high-resolution detectors relevant and motivate our investigation of a prototype calorimeter for the International Large Detector (ILD).

Outside of particle physics, generative adversarial neural networks [16] (GANs) have been used to produce synthetic data—such as photo-realistic images [17]—with great success. A traditional GAN consists of two networks, a generator and a discriminator separating artificial samples from real ones, which are trained against each other. An alternative to GANs for simulation are variational autoencoders [18] (VAE). A VAE consists of an encoder mapping from input data to a latent space, and a decoder, which maps from the latent space to data. If the probability distribution in latent space is known, it can be sampled from and used to generate synthetic data. A third path towards generative models is offered by normalizing flows [19,20,21,22,23]. In such models, a simple base probability distribution is transformed by a series of invertible mappings into a complex shape.

Recently, a novel architecture unifying several generative models such as GANs, VAEs, and others was proposed: the Bounded-Information-Bottleneck autoencoder (BIB-AE) [24]. We will show that by using a modified BIB-AE for generation we can accurately model all tested relevant physics distributions to a higher degree than achieved by traditional GANs. A detailed introduction to this architecture is provided in Sect. 3.3.

Specifically in particle physics, first results for the simulation of calorimeters focused on GANs achieved an impressive speed-up by up to five orders of magnitude compared to Geant4 [1, 25, 26]. Similarly, an approach using a Wasserstein-GAN (WGAN) architecture achieved realistic modeling of particle showers in air-shower detectors [27] and a high granularity sampling calorimeter [28]. In the context of future colliders, an architecture inspired by GANs was used for the fast simulation of showers in a high granularity electromagnetic calorimeter [29]. Generative models based on VAE and WGAN architectures were studied for concrete application by the ATLAS collaboration [30,31,32].

Beyond producing calorimeter showers, generative models in HEP have also been explored for modeling muon interactions with a dense target [33], parton showers [34,35,36,37], phase space integration [38,39,40,41], event generation [42,43,44,45,46,47], event subtraction [48] and unfolding [49].

The rest of this paper is organised as follows: in Sect. 2 we introduce the concrete problem and training data, in Sect. 3 the used generative architectures are discussed, and in Sect. 4 the obtained results are presented and compared. Finally, Sect. 5 provides conclusions and outlook.

Data Set

Fig. 1: A simulated 60 GeV photon shower in the ILD detector, as used in the training data

The ILD [50] detector is one of two detector concepts proposed for the ILC. It is optimized for Particle Flow, an algorithm that aims at reconstructing every individual particle in order to optimize the overall detector resolution. ILD combines high-precision tracking and vertexing capabilities with very good hermeticity and highly granular electromagnetic and hadronic calorimeters. For this study, one of the two proposed electromagnetic calorimeters for ILD, the Si-W ECal, is chosen. It consists of 30 active silicon layers in a tungsten absorber stack, with 20 absorber layers of \(2.1 \, \text {mm}\) thickness followed by ten layers of \(4.2\, \text {mm}\) thickness. The silicon sensors have \(5\times 5\, \text {mm}^2\) cell sizes. Throughout this work, we project the sensors onto a rectangular grid of \(30\times 30\times 30\) cells. Each cell in this grid corresponds to exactly one sensor. As the underlying geometry of sensors in a realistic calorimeter prototype is not exactly regular, we will encounter some effects of this staggering. This makes the learning task more challenging for the network, but does not pose a fundamental problem. Architectures that more accurately encode irregular calorimeter geometries in neural networks exist [51], but are not the focus of this work.

Fig. 2: Overlay of 2000 projections of 50 GeV Geant4 photon showers along the y direction

ILD uses the iLCSoft [52] ecosystem for detector simulation, reconstruction and analysis. For the full simulation with Geant4, a detailed and realistic detector model implemented in DD4hep [53] is used. The training data of photon showers in the ILD ECal are simulated with Geant4 version 10.4 (with the QGSP_BERT physics list) and DD4hep version 1.11. The photons are shot at perpendicular incident angle into the ECal barrel with energies uniformly distributed between 10 and 100 GeV. All incident photons are aimed at the \(x{-}y\) center of the grid—i.e., at the point in the middle between the four most central cells of the front layer. An example event display showing such a photon shower is depicted in Fig. 1.

The incoming photon enters from the bottom at \(z=0\) and traverses along the z-axis, hitting cells in the center of the \(x{-}y\) plane. No variations of the incident angle and impact point are performed in this study. The overlay of 2000 showers summed over the y-axis is shown in Fig. 2. As can be seen, the cells in the ILD ECal are staggered due to the specific barrel geometry. The whole training data set consists of 950k showers with continuous energies between 10 and 100 GeV. For the evaluations we generated additional, statistically independent sets of events: 40k events uniformly distributed between 10 and 100 GeV, and 4k events each at discrete energies in steps of 10 GeV between 20 and 90 GeV.

Generative Models

Generative models are designed to learn an underlying data distribution in a way that allows later sampling and thereby producing new examples. In the following, we first present two approaches—GAN and WGAN—which represent the state-of-the-art in generating calorimeter data and which we use to benchmark our results. We then introduce BIB-AE as a novel approach to this problem and discuss further refinement methods to improve the quality of generated data.

Generative Adversarial Network

The GAN architecture was proposed in 2014 [16] and has had remarkable success in a number of generative tasks. It trains generative models via an adversarial process, in which a generator G competes against an adversary (or discriminator) D. The goal of this framework is to train G to generate samples \(\widetilde{x}=G(z)\) from noise z that are indistinguishable from real samples x. The adversary network D is trained to maximize the probability of correctly classifying whether or not a sample came from real data, using the binary cross-entropy. The generator, on the other hand, is trained to fool the adversary D. This is represented by the loss function as

$$\begin{aligned} \begin{aligned} L = \min _G \max _D \mathbb {E}[\log D(x)] + \mathbb {E}[\log (1-D(G(z)))], \end{aligned} \end{aligned}$$
(1)

and a schematic of the GAN training is provided in Fig. 3 (top).
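The two objectives in Eq. (1) can be sketched as a Monte-Carlo estimate over a batch of discriminator outputs. The following NumPy sketch is illustrative only (the function name and clipping constant are our choices, and real implementations operate on network outputs inside an autodiff framework):

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-7):
    """Estimate the two players' objectives in Eq. (1), given
    discriminator outputs d_real = D(x) and d_fake = D(G(z)) in (0, 1)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    # Value function E[log D(x)] + E[log(1 - D(G(z)))]:
    value = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
    # The discriminator ascends on `value` (so its loss is -value);
    # the generator descends on the second term only.
    return -value, np.mean(np.log(1.0 - d_fake))
```

In practice many implementations replace the generator term by the non-saturating variant \(-\mathbb{E}[\log D(G(z))]\) to improve gradients early in training; the minimax form above follows Eq. (1) literally.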

Fig. 3: Overview of the GAN (top) and WGAN (bottom) architectures. The blue line shows where the true energy is used as an input. The loss functions and feedback loops are explained in the text

For practical applications, the GAN needs to simulate showers of a specific energy. To this end, we parameterise generator and discriminator as functions of the photon energy E [54]. In general, we attempted to minimally modify the CaloGAN formulation [26] to work with the present dataset.

The original formulation of a GAN produces a generator that minimizes the Jensen–Shannon divergence between true and generated data. In general, the training of GANs is known to be technically challenging and subject to instabilities [55]. Recent progress on generative models improves upon this by modifying the learning objective.

Wasserstein-GAN

One alternative to classical GAN training is to use the Wasserstein-1 distance, also known as earth mover’s distance, as a loss function. This distance evaluates dissimilarity between two multi-dimensional distributions and informally gives the cost expectation for moving a mass of probability along optimal transportation paths [56]. Using the Kantorovich-Rubinstein duality, the Wasserstein loss can be calculated as

$$\begin{aligned} L = \text {sup}_{f\in \text {Lip}_1}\{\mathbb {E}[f(x)] - \mathbb {E}[f(\tilde{x})]\}. \end{aligned}$$
(2)

The supremum is taken over all 1-Lipschitz functions f; in practice, f is approximated by a discriminator network D during the adversarial training. This discriminator is called a critic, since it is trained to estimate the Wasserstein distance between real and generated images.

In order to enforce the 1-Lipschitz constraint on the critic [57], a gradient penalty term is added to (2), yielding the critic loss function:

$$\begin{aligned} L_{\text {Critic}}&= \mathbb {E}[D(G(z))] - \mathbb {E}[D(x)] \nonumber \\&\quad + \lambda \,\, \mathbb {E}[(\parallel \nabla _{\hat{x}}D(\hat{x})\parallel _2 - 1)^2 ], \end{aligned}$$
(3)

where \(\lambda \) is a hyperparameter for scaling the gradient penalty. The term \(\hat{x}\) is a mixture of real data x and generated G(z) showers. Following [57], it is sampled uniformly along linear interpolations between x and G(z).
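The structure of the critic loss in Eq. (3) can be sketched as follows. In a real framework the gradient norms come from automatic differentiation; here they are passed in as precomputed values, and the function names and the default \(\lambda = 10\) (the value suggested in the WGAN-GP paper, not stated here) are our assumptions:

```python
import numpy as np

def critic_loss(d_fake, d_real, grad_norms, lam=10.0):
    """Wasserstein critic loss with gradient penalty, Eq. (3).
    d_fake = D(G(z)) and d_real = D(x) are critic scores; grad_norms
    holds ||grad_{x_hat} D(x_hat)||_2 for each interpolated sample."""
    return (np.mean(d_fake) - np.mean(d_real)
            + lam * np.mean((grad_norms - 1.0) ** 2))

def interpolate(x, gz, rng=np.random.default_rng(0)):
    """Sample x_hat uniformly along straight lines between real showers x
    and generated showers G(z), one mixing factor per batch element."""
    eps = rng.uniform(size=(x.shape[0],) + (1,) * (x.ndim - 1))
    return eps * x + (1.0 - eps) * gz
```

When the critic is exactly 1-Lipschitz (all gradient norms equal to one), the penalty vanishes and only the Wasserstein estimate remains.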

Finally, we again need to ensure that generated showers accurately resemble photons of the requested energy. We achieve this by parameterising the generator and critic networks in E and by adding a constrainer [28] network a. The loss function for the generator then reads:

$$\begin{aligned} L_{\text {Generator}}&= -\mathbb {E}[D(\tilde{x},E)] \nonumber \\&\quad + \kappa \cdot \mathbb {E}[\left| (a(\tilde{x}) - E)^2 - (a(x) - E)^2 \right| ], \end{aligned}$$
(4)

where \(\tilde{x}\) are generated showers and \(\kappa \) is the relative strength of the conditioning term. This combined network is illustrated in Fig. 3. The constrainer network is trained solely on the Geant4 showers; its weights are fixed during the generator training. We use the mean absolute error (L1) as loss:

$$\begin{aligned} L_\text {Constrainer} = \left| E - a(x)\right| . \end{aligned}$$
(5)
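The generator and constrainer objectives of Eqs. (4) and (5) can be sketched in the same style, with network outputs passed in as arrays (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def generator_loss(d_fake_scores, a_fake, a_real, E, kappa=1.0):
    """WGAN generator loss with energy-conditioning term, Eq. (4).
    a_fake = a(x_tilde) and a_real = a(x) are the (frozen) constrainer's
    energy estimates; E is the requested incident energy."""
    cond = np.mean(np.abs((a_fake - E) ** 2 - (a_real - E) ** 2))
    return -np.mean(d_fake_scores) + kappa * cond

def constrainer_loss(E, a_real):
    """L1 loss used to pre-train the constrainer on Geant4 showers, Eq. (5)."""
    return np.mean(np.abs(E - a_real))
```

Note that the conditioning term compares the constrainer's squared error on generated showers against its squared error on real showers, so the generator is only penalized for being less consistent with E than Geant4 itself.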

Bounded Information Bottleneck-Autoencoder

Fig. 4: Diagram of the BIB-AE architecture, including the additional MMD term defined in Sect. 3.4 and the Post Processor Network defined in Sect. 3.5. The blue line shows where the true energy is used as an input. The loss functions and feedback loops are explained in the text

Autoencoder architectures map input to output data via a latent space. Using a structured latent space allows for later sampling and thereby generation of new data. The BIB-AE [24] architecture was introduced as a theoretical overarching generative model. Most commonly employed generative models—e.g. GAN [16], VAE [18], and adversarial autoencoder (AAE) [58]—can be seen as different subsets of the BIB-AE. This leads to better control over the latent space distributions and promises better generative performance and interpretability. In the following, we focus on the practical advantage gained from utilizing the individual BIB-AE components and refer to the original publication [24] for an information-theoretical discussion.

As it is an overarching model, an instructive way of describing the base BIB-AE framework is by taking a VAE and expanding upon it. A default VAE consists of four general components: an encoder, a decoder, a latent space regularized by the Kullback–Leibler divergence (KLD), and an \(L_N\)-norm to determine the difference between the original and the reconstructed data. These components are all present in the BIB-AE setup as well. Additionally, one introduces a GAN-like adversarial network, trained to distinguish between real and reconstructed data, as well as a sampling-based method of regularizing the latent space, such as another adversarial network or a maximum mean discrepancy (MMD, as described in the next section) term. In total this adds up to four loss terms: the KLD on the latent space, the sampling regularization on the latent space, the \(L_N\)-norm on the reconstructed samples and the adversary on the reconstructed samples. The guiding principle is that the two latent space losses and the two reconstruction losses complement each other and, in combination, allow the network to learn a more detailed description of the data. Looking specifically at the two reconstruction terms, we have, on the one hand, the adversarial network: from tests on utilizing GANs for shower generation we know that such adversarial networks are uniquely qualified to teach a generator to reproduce realistic looking individual showers. On the other hand, we have the \(L_N\)-norm: while our trials with pure VAE setups have shown that \(L_N\)-norms have great difficulty capturing the finer structures of the electromagnetic showers, an \(L_N\)-norm also forces the encoder-decoder structure to have an expressive latent space, as the original images could not be reconstructed without any latent space information.
Therefore, the adversarial network forces the individual images to look realistic, while the \(L_N\)-norm forces latent space utilization, thereby improving how well the overall properties of the data set are reproduced. The latent space loss terms have a similar interaction. Here the KLD term regularizes our complete latent space by reducing the difference between the average latent space distribution and a normal Gaussian. The KLD is, however, largely blind to the shape of the individual latent space dimensions, as it only cares about the average. The sampling based latent space regularization term fills this niche by looking at every latent space dimension individually.
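For a diagonal-Gaussian encoder, as in a standard VAE, the KLD term against a standard normal prior has a closed form. The sketch below assumes the encoder returns per-dimension means and log-variances; whether the BIB-AE implementation uses exactly this parameterization is our assumption:

```python
import numpy as np

def kld_gaussian(mu, logvar):
    """Closed-form KL divergence KL(N(mu, exp(logvar)) || N(0, 1)),
    summed over latent dimensions and averaged over the batch.
    mu and logvar have shape (batch, latent_dim)."""
    kld = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return np.mean(np.sum(kld, axis=-1))
```

As described above, this term only constrains the aggregate latent distribution toward a Gaussian; the sampling-based regularization acts on each latent dimension individually.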

Our specific implementation of the BIB-AE framework is shown in Fig. 4. For our sampling based latent regularization we use both an adversary and an MMD term. The adversaries are implemented as critics trained with gradient penalty, similar to the WGAN approach. The main difference in our setup compared to the one described in [24] is that we replaced the \(L_N\)-norm with a third critic, trained to minimize the difference between input and reconstruction. We chose this because we found that using the \(L_N\)-norm to compare the input and the reconstructed output resulted in smeared out images.

For the precise implementation of the loss functions we define the encoder network N, the decoder network D, the latent critic \(C_L\), the critic network C, and the difference critic \(C_D\). The loss function for the latent critic \(C_L\) is given by

$$\begin{aligned} L_{C_{L}}&= \mathbb {E}[C_L(N_{E}(x))] - \mathbb {E}[C_L(\mathcal {N}(0,1))] \nonumber \\&\quad + \lambda \,\, \mathbb {E}[(\parallel \nabla _{\hat{x}}C_L(\hat{x})\parallel _2 - 1)^2 ]. \end{aligned}$$
(6)

Here \(\hat{x}\) is a mixture of the encoded input image \(N_{E}(x)\) and samples from a normal distribution \(\mathcal {N}(0,1)\); the E subscript indicates that the network receives the photon energy label as an additional input. The loss function for the main critic C is given by

$$\begin{aligned} L_{C}&= \mathbb {E}[C_{E}(D_{E}(N_{E}(x)))] - \mathbb {E}[C_{E}(x)] \nonumber \\&\quad + \lambda \,\, \mathbb {E}[(\parallel \nabla _{\hat{x}}C_{E}(\hat{x})\parallel _2 - 1)^2 ]. \end{aligned}$$
(7)

Here \(\hat{x}\) is a mixture of the reconstructed image \(D_{E}(N_{E}(x))\) and the original images x. Finally, the loss function for the difference critic \(C_D\) is given by

$$\begin{aligned} L_{C_D}&= \mathbb {E} [C_{D,E}(D_{E}(N_{E}(x)) - x)] - \mathbb {E}[C_{D,E}(x - x=0)] \nonumber \\&\quad + \lambda \,\, \mathbb {E}[(\parallel \nabla _{\hat{x}}C_{D,E}(\hat{x})\parallel _2 - 1)^2 ]. \end{aligned}$$
(8)

Here \(\hat{x}\) is a mixture of the difference \(D_{E}(N_{E}(x)) - x\) and the zero difference \(x - x = 0\). With different \(\beta \) factors giving the relative weights of the individual loss terms, the combined loss for the encoder and decoder parts of the BIB-AE can be expressed as:

$$\begin{aligned} L_{\text {BIB-AE}}&= - \beta _{C_L} \cdot \mathbb {E}[C_{L}(N_{E}(x))] \nonumber \\&\quad - \beta _{C} \cdot \mathbb {E}[C_{E}(D_{E}(N_{E}(x)))] \nonumber \\&\quad - \beta _{C_D} \cdot \mathbb {E}[C_{D,E}(D_{E}(N_{E}(x)) - x)] \nonumber \\&\quad + \beta _{\text {KLD}} \cdot \text {KLD}(N_{E}(x)) \nonumber \\&\quad + \beta _{\text {MMD}} \cdot \text {MMD}(N_{E}(x),\mathcal {N}(0,1)). \end{aligned}$$
(9)

Maximum Mean Discrepancy

Fig. 5: Examples of individual 50 GeV photon showers generated by Geant4 (left), the GAN (center left), WGAN (center right), and BIB-AE (right) architectures. Colors encode the deposited energy per cell

One major challenge in generating realistic photon showers is the spectrum of the individual cell energies, which is shown in Fig. 6 (left) in Sect. 4. The real spectrum shows an edge around the energy that a minimum ionizing particle (MIP) would deposit. Since the well-defined energy deposition of a MIP is often used to calibrate a calorimeter, we cannot simply ignore it. However, we found that purely adversarial methods tend to smooth out this and other similar low-energy features, an observation in line with other efforts to use generative networks for shower simulation [28]. A way of dealing with this is to use the MMD [59] to compare and minimize the distance between the real \((D_{R})\) and fake \((D_{F})\) hit-energy distributions:

$$\begin{aligned} \begin{aligned} \text {MMD}(D_{R}, D_{F}) = \langle k(x,x') \rangle + \langle k(y,y') \rangle - 2\langle k(x,y) \rangle , \end{aligned} \end{aligned}$$
(10)

where x and y are samples drawn from \(D_{R}\) and \(D_{F}\) respectively and k is any positive definite kernel function. MMD based losses have previously been used in the generation of LHC events [46].

A naive implementation of the MMD would be to compare every pixel value from a real shower with every value from a generated shower. This approach is however not feasible, since it would involve computing Eq. (10) approximately \((30^{3})^{2}\) times for each shower. To make the MMD calculation tractable, we introduce a novel version of the MMD, termed Sorted-Kernel-MMD. We first sort both the real and generated hit energies in descending order, and then take the n highest fake energies and compare them to the n highest real energies. Following this, we move the n-sized comparison window by m and recompute the MMD. This process is repeated \(\frac{N}{m}\) times, where N is the total number of pixels one wants to compare. The advantage of this approach is two-fold: first, the number of computations is linear in N, as opposed to the quadratic scaling of the naive implementation; second, energies are only compared to similar values, thereby incentivising the model to fine-tune the energy. Specifically, the values \(m = 25\) and \(n = 100\) are used, and we choose \(N = 2000\), as this is approximately the maximum occupancy observed in our training data before any low-energy cutoffs. In our experiments, adding this MMD term with the kernel function

$$\begin{aligned} k(x,x^{\prime }) = e^{- \alpha (x^{2} + x^{\prime 2} - 2 x x^{\prime })} \end{aligned}$$
(11)

with \(\alpha = 200\) to the loss term of either a GAN or a BIB-AE fixes the per-cell hit-energy spectrum to be near identical to the training data. This, however, comes at a price: the additional pixels with the energies used to fix the spectrum are often placed in unphysical locations, specifically at the edges of the \(30 \times 30 \times 30\) cube.
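The Sorted-Kernel-MMD can be sketched directly from Eqs. (10) and (11). The following NumPy version uses the biased (all-pairs) MMD estimator; the handling of partial windows at the end of the sorted arrays is our choice:

```python
import numpy as np

def kernel(x, y, alpha=200.0):
    """Gaussian kernel of Eq. (11), evaluated on all pairs of two 1D arrays."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-alpha * d2)

def mmd(x, y, alpha=200.0):
    """Biased estimator of Eq. (10): <k(x,x')> + <k(y,y')> - 2<k(x,y)>."""
    return (kernel(x, x, alpha).mean() + kernel(y, y, alpha).mean()
            - 2.0 * kernel(x, y, alpha).mean())

def sorted_kernel_mmd(real, fake, n=100, m=25, N=2000, alpha=200.0):
    """Sorted-Kernel-MMD: sort both hit-energy arrays in descending order,
    then sum the MMD over n-sized windows shifted in steps of m, so that
    only similar energies are compared (N/m windows in total)."""
    r = np.sort(np.asarray(real))[::-1][:N]
    f = np.sort(np.asarray(fake))[::-1][:N]
    total = 0.0
    for start in range(0, N, m):
        total += mmd(r[start:start + n], f[start:start + n], alpha)
    return total
```

Because each window holds n entries and there are N/m windows, the cost scales linearly in N instead of quadratically in the number of cells.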

Post Processing

In the previous section we found that using an MMD term in the loss function represents a trade-off between correctly reproducing either the hit-energy spectrum or the shower shape. To solve this, we split the problem into two networks that are applied consecutively but trained with different loss functions. The first network is a GAN or BIB-AE trained without the MMD term. This produces showers with correct shapes, but an incorrect hit-energy spectrum. The second network then takes these showers as its input and applies a series of convolutions with kernel size one. This second network can therefore only modify the values of existing pixels, but not easily add or remove pixels. This second network, here called the Post Processor Network, is trained using only the MMD term, to fix the hit-energy spectrum, and the mean squared error (MSE) between the input and output images, ensuring that the change from the Post Processor Network is as minimal as possible.
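The restriction imposed by kernel-size-one convolutions can be illustrated in NumPy: such a layer applies the same channel-mixing transformation to every cell independently, so it can rescale existing cell energies but cannot move energy between neighbouring cells. A single linear layer is shown; the actual Post Processor stacks several such convolutions with non-linearities, and this sketch is ours, not the authors' code:

```python
import numpy as np

def conv1x1x1(shower, w, b):
    """Apply a kernel-size-one 3D convolution to a (C_in, X, Y, Z) shower.
    Every cell is transformed by the same (C_out, C_in) matrix w plus bias b,
    so spatially the output at each voxel depends only on that voxel."""
    return np.einsum('oc,cxyz->oxyz', w, shower) + b[:, None, None, None]
```

With an identity weight matrix and zero bias the shower passes through unchanged, which makes explicit why this network can correct the hit-energy spectrum without distorting the learned shower shape.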

Results

In the following we present the ability of our generative models to accurately reproduce a number of per-shower variables as well as global observables, and analyse the achievable gain in computing performance. We include our implementation of a simple GAN (Sect. 3.1), a WGAN with an additional energy constrainer (Sect. 3.2), and a BIB-AE with energy-MMD and post processing (Sects. 3.3, 3.4 and 3.5). A detailed discussion of the architectures and training hyperparameters can be found in Appendix A. All architectures are trained on the same sample of 950k Geant4 showers. Tests are either shown for the full momentum range (labeled full spectrum) or for specific shower energies (labeled with the incident photon energy in GeV).

Physics Performance

Fig. 6: Differential distributions comparing the per-cell energy (left) and the number of hits above 0.1 MeV (right) between Geant4 and the different generative models. Shown are Geant4 (grey, filled), our GAN setup (blue, dashed), our WGAN (red, dotted) and the BIB-AE (green, solid). The energy per-cell is measured in MeV for the bottom axis and in multiples of the expected energy deposit of a minimum ionizing particle (MIP) for the top axis

We first verify in Fig. 5 that the showers generated by all network architectures visually appear to be acceptable compared to Geant4. Were we attempting to generate cute cat pictures, our work would be done already at this point. Alas, these shower images are eventually to be used as realistic substitutes in physics analyses so we need to pay careful attention to relevant differential distributions and correlations.

Fig. 7: Additional differential distributions comparing physical observables between Geant4 and the different generative models. Shown are Geant4 (grey, filled), our GAN setup (blue, dashed), our WGAN (red, dotted) and the BIB-AE with Post Processing (green, solid)

Fig. 8: Plot of mean (\(\mu _{90}\), left) and relative width (\(\sigma _{90}/\mu _{90}\), right) of the energy deposited in the calorimeter for various incident particle energies. In order to avoid edge effects, the phase space boundary regions of 10 and 100 GeV are removed for the response and resolution studies. In the bottom panels, the relative offset of these quantities with respect to the Geant4 simulation is shown

In Fig. 6 a comparison between two differential distributions for all studied architectures and Geant4 is shown. The left plot compares the per-cell hit-energy spectrum averaged over showers for the full spectrum of photon energies. We observe that while the high-energy hits are well described by all generative models, both GAN and WGAN fail to capture the bump around 0.2 MeV. The BIB-AE is able to replicate this feature thanks to the Post Processor Network. This energy corresponds to the most probable energy loss of a MIP passing a silicon sensor of the ILD Si-W ECal at perpendicular incident angle. Since this is a well-defined energy, it can be used in highly granular calorimeters for the equalisation of the cell response as well as for setting an absolute energy scale. It also leads to a sharp rise in the spectrum, as lower energies can only be deposited by ionizing particles that pass only a fraction of the thickness at the edges of sensitive cells or that are stopped. The region below half a MIP, corresponding to around 0.1 MeV, is shaded in dark grey. These cell energies are very small and will therefore be discarded in a realistic calorimeter, as their signal-to-noise ratio is too low. For the following discussion, cell energies below 0.1 MeV are therefore not considered, and only cells above this cut-off are included in all other performance plots and distributions.

Next, the plot on the right shows the number of hits for three discrete photon energies (20 GeV, 50 GeV, and 80 GeV). Here, the GAN and WGAN setups slightly underestimate the total number of hits, while the BIB-AE accurately models the mean and width of the distribution. This behavior can be traced back to the left plot. Since we apply a cutoff removing hits below \(0.1\ \text {MeV}\), a model that does not correctly reproduce the hit-energy spectrum around the cut-off will have difficulties correctly describing the number of hits.
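The coupling between the cell-level cut and the hit count is easy to make concrete: both observables are computed from the same thresholded shower, so any mismodelling of the spectrum near 0.1 MeV shifts both. An illustrative helper (ours, not the authors' analysis code):

```python
import numpy as np

def shower_observables(shower, cut=0.1):
    """Number of hits and visible energy (in MeV) of a voxelized shower
    after the half-MIP cell cut of 0.1 MeV used in the paper."""
    cells = shower[shower > cut]
    return cells.size, cells.sum()
```

A model that piles up cell energies just below the threshold will lose those cells entirely from both \(n_{\text {hit}}\) and \(E_{\text {vis}}\), which is exactly the behavior traced back to the left plot of Fig. 6.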

Additional distributions are shown in Fig. 7. The top left depicts the visible energy distribution for the same three discrete photon energies. The shape, center and width of the peak are all well reproduced by the models. Due to the sampling nature of the calorimeter under study, the visible energy is of course much lower than the incoming photons’ energy.

In the top right and bottom two plots we compare the spatial properties of the generated showers. First, on the top right, the position of the center of gravity along the z axis is shown. The Geant4 distribution is well modelled by the GANs; however, there are slight deviations for the BIB-AE. A detailed investigation of this discrepancy showed that the z-axis center of gravity is largely encoded in a single latent space variable. A mismatch between the observed latent distribution for real samples and the normal distribution drawn from when generating new samples directly translates into the observed difference. Sampling from a modified distribution would remove the problem.

Finally, the two plots on the bottom show the longitudinal and radial energy distributions. We see that while all models are able to reproduce the bulk of the distributions very well, deviations for the WGAN appear around the edges.

We next test how well the relation of visible energy to the incident photon energy is reproduced. To this end we use a Geant4 sample where we simulated photons at discrete energies ranging from 20 to 90 GeV in 10 GeV steps. We then use our models to generate showers for these energies and calculate the mean and root-mean-square of the \(90\%\) core of the distribution, labeled \(\mu _{90}\) and \(\sigma _{90}\) respectively, for all sets of showers. The results are shown in Fig. 8. Overall the mean (left) is modelled correctly, showing deviations of only one to two percent. The relative width, \(\sigma _{90}/\mu _{90}\) (right), looks worse: GAN and WGAN overestimate the Geant4 value at all energies. While the BIB-AE on average correctly models the width, it still shows deviations of up to ten percent at high energies. Note that the width cannot be interpreted as the energy resolution of the calorimeter due to the two different absorber thicknesses used in the ECal, which require different calibrations.
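A common construction of \(\mu_{90}\) and \(\sigma_{90}\) takes the smallest interval containing 90% of the entries and computes mean and RMS within it; whether the paper uses exactly this interval definition is our assumption, so the sketch below is illustrative:

```python
import numpy as np

def core90(values):
    """Mean and RMS of the 90% core of a distribution: the shortest
    window over the sorted values that contains 90% of the entries."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(0.9 * v.size))
    # Span of every contiguous window of k entries; pick the shortest.
    spans = v[k - 1:] - v[:v.size - k + 1]
    i = int(np.argmin(spans))
    core = v[i:i + k]
    return core.mean(), core.std()
```

Truncating to the core makes both statistics robust against the sparse tails of the visible-energy distribution, which would otherwise dominate the RMS.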

Fig. 9: Linear correlation coefficients between various quantities described in the text in Geant4 (top left). Difference between these correlations in Geant4 and GAN (top right), Geant4 and WGAN (bottom left), and Geant4 and BIB-AE with post processing (bottom right). The mean absolute differences compared to Geant4 are 0.058 for the GAN, 0.187 for the WGAN and 0.132 for the BIB-AE

Fig. 10: Scatter plot showing the correlations between visible energy and number of hits (top) and visible energy and center of gravity (bottom)

Finally, we verify whether correlations between individual shower properties present in Geant4 are correctly reproduced by our generative setups. The properties chosen for this are: the first and second moments in x, y and z direction, labeled as \(m_{1,x}\) through \(m_{2,z}\); the visible energy deposited in the calorimeter \(E_{\text {vis}}\); the energy of the simulated incident particle \(E_{\text {inc}}\); the number of hits \(n_{\text {hit}}\); and the ratio between the energy deposited in the 1st/2nd/3rd third of the calorimeter and the total visible energy, labeled \(E_{1}/E_{\text {vis}}\) through \(E_{3}/E_{\text {vis}}\). The results are shown in Fig. 9. The top left plot shows the correlations for Geant4 showers. We then present the difference to Geant4 for the GAN (top right), WGAN (bottom left), and BIB-AE (bottom right). The smallest differences are observed for the GAN (absolute maximum difference of 0.2), followed by the BIB-AE (0.36) and the WGAN (0.57).
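The moments entering the correlation study can be computed directly from the voxelized shower. A sketch for the z direction, assuming the shower array is indexed as [x, y, z] (an indexing convention we adopt for illustration):

```python
import numpy as np

def z_moments(shower):
    """Energy-weighted first and second moments along z of a 3D shower.
    m1 is the longitudinal center of gravity (m_{1,z}); m2 is the
    energy-weighted variance around it (m_{2,z})."""
    profile = shower.sum(axis=(0, 1))      # longitudinal energy profile
    z = np.arange(profile.size, dtype=float)
    w = profile / profile.sum()
    m1 = np.sum(w * z)
    m2 = np.sum(w * (z - m1) ** 2)
    return m1, m2

# With one row of observables per shower (columns e.g. E_vis, n_hit,
# m_1z, ...), the linear correlation coefficients of Fig. 9 follow from
# np.corrcoef(observables, rowvar=False).
```

The x and y moments follow by summing over the other two axes; all remaining observables in the list are simple sums over sub-volumes of the same array.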

Figure 10 shows examples of 2D scatter plots: the number of hits versus the visible energy (top row) and the center of gravity versus the visible energy (bottom row). These give insight into the full correlations between these variables, beyond the simple correlation coefficients. As in Fig. 9, we see that the GAN matches the Geant4 correlations exceptionally well, while the WGAN and the BIB-AE display some slight correlation mismatches. The discrepancy in the BIB-AE correlation between center of gravity and visible energy can be traced back to the mismodelling of the center of gravity seen in Fig. 7.
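The joint distributions underlying such scatter plots can also be inspected numerically via 2D histograms, which capture structure that a single correlation coefficient misses. A sketch with toy data (the linear relation between hit count and visible energy is illustrative only, not taken from the paper):

```python
import numpy as np

# toy stand-in for the per-shower quantities in Fig. 10
rng = np.random.default_rng(2)
e_vis = rng.normal(1000.0, 60.0, size=10000)            # MeV
n_hit = np.rint(0.7 * e_vis + rng.normal(0.0, 30.0, 10000))

# 2D histogram of the joint distribution; comparing such histograms
# between Geant4 and a generator goes beyond a correlation coefficient
counts, e_edges, n_edges = np.histogram2d(e_vis, n_hit, bins=40)
corr = np.corrcoef(e_vis, n_hit)[0, 1]
```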

The distributions of physical observables shown above are expected to be the major factor in assessing the quality of a simulation tool. While the correlations are also useful, as they provide additional insight, our main focus when evaluating network performance is the physics distributions.

The Importance of Post Processing

Fig. 11

Differential distributions comparing physics quantities between Geant4 and the different generative models. The energy per-cell is measured in MeV for the bottom axis and in multiples of the expected energy deposit of a minimum ionizing particle (MIP) for the top axis

In the previous section we demonstrated that our proposed architecture, the BIB-AE with a post processor network, achieves excellent performance in simulating important calorimeter observables. In the following, we dissect this improvement. To this end we compare four setups to Geant4: a WGAN trained with an additional simple MMD kernel (labelled WGAN MMD), a WGAN trained with the full post processing (labelled WGAN PP), a BIB-AE without post processing (labelled BIB-AE), and the combined BIB-AE network including post processing (labelled BIB-AE PP) from the main text. We do not investigate a simple GAN with post processing, as we expect it to exhibit largely the same behaviour as the WGAN.

In Fig. 11 we show the performance of these approaches. The top left panel demonstrates that removing post processing from the BIB-AE leads to a smeared-out MIP peak, while adding the simple MMD term or the more complex post processing to the WGAN results in good modelling of the per-cell hit energy spectrum. However, this improvement comes at a price: the distribution of the number of hits (top right) becomes too narrow compared to Geant4, and the longitudinal (bottom center) and radial (bottom right) energy profiles are described badly, as additional energy is deposited at the edges of the shower. Especially noticeable is the additional energy in the first and last layers. This would be problematic for standard reconstruction methods that rely on the precise position of the shower start and end. These energy deposits along the image edges are the main reason why the BIB-AE Post Processor is implemented as a separate network rather than integrated into the main decoder structure. The latter would require applying the MMD loss to the entire decoder, which in our tests led to energy deposits similar to those seen in the WGAN MMD line.
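For orientation, a squared maximum mean discrepancy with a Gaussian kernel, the generic form behind the "MMD kernel" terms discussed above, can be sketched as follows. The kernel choice and width are our assumptions for illustration; the paper's exact kernel and loss weighting are not reproduced here.

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Squared maximum mean discrepancy between two samples.

    x, y: (n, d) arrays, e.g. per-cell hit energies of real and
    generated showers.  Uses a Gaussian kernel of width `sigma`
    (assumption: chosen here only for illustration).
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n, m = len(x), len(y)
    kxx = (k(x, x).sum() - n) / (n * (n - 1))   # drop self-similarity diagonal
    kyy = (k(y, y).sum() - m) / (m * (m - 1))
    kxy = k(x, y).mean()
    return kxx + kyy - 2.0 * kxy

# toy usage: identical vs shifted distributions
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 1))
b = rng.normal(0.0, 1.0, size=(200, 1))     # same distribution as a
c = rng.normal(3.0, 1.0, size=(200, 1))     # shifted distribution
```

Minimising such a term pulls the generated per-hit energy distribution toward the real one, which is how the sharp MIP peak can be recovered; the edge-energy artefacts discussed above arise when this pull is applied to the full decoder output.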

While we were not able to improve the WGAN approach via post processing, we are not aware of any fundamental reason why a similar method should not achieve better performance for GAN and WGAN based architectures as well. One reason why AE based architectures might allow better training of post processing steps, however, is the higher correlation between real input and generated samples induced by the latent space embedding. Nonetheless, the ability of the BIB-AE framework to make use of this post processing setup motivates future studies of this rather novel architecture for calorimeter shower generation.

Computational Performance

Beyond the physics performance of our generative models, discussed in the previous section, the major argument for these approaches is of course the potential gain in production time. To this end, we benchmark the per-shower generation time on both CPU and GPU hardware. In Table 1, we provide the performance for four batch sizes for the WGANFootnote 5 and three for the BIB-AE. Evaluating the generative models on a GPU yields a speed-up over Geant4 on a CPU of up to almost a factor of three thousand. Moreover, the evaluation time of our generative models is independent of the incident photon energy, while this is not the case for the Geant4 simulation.
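A per-shower timing measurement of this kind can be sketched as below. The dummy generator sampling \(30\times 30\times 30\) noise volumes is purely a stand-in (assumption: the real benchmark calls the trained WGAN or BIB-AE decoder instead), but the harness, warm-up call, repeated batches, and normalisation by the number of showers, is the generic recipe.

```python
import time
import numpy as np

def time_per_shower(generate, batch_size, n_batches=10):
    """Average wall-clock generation time per shower for one batch size."""
    generate(batch_size)                     # warm-up (allocations, caches)
    start = time.perf_counter()
    for _ in range(n_batches):
        generate(batch_size)
    elapsed = time.perf_counter() - start
    return elapsed / (n_batches * batch_size)

def dummy_generator(batch_size, _rng=np.random.default_rng(0)):
    # stand-in for a trained generator network producing 30x30x30 showers
    return _rng.normal(size=(batch_size, 30, 30, 30))

t1 = time_per_shower(dummy_generator, 1)
t100 = time_per_shower(dummy_generator, 100)
# larger batches typically amortise per-call overhead, which is why the
# benchmark is reported for several batch sizes
```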

Table 1 Overview of computational performance of WGAN and BIB-AE model, compared to Geant4 full simulation

Conclusion

The accelerated simulation of calorimeters with generative deep neural networks is an active area of research. Early works [1, 25, 26] established generative networks as a fast and very promising tool for particle physics and simulated the positron, photon, and charged pion response of an idealised perfect calorimeter with three layers and a total of 504 cells (\(3 \times 96\), \(12\times 12\), and \(12 \times 6\)).

Using the WGAN architecture together with an energy constrainer network [28] allowed the correct simulation of the observed total energy of electrons for a calorimeter consisting of seven layers with a total of 1260 cells (\(12\times 15\) cells per layer). However, a mismodelling of individual cell energies below 10 MIPs was observed and studied, which also leads to a deviation in the hit multiplicity distribution. Our implementation of a WGAN based on [28] reproduces this effect (see Fig. 6 (left)). The proposed BIB-AE architecture, with its additional MMD loss term and Post Processor Network, leads to a reliable description of low energy deposits.

The ATLAS collaboration also reported the accurate simulation of high-level observables for photons in a four-layer calorimeter segment with a total of 276 cells (\(7 \times 3\), \(57 \times 4\), \(7\times 7\) and \(7 \times 5\)) using a VAE architecture [31], and with 266 cells using a WGAN [32]. Recent progress was made in applying a GAN architecture to the simulation of electrons in a high granularity calorimeter prototype [29]. The considered detector consists of 25 layers with \(51\times 51\) cells per layer, leading to a total of 65k cells to be simulated. On this very challenging problem, good agreement with Geant4 was achieved for a number of differential distributions and correlations of high-level observables. The per-cell energy distribution was not reported; however, the disagreement in the hit multiplicity again implies a mismodelling of the MIP peak region.

Our specific contribution is the first high-fidelity simulation of a number of challenging quantities relevant for downstream analysis, including the overall energy response and the per-cell energy distribution around the MIP peak, for a realistic high-granularity calorimeter. This is made possible by the first application in physics of the BIB-AE architecture, which unifies GAN and VAE approaches. Modifications to this architecture, specifically an additional kernel-based MMD loss term and a Post Processor Network, were developed. These improvements can potentially also be applied to other generative architectures and models. Planned future work includes the extension of this approach to cover multiple particle types, incident positions and angles, working towards a complete, fast, and physically reliable synthetic calorimeter simulation.