Introduction

The Large Hadron Collider (LHC), built by the European Organization for Nuclear Research (CERN), is the world’s largest particle collider. The LHCb experiment [1] at the LHC focuses on heavy flavor physics, precise measurements of CP violation, and other effects within and beyond the Standard Model. The LHCb detector consists of several components, including an electromagnetic calorimeter (ECAL). The ability to simulate the expected detector response is a vital requirement for the physics analysis of the collected data and for extracting physics results. The use of the Geant4 package [2] for simulating detector responses is computationally expensive and resource-intensive, which motivates efforts to speed this process up.

Previously, it was shown [3] that GAN-based frameworks have the potential to serve as fast generative models that speed up the simulation. The auxiliary regression extension [4], which introduces physics metrics to the discriminator part of the model and allows it to detect objects with poorly reproduced properties, demonstrated improvements in the physics quality of generated objects. We proposed training the model in a multitask manner using two objectives: an adversarial and a regressive one. However, we used spectral normalization to achieve training stability, which resulted in a reduction of the model’s capacity, making it difficult for the model to solve multiple tasks at the same time.

In this paper, we study the relationship between the model’s Lipschitz constant and the quality of generated objects. We propose a regularization technique that makes it possible to balance the model’s capacity against its stability. We compare the quality achieved using different regularization techniques, evaluating the generated objects in terms of both general and physics metrics, including shower asymmetry, shower width, and sparsity level, and study the relationship between the learning rate, the method’s hyperparameters, and the quality of generated objects.

GANs in High Energy Physics

The generative adversarial network (GAN) approach is a prominent technique for developing generative models. The GAN framework comprises two components, the generator and the discriminator, which compete against each other to recreate objects from a given distribution [5]. The generator (G) learns to map an easy-to-sample distribution, such as a standard normal distribution, to the target distribution. The discriminator (D) provides feedback to improve the generator by detecting discrepancies between real reference objects and ones generated by the model. Mathematically, this can be represented as a mini-max game:

$$\begin{aligned} \min _G\max _D \mathrm {E_{x \sim p_{data}(x)}}\left[\log D(x)\right]+\mathrm {E_{z\sim N(0,I)}}\left[\log (1-D(G(z)))\right], \end{aligned}$$
(1)

where \(p_{data}\) is the true data distribution and D(x) is the output of the discriminator.
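
To make the objective concrete, the following is a minimal, illustrative PyTorch sketch of one alternating optimization step for (1); the architectures, optimizers, and batch handling are placeholders, not the setup used in this work.

```python
import torch
import torch.nn.functional as F

# Illustrative only: G and D are assumed to be torch.nn.Module instances,
# with D returning raw logits of shape (batch, 1); opt_g and opt_d are
# their respective optimizers. None of these reflect our actual setup.
def gan_step(G, D, real, opt_d, opt_g, latent_dim=128):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                       # block gradients into G
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: the common non-saturating variant of minimizing
    # E[log(1 - D(G(z)))], i.e. maximize E[log D(G(z))].
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```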

A GAN can be extended to a conditional model (CGAN) if both the generator and the discriminator are conditioned on additional information y, such as class labels or some property of the objects we want to generate:

$$\begin{aligned} \min _G\max _D \mathrm {E_{x \sim p_{data}(x)}}\left[\log D(x \mid y)\right]+\mathrm {E_{z\sim N(0,I)}}\left[\log (1-D(G(z \mid y)))\right]. \end{aligned}$$
(2)
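
One common way to realize the conditioning in (2), shown here as a hypothetical sketch, is to concatenate the condition y with the generator’s latent input; the condition dimension and layer sizes below are illustrative, not those of our model.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: conditioning is realized by concatenating the
# condition y (e.g. particle momentum and impact point; cond_dim and the
# layer sizes are illustrative) with the latent noise z, giving G(z | y).
class CondGenerator(nn.Module):
    def __init__(self, latent_dim=128, cond_dim=6, out_dim=30 * 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=1))
```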

The idea of utilizing GANs for simulation in high energy physics was introduced by Paganini et al. [6] and subsequently developed in [7,8,9,10] with the implementation of Wasserstein conditional GANs; Ref. [11] investigated the optimization of GANs with hyperparameter scans, and Ref. [12] applied three-dimensional convolutional layers. Comparisons were made between the performance of GANs, models based on conditional variational autoencoders, and their combination, with the CGAN yielding the best results [13].

Through the Fast Calorimeter Simulation Challenge 2022 [14], a competition organized by the research community, novel GAN-based methods [15] were introduced, as well as flow-based [16,17,18] and diffusion-based ones [19,20,21].

Currently, diffusion models show state-of-the-art results in terms of the quality of generated objects. In this work, we focus on GANs, as they accelerate the simulation the most while providing appropriate quality, and we concentrate on the stability of the training process.

In our prior work [3], we proposed enhancing the generative model’s quality by adding self-attention layers to the previously best-performing architecture. This approach allowed convolutional neural networks to capture and employ long-range relationships between image regions throughout the training process. Spectral normalization [22] was also added to both models to stabilize the training process.

Later, in [4], we trained a GAN with additional regressors that evaluated the objects’ physics properties simultaneously with the adversarial objective. This setting led to better reproduction of the metrics that were introduced to the regressor.

Spectral Normalization for GANs

One of the key challenges in the training of GANs is controlling the performance of the discriminator. The procedure is highly unstable due to the alternating training of the two models; as a result, the density ratio estimation done by the discriminator may be inaccurate, generator networks fail to learn the multimodal structure of the target distribution, and gradients may vanish or explode [23]. These facts motivated researchers to introduce different forms of restrictions and regularization in the choice of the discriminator.

Spectral normalization, introduced in [22], outperforms other techniques such as weight normalization [23], weight clipping [24], and gradient penalty [25], as the Lipschitz constant is the only hyperparameter to be tuned, and the power iteration method [26], used under the hood of this approach, is relatively fast.

A neural network \(f_\theta\) with parameters \(\theta\) is called Lipschitz continuous if there exists a constant \(c\ge 0\) such that:

$$\begin{aligned} {\Vert f_\theta (t_0) - f_\theta (t_1) \Vert _p} \le c\ {\Vert t_0 - t_1\Vert _p}, \end{aligned}$$
(3)

for all possible inputs \(t_0, t_1\) under a p-norm of choice. The parameter \(c\) is called the Lipschitz constant. Intuitively, this constant \(c\) bounds the rate of change of \(f_\theta\).

Spectral normalization controls the Lipschitz constant of the discriminator function D by constraining the spectral norm of each layer \(g({\varvec{h}}) = W{\varvec{h}}\). By definition, the Lipschitz norm \(\Vert g\Vert _\textrm{Lip}\) is equal to \(\sup _{\varvec{h}}\sigma (\nabla _{{\varvec{h}}} g({\varvec{h}}))\), where \(\sigma (A)\) is the spectral norm of the matrix A (the \(L_2\) matrix norm of A):

$$\begin{aligned} \sigma (A):= \max _{{\varvec{h}}: {\varvec{h}}\ne {\varvec{0}}} \frac{\Vert A{\varvec{h}}\Vert _2}{\Vert {\varvec{h}}\Vert _2} = \max _{\Vert {\varvec{h}}\Vert _2\le 1} \Vert A{\varvec{h}}\Vert _2, \end{aligned}$$
(4)

which is equivalent to the largest singular value of A. Thus, for a given layer \(g({\varvec{h}}) = W{\varvec{h}}\), the norm is given by \(\Vert g\Vert _\textrm{Lip} = \sup _{\varvec{h}}\sigma (\nabla g({\varvec{h}})) = \sup _{\varvec{h}}\sigma (W) = \sigma (W)\).

It was shown in [22] that if we normalize each \(W^l\) using spectral normalization, we can guarantee that \(\Vert f\Vert _\textrm{Lip}\) is bounded from above by 1. Thus, vanilla spectral normalization requires no hyperparameters to be tuned; however, we have no ability to control the degree of regularization.
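
For reference, the sketch below shows the power iteration estimate of \(\sigma (W)\) that spectral normalization relies on; it is a simplified illustration of what, e.g., torch.nn.utils.spectral_norm performs per layer, not our exact implementation.

```python
import torch

def spectral_norm_power_iter(W, n_iters=1, u=None, eps=1e-12):
    """Estimate sigma(W), the largest singular value, by power iteration.

    A simplified sketch of what spectral normalization [22] does per layer
    (cf. torch.nn.utils.spectral_norm); dividing W by the returned estimate
    bounds the layer's Lipschitz norm by approximately 1.
    """
    W = W.reshape(W.size(0), -1)      # flatten conv kernels to a 2-D matrix
    if u is None:
        u = torch.randn(W.size(0))
    for _ in range(n_iters):
        v = W.t() @ u
        v = v / (v.norm() + eps)
        u = W @ v
        u = u / (u.norm() + eps)
    sigma = torch.dot(u, W @ v)       # Rayleigh-quotient estimate of sigma(W)
    return sigma, u                   # u is reused across steps in practice
```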

Lipschitz Networks

Reduction of the Capacity Background

We encountered the expressivity reduction introduced by the normalization ourselves during the research presented in [4]. We suggested training the GAN in a multitask manner, adding regressors that evaluate some predefined properties of the objects. These regressors share layers and weights with the discriminator, and we use the mean squared error as a loss function for this part of the network; therefore, the new objective function has the following form:

$$\begin{aligned} {\mathcal {L}}(\theta ) = {\mathcal {L}}_{adv}(\theta ) + \sum _{ k=1 }^{K}\frac{\alpha _k}{N} \sum _{i = 1}^{N} (o_{i} - {\tilde{o}}_{i})^2, \end{aligned}$$
(5)

where \({\mathcal {L}}_{adv}(\theta )\) is the adversarial part of the discriminator’s loss, K is the number of object properties that we evaluate via regressors, \(\alpha _k\) is the weight of the k-th regression loss, N is the number of objects, \(o_{i}\) is the real property value, and \({\tilde{o}}_{i}\) is the predicted value.

We expected the discriminator to focus on specific features of the objects that pass through it, forcing the generator to reproduce them better. As the model now has to optimize multiple loss functions and solve additional tasks, it is required to be more expressive. As the transverse cluster asymmetry was reproduced the worst, we set its values as targets for the regressive part of the model.
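
As an illustration, objective (5) can be assembled as in the hypothetical sketch below, where the per-property predictions, targets, and weights \(\alpha _k\) are assumed to be provided by the regressor heads sharing weights with the discriminator.

```python
import torch

def discriminator_loss(adv_loss, preds, targets, alphas):
    """Multitask objective (5), as a hypothetical sketch.

    `preds` and `targets` are K paired tensors of per-object property values
    (e.g. the transverse cluster asymmetry) produced by regressor heads that
    share weights with the discriminator; `alphas` holds the weights alpha_k.
    """
    loss = adv_loss
    for alpha_k, pred, target in zip(alphas, preds, targets):
        loss = loss + alpha_k * torch.mean((pred - target) ** 2)  # alpha_k/N * sum
    return loss
```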

Without spectral normalization, the regression part performed better: the model achieved a lower mean squared error (the second term in (5)); nevertheless, the generative part was highly unstable. However, if we add the regression part to the model with normalized layers, the quality of generated samples does not improve as we expected, since the regression task is not solved with appropriate quality.

These results led us to the need to control the L-constant of the discriminator, balancing between training stability and the model’s capacity.

Regularization Methods

Through the application of spectral normalization it is possible to stabilize the training procedure; however, it directly affects the expressivity of the network. Intuitively, it would be useful to have the ability to control the degree of regularization, balancing between stability and the network’s expressive power.

The general idea here is to design a regularization based on the architecture of k-Lipschitz networks as described in [27], constraining all the layers to be 1-Lipschitz and multiplying the final layer by a constant k to make the network k-Lipschitz. Making k trainable allows us to treat it as a regularization term (Alpha-k); thus, the augmented loss function \({\mathcal {J}}\), with the original loss function \({\mathcal {L}}(\theta )\) and weight \(\alpha\), looks as follows:

$$\begin{aligned} {\mathcal {J}}(\theta , k) = {\mathcal {L}}(\theta ) + \alpha k \end{aligned}.$$
(6)

A Lipschitz-like regularization may be redefined as the sum of squared Lipschitz bounds of each layer (\(L_{2}\)), so the generalized formulation from [26] looks as follows:

$$\begin{aligned} {\mathcal {J}}(\theta ) = {\mathcal {L}}(\theta ) + \alpha \sum _{i = 1}^l \Vert W_i \Vert _p^2 \end{aligned}.$$
(7)

It was reported that such a formulation fails to capture the exponential growth of the Lipschitz constant with respect to the network depth [28], so changing the architecture may require a new \(\alpha\).

Another approach, mentioned in [28], is to define the Lipschitz regularization directly on the weight matrices (\(L_{\infty }\)):

$$\begin{aligned} {\mathcal {J}}(\theta ) = {\mathcal {L}}(\theta ) + \alpha \prod _{i = 1}^l \Vert W_i \Vert _\infty \end{aligned}.$$
(8)

The approaches (6), (7), and (8) do not allow one to explicitly set a desired Lipschitz constant for the network; instead, their behavior must be tuned via hyperparameters. However, some settings require a constant that is known in advance, e.g., [24] requires a 1-Lipschitz critic. Using [27] may lead to a constant that is less than 1, affecting the network’s capacity and the gradients used to train the generator, as the only way to control the regularization is through the weight \(\alpha\). It would be intuitive to include this constant in the training objective in such a way that the network remains unaffected as long as the constant is below a target value, so we suggest defining the augmented loss function using a margin value m in a hinge-loss-style form:

$$\begin{aligned} {\mathcal {J}}(\theta ) = {\mathcal {L}}(\theta ) + \alpha \max \left(0, \prod _{i = 1}^l \Vert W_i \Vert _2 - m\right) \end{aligned}.$$
(9)
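
A minimal sketch of how the penalty in (9) could be computed is shown below; it bounds the network’s Lipschitz constant by the product of per-layer spectral norms and assumes the discriminator’s weight-bearing layers are passed in explicitly.

```python
import torch

def hinge_lipschitz_penalty(weight_layers, margin, alpha):
    """Hinge-style Lipschitz regularizer (9), as a minimal sketch.

    The product of per-layer spectral norms upper-bounds the network's
    Lipschitz constant; the penalty stays zero while this bound is below
    the margin m, leaving the model unaffected until it is exceeded.
    """
    bound = torch.ones(())
    for layer in weight_layers:
        W = layer.weight.reshape(layer.weight.size(0), -1)
        bound = bound * torch.linalg.matrix_norm(W, ord=2)  # sigma(W_i)
    return alpha * torch.clamp(bound - margin, min=0.0)

# Usage: loss = adversarial_loss + hinge_lipschitz_penalty(disc_layers, m, a)
# For speed, the exact norms could be replaced by power-iteration estimates.
```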

Learning Rate and L-constant

As our approach requires setting two hyperparameters, we study the relationship between them. It is shown in [29] that for a model with Lipschitz constant L to converge, the learning rate \(\gamma\) should meet the following requirement:

$$\begin{aligned} 0 \le \gamma \le \frac{2}{L} \end{aligned}.$$
(10)
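
As a purely illustrative calculation, assuming the model’s constant settles near the margin m, bound (10) translates the margins used in our scans into rough upper limits on the learning rate:

```python
# Illustrative arithmetic only: if the model's constant L settles near the
# margin m, bound (10) gives a rough upper limit on the learning rate.
for m in (1, 8, 64, 125):
    print(f"m = {m:3d}  ->  gamma <= 2/L ~ {2 / m:.4f}")
```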

In (6), (7), and (8) we cannot set the L value explicitly, but we can vary the margin value m in (9) and compare the quality of generated objects. For every particular constant we train the model 5 times, using the same pool of learning rates, which remain constant during the fitting procedure. The main objective here is to find a value that allows our model to start converging; however, we also tried to find the learning rate providing the fastest quality improvement, as it is always possible to set a relatively low learning rate at the cost of more iterations.

We define the quality gain QG as the mean difference in the quality of generated objects between adjacent epochs:

$$\begin{aligned} QG = \frac{1}{N_{e}}\sum _{i = 2}^{N_{e}} (PRD_{i} - PRD_{i - 1}) \end{aligned},$$
(11)

where \(PRD_{i}\) is the Physical PRD evaluated after the i-th epoch and \(N_e\) is the number of epochs used to evaluate the gain. As we want to find the point at which the model starts to converge and evaluate the speed of the quality gain, we use only the first 30 epochs to calculate it. We plot quality gains versus learning rate (Figs. 4, 5) to find the best \(\gamma\).
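
The gain (11) is straightforward to compute from a per-epoch sequence of Physical PRD scores; a minimal sketch:

```python
import numpy as np

def quality_gain(prd_scores, n_epochs=30):
    """Quality gain (11): mean change of Physical PRD between adjacent epochs.

    `prd_scores` is the per-epoch Physical PRD sequence; as in the text,
    only the first 30 epochs are used to evaluate the gain.
    """
    prd = np.asarray(prd_scores[:n_epochs], dtype=float)
    # (1/N_e) * sum_{i=2..N_e} (PRD_i - PRD_{i-1})
    return np.diff(prd).sum() / len(prd)
```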

ECAL Response Generation

Dataset

The dataset utilized in our experiments comprises information concerning electron interactions within the electromagnetic calorimeter (ECAL). The ECAL employs the "shashlik" technology, which consists of alternating layers of lead and scintillation plates. The readout cells within different modules are of varying sizes, with dimensions of 4 \(\times\) 4, 6 \(\times\) 6, and 12 \(\times\) 12 \(cm^2\), enabling the aggregation of responses of 2 \(\times\) 2 \(cm^2\) logical cells to obtain a response for all granularities. All events in the dataset correspond to electrons with a specific momentum and direction entering the calorimeter at a given location, resulting in the generation of an electromagnetic shower in the ECAL. The sum of all energies deposited in the scintillator layers of a single cell produces a matrix of energies corresponding to the ECAL response to the impacting electron. The dataset of such ECAL responses comprises 30 \(\times\) 30 matrices of 2 \(\times\) 2 \(cm^2\) cells, approximately centered on the energy cluster’s location, produced using the GEANT4 v10.4 package. Figure 1 shows a sample of these matrices.

Fig. 1 Sample of energy deposition matrices from the ECAL response dataset

We use 60,000 events for training and another 60,000 events as a test dataset. We apply the logarithmic transformation \(\log (x + 1)\) to the energy deposit matrices during training, as we did in [3, 4], since it provides better generation quality.
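
The transformation and its inverse are elementary; a small sketch for clarity:

```python
import numpy as np

# The log transform applied to energy deposit matrices during training
# and its inverse for mapping generated responses back to energies.
def to_log(energy):
    return np.log1p(energy)      # x -> log(x + 1)

def from_log(logged):
    return np.expm1(logged)      # x -> exp(x) - 1
```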

Quality Evaluation

In our case, the generated objects are not images in the usual sense, so we cannot use perception-based approaches that apply pretrained models to compare real and generated images. Thus, we use a precision-recall-based method.

Given a reference distribution P and a learned distribution Q, precision intuitively measures the quality of samples from Q, while recall measures the proportion of P that is covered by Q. As the model generates samples that are close to real ones, precision increases. Meanwhile, to avoid generating identical objects and to capture variety, the model needs to generate different samples that are close to different real samples, increasing the recall.

Throughout our experiments we use precision and recall for distributions (PRD) [30], an efficient algorithm to compute these quantities, to evaluate the performance of different models and compare the quality of generated samples.

As we want to evaluate the generated responses not only in terms of general quality but also in terms of physics metrics, we use the minimum of two PRD-AUC scores, evaluated over raw images (RAW PRD) and over a set of physics statistics (Physical PRD):

  • shower asymmetry along and across the direction of inclination;

  • shower width;

  • the number of cells with energies above a certain threshold (the sparsity level).

PRD requires discrete distributions as its input, so the evaluation pipeline looks as follows (a minimal sketch is given after the list):

  • combine the objects from the real and generated distributions;

  • use the energy deposit matrix as the feature representation of an object in the case of RAW PRD;

  • use the physical properties of the object as features in the case of Physical PRD;

  • cluster all objects using MiniBatchKMeans with 400 clusters based on the image or physical features;

  • evaluate the PRD over the pair of histograms built after the clustering procedure.
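
A minimal sketch of the clustering step, assuming scikit-learn’s MiniBatchKMeans and feature arrays prepared as above, could look as follows; the PRD curve itself would then be computed from the resulting histogram pair, e.g., with the reference implementation of [30].

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def prd_histograms(real_feats, gen_feats, n_clusters=400, seed=0):
    """Cluster the pooled features (raw pixels for RAW PRD, physics
    statistics for Physical PRD) and return the pair of discrete
    distributions that PRD [30] takes as input."""
    pooled = np.concatenate([real_feats, gen_feats], axis=0)
    labels = MiniBatchKMeans(n_clusters=n_clusters,
                             random_state=seed).fit_predict(pooled)
    n_real = len(real_feats)
    real_hist = np.bincount(labels[:n_real], minlength=n_clusters)
    gen_hist = np.bincount(labels[n_real:], minlength=n_clusters)
    return real_hist / real_hist.sum(), gen_hist / gen_hist.sum()

# The PRD curve itself is then computed from the two histograms, e.g.
# with the reference implementation accompanying [30].
```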

Results and Discussion

Fig. 2 RAW PRD-score of different regularization techniques, evaluated over raw images. Baseline represents the model without SN; Alpha-k, L2, Linf represent approaches (6), (7), and (8), respectively; Hinge represents our approach (9)

Fig. 3 Physical PRD-score of different regularization techniques, evaluated over physical properties ("Quality Evaluation"). Baseline represents the model without SN; Alpha-k, L2, Linf represent approaches (6), (7), and (8), respectively; Hinge represents our approach (9)

In order to compare the behavior of the different approaches, we train the same model multiple times using different regularization terms and weights \(\alpha\) to achieve different L-constants, as \(\alpha\) is the only hyperparameter we can choose ourselves. Our approach requires setting the margin value, so we divide the runs into 6 groups, each representing runs with close values of the achieved L-constant. We then take the average of the constants in every group achieved using the other methods and set it as the margin for our loss. We choose the best model based on the PRD evaluated on the set of physical metrics, as described in "Quality Evaluation".

The previously mentioned methods, e.g., (6), were highly sensitive to their only hyperparameter \(\alpha\), the weight of the regularization term. Most of the runs resulted either in a highly unstable training procedure (low \(\alpha\), high L-constant) or in neither the generative nor the regressive loss improving (high \(\alpha\), low expressivity). In our approach, we can explicitly pass the desired constant as a hyperparameter, and even though the model is not guaranteed to end up with exactly this value as its constant, we achieve the desired behavior: the model’s constant becomes close to the margin m, allowing us to directly search for the L value that provides the best performance.

Another advantage comes from the fact that our loss does not penalize the model if its constant is less than the margin; thus, we can pick a relatively high \(\alpha\) without it leading to degradation of the constant. If we set the margin \(m = 0\), the formulation of our loss becomes close to (8); however, it can be clearly seen from Fig. 3 that \(L_{\infty }\) without a margin yields worse quality of the reproduced physics metrics, as it penalizes the model from the start, even at a low L-constant.

During the hyperparameter tuning of the other approaches, we often faced the problem of the L-constant taking values below one. This results in poor quality, leading us to omit these results from the plot, as they were far worse than the 1-constant baseline. This quality degradation can be explained by the model’s decreased capacity.

According to Figs. 2 and 3, all methods have sweet spots providing the best possible quality. When the constant is close to 1, the model has lower capacity; as we increase the constant, the training procedure becomes unstable. However, our approach does not penalize the model if its constant is less than the margin, which leads to better results when we allow the model to reach a higher constant. Even when we set the margin equal to 1, we improve on the baseline quality, since we allow individual layers to vary their constants: we only require the product to be lower than one, not every single layer constant.

Fig. 4 The relationship between the quality gain and the learning rate. The 2/L lines are plotted as dotted ones

Fig. 5 The relationship between the quality gain and the learning rate, zoomed in

According to Figs. 4 and 5, the higher the margin and the constant, the lower the learning rate we should use. This fact should be taken into account for our approach; otherwise, we may face a situation where \(\gamma > \frac{2}{L} \ge \frac{2}{m}\): the model does not converge, and our loss does not push the model to decrease its constant.

Figure 4 shows the overall situation, while Fig. 5 focuses on the part that is most important for our needs: the learning rate values at which the models start to converge. These values differ slightly from the theoretical estimations. We may notice that models with relatively low margins (1 and 8) start to converge only at learning rates below the theoretical estimations, while models with relatively high margins (64 and 125) start to converge at even slightly higher learning rates than expected. This behavior may be explained by the fact that we do not set the model’s L-constant explicitly but only penalize values above the margin; thus, the model may reach an even lower constant on its own and start to converge at a slightly higher learning rate.

Even though the quality gain values of all the models with different margins trained using learning rates \(\gamma \in [10^{-4}, 10^{-2}]\) are close to each other, and these rates may be a reliable range to pick from, Figs. 2 and 3 show that different margins yield different best achieved qualities over the whole training.

Conclusion

In this paper we propose a novel adaptive normalization technique that allows us to control the balance between a GAN’s capacity and its training stability. We successfully apply this technique to the LHCb ECAL dataset, improving on previously published quality metrics.

To deal with the model’s capacity reduction, we introduce an additional loss with a margin hyperparameter, which avoids penalizing the model while its constant stays below the predefined value. We demonstrate that it improves the quality of generated objects compared with other techniques.

We also consider the relationship between the margin and the learning rate and empirically show that as we increase the margin, we should decrease the learning rate for better model convergence. This relationship should be taken into account during the optimization of the training hyperparameters. The problem of searching for the margin value that provides the best quality and stability is a subject for future work.

Fast simulation applications would benefit from the proposed method, as it allows stabilizing GAN training while boosting quality. Our hinge regularization is a generic technique that is not inherently restricted to the LHCb ECAL case. We would be interested in applying our regularization to other architectures, including non-GAN-based ones, as well as to other tasks and datasets.