GAN-AE: an anomaly detection algorithm for New Physics search in LHC data

In recent years, interest has grown in alternative strategies for the search for New Physics beyond the Standard Model. One envisaged solution lies in the development of anomaly detection algorithms based on unsupervised machine learning techniques. In this paper, we propose a new Generative Adversarial Network-based auto-encoder model that allows both anomaly detection and model-independent background modeling. This algorithm can be integrated with other model-independent tools in a complete heavy resonance search strategy. The proposed strategy has been tested on the LHC Olympics 2020 dataset with promising results.


Introduction
The search for New Physics beyond the Standard Model is one of the main goals of high-energy physics. A fairly common strategy is to search for a localized deviation in an invariant mass spectrum that could correspond to a new heavy particle. This kind of search usually depends on accurate simulations of the Standard Model processes and also on several signal hypotheses. However, simulating data from experiments such as ATLAS [1] is computationally intensive and is limited by modelling uncertainties. Also, assuming a signal model without knowing what lies beyond the Standard Model can be a source of bias that reduces the generalizability of an analysis.
To overcome these limitations, much effort has been put into defining generic search strategies that do not rely on specific theoretical models of New Physics. One possible solution is to use algorithms that do not need a specific signal model to train on, but still detect events that differ from the Standard Model predictions. Such unsupervised anomaly detection algorithms [2] can potentially identify anomalous events by evaluating an anomaly score, so that in the search for New Physics processes, signal events can be seen as an anomaly with respect to the Standard Model.
A well-known class of anomaly detection algorithms using unsupervised machine learning is the auto-encoder (AE) and its derivatives [3,4]. Such models can be trained directly on data with the only assumption that signal events are very rare. In the following sections, we present a GAN-AE algorithm inspired by AEs and generative models that allows for both anomaly detection and data-driven background modeling. This model is tested on the LHC Olympics 2020 challenge dataset [5] as a benchmark. For this search, a complete strategy including the model-independent BumpHunter algorithm [6] has been defined. The code used to build and train the GAN-AE algorithm on this dataset is accessible online 1.


The GAN-AE algorithm
The GAN-AE algorithm takes inspiration from the Generative Adversarial Network (GAN) [7]. Other algorithms propose similar models, such as Outliers Exposure [8] and Self-Adversarial AE [9]. In these works, the goal is either to constrain the latent space of an AE or to improve the sensitivity to anomalies in a semi-supervised setting. With the GAN-AE algorithm, the objective is to construct an alternative measure of reconstruction error using a multilayer perceptron network trained to distinguish reconstructed and original events. Figure 1 shows a synoptic view of the GAN-AE architecture.
Figure 1: Schematic of the global layout of the GAN-AE architecture. The auto-encoder network (AE) is trained to produce reconstructed events that closely resemble the original events. The discriminator network (D) is trained to discriminate between reconstructed and original events with labels 0 and 1, respectively.
Traditionally, auto-encoders are trained using a possibly regularized measure of the (Euclidean) distance between their input and output. A well-known metric for this task is the Mean Square Error (MSE). In this work, we propose an alternative metric based on a supervised discriminator network trained to classify reconstructed events (labeled 0) and original events (labeled 1). This binary classifier (bc) model is trained with the usual binary cross-entropy loss function:

L_bc(y^(d), y^(l)) = −[ y^(l) log y^(d) + (1 − y^(l)) log(1 − y^(d)) ],   (1)

where y^(d) is the output of the discriminator and y^(l) the associated label.
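As a minimal illustration (a NumPy sketch, not the authors' implementation), the binary cross-entropy of equation 1 can be written as:

```python
import numpy as np

def binary_cross_entropy(y_d, y_l):
    """Binary cross-entropy between the discriminator output y_d
    and the target label y_l (0 = reconstructed, 1 = original)."""
    y_d = np.clip(y_d, 1e-7, 1 - 1e-7)  # guard against log(0)
    return -(y_l * np.log(y_d) + (1 - y_l) * np.log(1 - y_d))

# A maximally confused discriminator (output 0.5) yields a loss of ln 2
print(binary_cross_entropy(0.5, 1.0))
```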
In order to train this two-party GAN-AE network, we define a training procedure divided into two main phases. The first step is to train the discriminator network parameters θ_D with a mixture of original data and events reconstructed by the AE. Parameters θ_D are then updated for a few epochs while keeping the parameters θ_AE of the AE fixed.
The second step is to train the auto-encoder parameters θ_AE using the discriminator output as a constraint. This training is done with a special loss function that combines both the usual distance metric and the information coming from the discriminator. The distance metric used is a modified Euclidean distance defined as:

D(y^(o), y^(r)) = sqrt( (1/N) Σ_{i=1}^{N} (y_i^(o) − y_i^(r))^2 ),   (2)

with y^(o) the input vector (original event), y^(r) the output vector (reconstructed event) and N the dimension of both vectors. The constraint of the discriminator is introduced by modifying the binary cross-entropy loss function defined in equation 1. In fact, while the goal of the discriminator is to correctly identify reconstructed events associated with the label '0', the goal of the AE is, on the contrary, to confuse the discriminator network. Thus, the AE must be trained so that the output of the discriminator comes closer to the label '1' corresponding to (real) original events. This can be achieved by computing the binary cross-entropy loss of the discriminator using reconstructed events associated with the label of the original events as the target. The two metrics are then combined to define the loss for a given event k as follows:

L_k = ε D(y_k^(o), y_k^(r)) + (1 − ε) L_bc(y_k^(d), 1),   (3)

with ε a hyperparameter that balances the relative importance of the two terms. This loss is used to update θ_AE for a few epochs.
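A compact NumPy sketch of the two terms of the AE loss (assuming an RMS-style normalization for the distance of equation 2 and an ε-weighted combination; function names are illustrative, not the authors' code):

```python
import numpy as np

def reco_distance(y_o, y_r):
    """Modified Euclidean distance between an original event y_o and its
    reconstruction y_r, normalized by the input dimension N (eq. 2)."""
    y_o, y_r = np.asarray(y_o, float), np.asarray(y_r, float)
    return np.sqrt(np.mean((y_o - y_r) ** 2))

def ae_loss(y_o, y_r, disc_out, eps=0.5):
    """Per-event AE loss (eq. 3): distance term plus the binary
    cross-entropy of the discriminator output against the 'real' label 1."""
    disc_out = np.clip(disc_out, 1e-7, 1 - 1e-7)
    bc_term = -np.log(disc_out)  # cross-entropy with target label 1
    return eps * reco_distance(y_o, y_r) + (1 - eps) * bc_term

# Perfect reconstruction with a confused discriminator (output 0.5)
print(ae_loss([1.0, 0.0], [1.0, 0.0], 0.5))
```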
The AE has an architecture composed of five layers: the encoding part with the input layer, a hidden layer and the latent space, and a decoding part that is exactly symmetrical to the encoding part. The activation function used for the hidden layers is the LeakyReLU function, while the latent space and output are linear. As an additional constraint, we use the tied-weight trick discussed in [10] to impose that the weight tensors of the decoder are the transposes of those of the encoder:

W^(n−k) = (W^(k))^T,   (4)

where W^(k) is the weight tensor between layers k and k+1 of the encoder and W^(n−k) is the weight tensor between layers n−k and n−k−1 of the decoder. Dropout is applied to each hidden layer.
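The tied-weight constraint of equation 4 can be sketched as a NumPy forward pass (a toy 45-16-4 layout chosen for illustration; no dropout or training is included):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

# Encoder weights for a toy 45 -> 16 -> 4 architecture; the decoder reuses
# their transposes (tied weights), so only the encoder weights are stored.
W1 = rng.normal(size=(45, 16))
W2 = rng.normal(size=(16, 4))

def autoencoder(x):
    h = leaky_relu(x @ W1)        # encoder hidden layer (LeakyReLU)
    z = h @ W2                    # linear latent space
    h_dec = leaky_relu(z @ W2.T)  # decoder hidden layer uses W2 transposed
    return h_dec @ W1.T           # linear output uses W1 transposed

x = rng.normal(size=45)
print(autoencoder(x).shape)  # (45,)
```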
The structure of the discriminator network is defined as a fully connected multilayer perceptron with 4 hidden layers using LeakyReLU activation. The output is one-dimensional with a sigmoid activation function compatible with the binary cross-entropy loss function. Dropout is applied to the hidden layers of the discriminator.
The main hyperparameters of the GAN-AE algorithm are reported in Table 2. In this architecture, the discriminator is used to enhance the training of the auto-encoder. However, in the application step, only the trained AE is actually used. The anomaly score is defined as the modified Euclidean distance (equation 2). Thus, the most anomalous events, here assimilated to the most signal-like events, can be identified as those with the highest anomaly score. The selected anomalous events can then be compared to a reference to test for the presence of an anomaly. The next section describes how to obtain this reference.
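In the application step, selecting the most anomalous events then reduces to a percentile cut on the anomaly score distribution; schematically (a toy NumPy sketch with illustrative data and threshold):

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.exponential(scale=1.0, size=100_000)  # stand-in anomaly scores

# Keep the 1% most anomalous events (selection at the 99th percentile)
threshold = np.percentile(scores, 99)
selected = scores > threshold

print(int(selected.sum()))  # about 1000 events survive the cut
```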

Background modelling and mass sculpting mitigation
In order to integrate the GAN-AE algorithm into a complete and fully data-driven search strategy, we propose a method to extract a viable background model directly from the data. This method is based on the hypothesis that the signal that we might expect to find in the data is a rare process, such that the data is dominated by the background. In this case, when performing a bump hunt in a relevant spectrum, such as an invariant mass, one would expect the signal to be invisible unless proper selections are made. Thus, the invariant mass spectrum prior to any selection could be assimilated to a background distribution.
However, in order to use this distribution as a reference background, we must ensure that its shape is not affected by the selection based on the anomaly score described in the previous section. Even if the GAN-AE model is trained without using the invariant mass as an input variable, this condition is generally not met, as illustrated in Figure 2. To get rid of the mass sculpting induced by the selection process, we propose two mitigation techniques that can be combined. First, an event weight is applied in order to uniformize the invariant mass distribution. This is done because otherwise events with a low invariant mass would be over-represented in the data compared to others, inducing a bias in the reconstruction error. Then, to further reduce the mass sculpting, the Distance Correlation (DisCo) regularization [11,12] is added to the loss of the auto-encoder. As it requires independent and identically distributed samples of the distributions to decorrelate, this term is defined for a batch of events.
By combining the DisCo regularization term and the event weighting, we can define the modified loss function of the auto-encoder:

L_AE = (1/N_b) Σ_{i=1}^{N_b} w_i L_i + α DisCo(y^(m), D_Σ),   (5)

with w_i the weight associated with event i, L_i the per-event loss of equation 3, N_b the number of events in a batch, α a new hyperparameter of the loss, y^(m) the vector of invariant mass values associated with a batch and D_Σ the vector of anomaly score values associated with a batch. Note that the event weights should not be applied when computing the DisCo regularization. Since the goal of this term is to decorrelate the invariant mass and anomaly score distributions, it is important to keep both distributions unchanged.
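The DisCo term can be illustrated with a standalone distance-correlation estimator (a NumPy sketch of the usual Székely sample estimator; the exact implementation of [11,12] may differ):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1D samples x and y:
    0 for independent variables, up to 1 for strongly dependent ones."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()  # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(2)
mass = rng.uniform(1, 5, size=500)                   # toy invariant masses
score_dependent = mass + 0.1 * rng.normal(size=500)  # score correlated with mass
print(distance_correlation(mass, score_dependent))   # close to 1
```

Adding α times this quantity to the loss penalizes any dependence between the invariant mass and the anomaly score within a batch.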
With this new loss function, we can ensure that the invariant mass distribution prior to the selection on the anomaly score is a valid reference model for the background. Now we need to compare this reference with the distribution of selected events in order to look for a localized deviation. For this purpose we use the pyBumpHunter package [13] 2, which provides an improved version of the BumpHunter algorithm [6] implemented in Python. This tool has the advantage of locating any deviation in a model-independent way, evaluating both local and global significance while accounting for the Look Elsewhere Effect [14]. Now we have all the tools needed to build a complete and model-independent strategy for resonant New Physics searches. The next section shows an example of application using a benchmark dataset.

Application to LHC Olympics 2020 data
In order to test and evaluate the performance of the techniques developed in the previous section, we use the public dataset proposed for the LHC Olympics 2020 challenge [5]. This dataset provides a good case study for testing and comparing anomaly detection algorithms in the context of model-independent New Physics searches. The strategy that we use for this challenge is illustrated in Figure 3. The challenge proposes a so-called RnD dataset [15] to assist the development of anomaly detection algorithms. This dataset is composed of a background sample containing QCD multijet events and a benchmark New Physics signal model. The signal events consist of a Z' boson with a mass of 3.5 TeV (inspired by [16]) decaying into two heavy resonances X and Y with masses of 500 GeV and 100 GeV, respectively. Two types of signal signatures are considered: one where both X and Y decay to two quarks and form boosted jets with 2-pronged substructure, and another where both X and Y decay to three quarks, resulting in boosted jets with 3-pronged substructure. A total of 1M events were generated for the background model, along with 100k events for each signal hypothesis. The events are generated using Pythia8 [17] and Delphes 3.4.1 [18] with no pile-up or multiple parton interactions included, and with a detector architecture similar to the ATLAS experiment. Events are selected using a large radius (R = 1) jet trigger with a p_T threshold of 1.2 TeV.
The anomaly detection algorithms are tested on three Black Box datasets [19] containing unknown event samples. The only information given to the challenge participants is that the events contain at least two jets, with a background modelling that differs from the RnD data. The goal is then to determine whether there is a hidden signal in the Black Boxes, and at what mass.
For each event, a list of up to 700 hadron 4-vectors is provided. Jets are reconstructed using the anti-kt algorithm implemented in the FastJet 3.3.3 library [20] with a large jet radius R = 1. A second clustering is performed within the large jets with a smaller radius r = 0.2 in order to characterize their substructure. The list of the variables computed in this preprocessing procedure is presented in Table 1. For a clustering in two large jets, we have a total of 45 variables. The code used to preprocess the data is publicly available 3.

Results on RnD data
In order to evaluate the performance of the GAN-AE algorithm and validate the background modeling procedure, we use the RnD dataset. The results are presented for a clustering in two large jets. The GAN-AE model is trained on 100k background events and tested on a mixture of background and signal. All variables listed in Table 1 are used in the training except for the di-jet invariant mass and the azimuthal angle ϕ of the jets, for a total of 42 input variables. The set of hyperparameters used to produce the results is shown in Table 2. The anomaly scores obtained for the background and both signal test samples are shown in Figure 4a. The corresponding ROC curves are shown in Figure 4b. The Area Under the Curve (AUC) obtained on the test set is 0.82 for the first RnD signal (2-prong) and 0.74 for the second (3-prong). These results confirm that the auto-encoder trained using the GAN-AE algorithm is able to distinguish the signal from the background.
Another point to check is the ability to remove the mass sculpting. The modeling of the reference background distribution, after applying a selection on the anomaly score, is evaluated using background events of the testing set. Figure 5a shows the normalized distribution of the di-jet invariant mass, before and after selection at different thresholds. To quantitatively assess the deformation of the invariant mass spectra induced by the selection, we use the Jensen-Shannon divergence as a metric [22]. By continuously varying the selection threshold, we can evaluate this metric to produce the curve shown in Figure 5b. Compared to the results shown in Figure 2, the invariant mass distribution is no longer modified when applying a selection based on the anomaly score. The fact that the Jensen-Shannon divergence stays below 0.1 up to a 99th percentile threshold indicates that the invariant mass distribution of the background before selection remains compatible with that after selection. By comparison, a GAN-AE model trained without the mass sculpting mitigation techniques results in the Jensen-Shannon divergence curve shown in Figure 6. This metric increases rapidly with the selection threshold, reaching more than twice the distance obtained with the mitigation techniques. This strong constraint on the mass sculpting is achieved simultaneously with the good signal-background separation shown in Figure 4. This is a clear improvement over classically trained auto-encoders, for which applying such constraints generally deteriorates the quality of the anomaly detection.
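The sculpting metric can be reproduced with SciPy's Jensen-Shannon distance between the normalized mass histograms before and after the cut. An illustrative sketch on toy data, where the score is uncorrelated with the mass so the distance stays small (real analysis values would differ):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(4)
mass = rng.exponential(2.0, size=50_000)  # toy invariant-mass sample
score = rng.normal(size=50_000)           # toy anomaly score, mass-independent

def sculpting(mass, score, percentile, bins=50):
    """Jensen-Shannon distance between the mass histogram before and
    after a cut at the given anomaly-score percentile."""
    cut = np.percentile(score, percentile)
    h_all, edges = np.histogram(mass, bins=bins, density=True)
    h_sel, _ = np.histogram(mass[score > cut], bins=edges, density=True)
    return jensenshannon(h_all, h_sel)

# Uncorrelated score: the mass shape is nearly unchanged by the cut
print(sculpting(mass, score, 99))
```

Cutting on a score strongly correlated with the mass would instead push this distance towards its maximum, which is the behavior the mitigation techniques are designed to prevent.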

Results on Black Box datasets
After validating the GAN-AE algorithm and the mass sculpting mitigation procedure, we can apply the complete strategy chain to the Black-Box datasets. For each Black-Box, a GAN-AE model is trained on 100k events using the set of hyperparameters presented in Table 2. The trained model is applied to each dataset in order to evaluate the anomaly score distribution. A selection is applied at the 99th percentile of this distribution. Then, the invariant mass distribution of the di-jets in this subsample is compared to the invariant mass distribution of the di-jets in the pre-selection data, which serves as a reference background. The reference histogram is normalized to the selected data using a side-band normalization procedure. Results obtained with pyBumpHunter for Black-Box 1 are presented in Figure 7. The BumpHunter algorithm finds a deviation in the data, with respect to the data-driven reference background, around 3.97 TeV with a local significance of almost 3σ (Figure 7a). No other significant excess, or deficit, is observed outside the selected interval. Figure 7b shows the background-only test statistic distribution from which a global significance of 1.2σ is derived. The low overall significance is partly explained by the fact that the bump hunt is performed without assuming a prior signal and with a floating background normalization. After the end of the challenge, the content of each Black-Box was revealed by the organisers. Figure 8 shows the histograms of the di-jet invariant mass in Black-Box 1, along with the true labels corresponding to background and signal events. The region of the spectrum identified by the BumpHunter algorithm is indeed the location of the true signal. The signal generated for this dataset corresponds to a 3.8 TeV Z' boson decaying into two heavy resonances with a 2-prong substructure jet signature similar to that of the RnD data.
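The side-band normalization step above rescales the reference histogram to the selected data outside the candidate signal window; a schematic NumPy version (the histograms and window are toy values, not the analysis data):

```python
import numpy as np

def sideband_normalize(ref_hist, data_hist, window):
    """Rescale ref_hist so that its integral outside the candidate signal
    window (a slice of bin indices) matches that of data_hist."""
    mask = np.ones(len(ref_hist), dtype=bool)
    mask[window] = False  # exclude the candidate signal bins
    scale = data_hist[mask].sum() / ref_hist[mask].sum()
    return ref_hist * scale

ref = np.array([100., 80., 60., 40., 20., 10.])  # pre-selection spectrum
data = np.array([10., 8., 6., 10., 2., 1.])      # bin 3 holds a potential bump
print(sideband_normalize(ref, data, window=slice(3, 4)))
```

Excluding the candidate window from the fit prevents a potential signal from biasing the background normalization.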
The initial signal over background ratio (S/B) is 0.08%. After applying the full strategy chain to this dataset, we obtain an improvement of the S/B ratio by a factor of 20. The signal efficiency after selection at the 99th percentile of the anomaly score distribution is over 15%, for a background rejection of almost 99%. We also note that the data-driven reference background fits the true background distribution after selection quite well. The deviation identified by BumpHunter corresponds to the true signal, with a small bias on the mass of the Z' (less than 200 GeV). The same methodology has been applied to the two other Black-Boxes and the results are summarized below. Black-Box 2 did not contain any signal, as this dataset was actually provided for the purpose of testing the identification of spurious signals. Our algorithm successfully modeled the shape of the background and found no significant deviations. The third Black-Box contained a complex signal signature, as the generated resonance could decay into either two or three jets, with branching ratios of one third and two thirds, respectively. In the case of Black-Box 3 and with the 2-jet clustering, the GAN-AE algorithm was unable to distinguish between signal and background events. However, the process of modeling the background shape from the data still worked.
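As a consistency check on the quoted numbers (treating "over 15%" signal efficiency and "almost 99%" rejection as approximately 15% and 99%), the S/B gain is simply the ratio of the two selection efficiencies:

```python
s_over_b_initial = 0.0008  # initial S/B of 0.08%
eff_signal = 0.15          # signal efficiency after the 99th-percentile cut
eff_background = 0.01      # background efficiency (99% rejection)

improvement = eff_signal / eff_background
print(improvement)                     # ratio of efficiencies
print(s_over_b_initial * improvement)  # approximate S/B after selection
```

With the "over"/"almost" qualifiers, this ratio of roughly 15 to 20 is consistent with the quoted factor of 20.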

Conclusion
The development of alternative search strategies for New Physics beyond the Standard Model has gained much importance in recent years. Events such as the LHC Olympics challenge proposed in 2020 are part of this effort. In this context, we propose a model-independent analysis strategy based on unsupervised machine learning and data-driven background modeling.
The GAN-AE algorithm offers an interesting alternative to the classical training of auto-encoders by defining a new measure of reconstruction error given by an adversary network. This algorithm offers good performance and stability, even when using strong constraints to reduce the mass sculpting, such as the DisCo regularization term. Thanks to this constraint, we can derive a reference background model directly from the data, with the only assumption that the signal is rare enough. The background model can then be used as a reference for the BumpHunter algorithm, which allows the evaluation of both local and global significance.
The strategy was tested using the LHC Olympics 2020 challenge datasets. The results on the RnD dataset as well as on the first Black-Box are promising, allowing us to correctly identify the hidden signal with a local significance of 2.9σ. This result is comparable to those obtained by other participants. Our strategy is also the only one to propose a built-in evaluation of the global significance, showing its completeness. A possible way to improve the method would be to include the GAN-AE algorithm in a weakly supervised setting, such as the Tag N'Train (TNT) algorithm [23], which obtained one of the best results in the LHC Olympics 2020 challenge.

Figure 2: Normalized histograms of the invariant mass. The blue histogram shows the spectrum before applying any selection on the anomaly score. The orange and green histograms show the spectra after selection at the 50th and 85th percentiles of the anomaly score distribution, respectively. The data used to obtain this figure is described in Section 3.

Figure 3: Diagram representing the analysis flow applied for the LHC Olympics 2020 challenge.

Figure 4: Results obtained with the RnD data of the LHC Olympics 2020 challenge showing the separation of background and signal: (a) anomaly scores for background and signal events; (b) ROC curves obtained from the test set of the RnD data. The labels signal 1 (orange) and signal 2 (green) correspond to 2-prong and 3-prong jet substructure, respectively.

Figure 5: Results obtained with the RnD data of the LHC Olympics 2020 challenge showing the capacity to mitigate the mass sculpting: (a) di-jet invariant mass of background events before (blue) and after selection at the 50th (orange) and 85th (green) percentiles of the anomaly score distribution; (b) Jensen-Shannon divergence between the invariant mass distributions before and after selection for different thresholds.

Figure 6: Jensen-Shannon divergence obtained using the mass sculpting mitigation techniques (blue) and without using them (orange).


Figure 7: Results obtained with the data of Black-Box 1 of the LHC Olympics 2020 challenge after applying the complete analysis chain. (a) In the top panel, histograms of the di-jet invariant mass after selection (blue) and reference background (solid red); in the bottom panel, local significance per bin of the invariant mass histograms. The vertical dashed lines represent the interval selected by the BumpHunter algorithm. (b) Distribution of the test statistic obtained for background-only pseudo-data (blue histogram), together with the value obtained for the observed data (dashed line).

Figure 8: Histograms showing the true background (blue) and signal (orange) distributions for Black-Box 1, after selection at the 99th percentile of the anomaly score distribution. The reference background histogram used for BumpHunter is shown in green and the selected interval is represented by the vertical dashed lines.

Table 1: Summary of the variables computed in the preprocessing of the LHC Olympics 2020 data, for each large jet, except for the last variable which is defined for pairs of jets. They include the variables τ1, τ2, τ3, τ21, τ31, the energy rings E_ring,1, E_ring,2, ..., E_ring,10, and the di-jet invariant mass mjj.

Table 2: Hyperparameters of the GAN-AE algorithm and their values. A pre-training of the auto-encoder is performed without the adversary, using only the reconstruction error, before the main training loop.