Simulation-based Anomaly Detection for Multileptons at the LHC

Decays of Higgs boson-like particles into multileptons is a well-motivated process for investigating physics beyond the Standard Model (SM). A unique feature of this final state is the precision with which the SM is known. As a result, simulations are used directly to estimate the background. Current searches consider specific models and typically focus on those with a single free parameter to simplify the analysis and interpretation. In this paper, we explore recent proposals for signal model agnostic searches using machine learning in the multilepton final state. These tools can be used to simultaneously search for many models, some of which have no dedicated search at the Large Hadron Collider. We find that the machine learning methods offer broad coverage across parameter space beyond where current searches are sensitive, with a necessary loss of performance compared to dedicated searches by only about one order of magnitude.


Introduction
Many models of new physics contain multiple free parameters that are not predicated by the theory. Typically, searches pick a one-or two-dimensional slice through this space to set limits on model parameters (see Ref. [1][2][3][4][5][6][7]). This is an effective way to visualize results, but it can also hide potential anomalous regions of the parameter space that are not well-captured by particular slices. In the usual approach, a small number of signal regions (SRs) are constructed to achieve broad sensitivity over the low-dimensional slices through parameter space. These SRs are often optimized using benchmark models. In order to improve sensitivity, SR definitions are sometimes parameterized by a subset of the BSM parameters. Scanning through the parameter space in this way will improve the sensitivity to particular models away from the benchmark models at the cost of analysis complexity and a larger trials factor.
These weakly supervised methods are distinguished by the dataset used to estimate the background. If the background is well-understood theoretically, then the reference sample could be simulation [25,28,62,84]. This has the advantage that the background prediction does not need to be learned, but has the disadvantage of being strongly background-model dependent. There are few final states at the LHC for which the background is known precisely enough to be used directly for background estimation. One exception is the final state with four charged leptons. Both ATLAS [85][86][87][88] and CMS [89][90][91][92] directly use Monte Carlo (MC) simulations to estimate the background and ATLAS even uses machine learning to isolate particular signals [85]. While powerful, this approach is signal modelspecific and does not readily extend to models with multidimensional parameters. In this paper, we explore AD methods applied to the four lepton final state. In particular, we train classifiers to distinguish (simulated) data from background predictions. This approach is complementary to direct searches and has broad sensitivity in a signal model-agnostic approach.
This paper is organized as follows. Section 2 introduces the simulated samples used for the machine learning studies. The machine learning methods are introduced in Sec. 3 and numerical results are presented in Sec. 4. The paper ends with conclusions and outlook in Sec. 5.

Simulations
The generation of background and signal events is performed with MadGraph5 aMC@NLO 2.8.0 [93]. The signal is generated using the Higgs Effective Field Theory (heft) via p p > h with a variable Higgs mass and the background is generated also using heft via p p > e+ e-mu+ mu-. In principle, the same analysis could be applied in the 4-electron and 4-muon final states, but additional complication arises from combinatorical factors. We leave the exploration of such final states to future work. Simulated events are passed to decay (signal only), parton showering, and hadronization using Pythia 8.244 [94][95][96] with its default settings. The detector simulation is parameterized with Delphes 3.4.2 [97][98][99] using the default CMS card. In what follows, we will assume that the background is known without any systematic uncertainties and so the 'data' will be an independent, but statistically identical copy of the SM background. See Ref. [84] for a discussion of how this could be extended to include systematic uncertainties. The number of background events is chosen to match 1 the LHC Run 2 dataset of about 150 fb −1 .
Focusing only on the leptons 2 , each event is characterized by 12 numbers (four threemomenta). For many models of the form pp → A → B(→ e + e − )C(→ µ + µ − ), the three masses m e + e − µ + µ − , m e + e − , m µ + µ − are nearly sufficient statistics for characterizing the new physics. In this paper, we consider signals of this form, where A is a variable-mass Higgs boson that decays to two different mass scalars. For this reason, we focus on the three-1 In principle, the simulation could have a larger number of events than in 'data'. Studies in Ref. [31] suggest this could be effective in the case of surrogate models (not simulation). We leave a detailed study of the relative size of the simulation dataset in to future studies. 2 Other event properties could also be useful for discrimination. However, information about the hadronic final state is known with less precision and thus may introduce the need to involve data-driven background estimation.
dimensional problem in this paper. Non-resonant signals and signals with non-trivial spin structures could benefit from using more of the phase space. The spectra of the three invariant masses for the background and three representative signals are presented in Fig. 1. As expected, the di-electron and di-muon invariant masses peak near the Z boson mass of 90 GeV [100] and there are peaks in the four-lepton invariant mass at the Z peak and the Higgs boson mass of about 125 GeV [100]. Each of the signals is resonant in all three observables with peaks at the masses of the particles. In particular, our three signal models have parent masses of 125, 150, and 250 GeV, respectively. In each case, the parent particle decays to two children, with masses of 25 and 15 GeV for electrons and muons, respectively. These parameters are chosen to be representative; the resulting less-than-supervised analysis does not rely on them in the machine learning training.

Methods
Our machine learning approach is simple, but powerful: we train a classifier to distinguish SM background from 'data'. This 'data' is SM background with some amount of injected signal. If the classifier is able to significantly differentiate these samples, then there is evidence that the 'data' contains BSM events. The significance is determined via bootstrapping [101]. In particular, we create N bootstrap datasets by sampling (with replacement) from the SM background. We then train N classifiers, where each one distinguishes the bootstrap 'data' from the SM background. We compare the classifier performance in the nominal case with the distribution from the bootstraps to compute p-values.
There is no unique choice for which statistic to use to evaluate the performance. Theoretically, the optimal statistics in the absence of systematic uncertainties are those that are monotonically related to the likelihood ratio 3 [102]. In a sense, we are performing a multidimensional, unbinned, and nearly non-parameteric 4 goodness of fit. It may be possible to exploit asymptotic formulae to avoid bootstrapping with additional assumptions [25], but we focus on the numerical approach because it is more general. See Ref. [62] for a comparison of bootstrapping to other approaches.
The classifiers are parameterized as fully connected deep neural networks 5 . These networks are implemented in TensorFlow • Binary Cross Entropy Loss (BCE). This is the most widely-used loss function in machine learning: where the last layer of the neural network is a sigmoid so that 0 ≤ f ≤ 1.
• Maximum Likelihood Classifier Loss [25,106] (MLC). This loss function directly learns the (log) likelihood ratio, which results in useful asymptotic statistical properties: where the last layer of the neural network is now linear so −∞ < f < ∞. 3 See also Appendix A in Ref. [30]. 4 We will use neural networks for the machine learning; these are by definition parameterized functions, but there are so many parameters that they are effectively non-parametric. 5 Given the low-dimensionality of our demonstration, a shallower classifier such as a Boosted Decision Tree would likely also be effective. However, neural network training is straight forward and naturally extends to higher dimensions.
All of the inputs to the neural networks are standardized to have zero mean and unit variance. In order to improve performance, we train 10 neural networks with different random initialization and take the average value per event. The networks are each trained for a fixed 20 epochs. None of the network or training parameters were highly optimized. In addition to comparing BCE and MLC, we also explore multiple metrics for performing the hypothesis test. In particular, we consider the average loss, the average score, the maximum score, and the standard deviation over scores. The average loss is the quantity that is optimized. By averaging over the full phase space, it could be that some information about localized anomalies is lost. This is the motivation for examining also the maximum score and the standard deviation across scores. These quantities are more sensitive to localized deviations, but are also more sensitive to background fluctuations. See also Ref. [62] for studies with alternative metrics, such as area under the receiver operating characteristic (ROC) curve.

Results
The values of the four statistics described in the previous section as a function of the number of injected signal events are presented in Fig. 2 and 3 for the MLC loss and Fig. 4 and 5 for the BCE loss. As expected, the loss decreases and the the mean/max/standard deviation of the scores increases with more signal events injected. As the number of injected signal events goes to zero, the standard deviation over the scores goes to zero. In the MLC case, the neural network approximates the log likelihood ratio and so the average score approaches 0 as the number of injected signal events goes to zero. The MLC loss is approximately minus two times a χ 2 random variable [25,106]. In the BCE case, the neural network approximates p('data'|m e + e − , m µ + µ − , m 4 ), which approaches a δ-function centered at 0.5 as the number of signal events goes to zero. The BCE loss itself approaches log(2) ≈ 0.69.
The green (yellow) band corresponds to the 1σ (2σ) region of the background-only hypothesis. The bands for each signal correspond to the standard deviation across 10 independent signal injections. We show zoomed-out ( Fig. 2 and 4) and zoomed-in (Fig. 3 and 5) versions to emphasize where the bands cross the 2σ line, which approximately corresponds to the 95% exclusion limit. For both loss functions, we find that the standard deviations over the scores is very effective. The loss in the case of the MLC is more useful than the BCE loss and has comparable performance to the score standard deviation. This may be expected, since the MLC loss asymptotically is monotonically related to the likelihood ratio of the two samples (and thus optimal). The max score is must more useful for BCE than for MLC.
The actual limits are comparable for both losses. To put these limits in context, the lmits in Ref. [88] (CMS is similar) for h → XX has a limit of about 0.04 fb in the e + e − µ + µ − channel (nearly independent of m X ). This limit is about 10 times better than what we find   which is the price we pay for being model agnostic 6 . On the other hand, the ATLAS/CMS searches have no sensitivity outside of the fixed m X , m h mass windows, while we have broad sensitivity that is nearly independent of all masses.     MadGraph+Pythia+Delphes pp e + e + Figure 5. BCE Loss. Same as Fig. 4, but zoomed in by a factor of 10 on the horizontal axis. The 95% exclusion limits approximately correspond to where the lines for each signal cross the yellow 2σ band for the background-only hypothesis.

Conclusions
In this paper, we have explored the use of less-than-supervised anomaly detection for the four-lepton final state at the LHC. The motivation of this approach is that it is broadly sensitive to BSM models without specifying model parameters and is thus complementary to model-specific analyses. The core methodology we use is not new. The idea of performing AD by comparing data with a reference sample was first introduced in Ref. [26,27] and studied in the context of a simulated reference sample in Ref. [25,28,84]. Our contribution is to note that the four-lepton final state is special in that it is nearly unique at the LHC as a case where simulations are used directly to estimate Standard Model backgrounds. This allows for the data-versus-simulation strategy and we have explicitly demonstrated how the method works using multiple signal models. We have extended previous results by exploring different statistics that can be used to characterize the effective goodness of fit from the machine learning classifiers. While we have focused on resonant anomalies, we note that the methodology studied here will also work in the case of non-resonant modifications to the four-lepton final state. Many parts of the parameter space are statistics limited, but a full investigation of systematic uncertainties will be required for a complete experimental implementation in the future. A variety of uncertainty-and inference-aware methods have been proposed that can be used in this case [84,[107][108][109][110][111][112][113][114].
With no significant evidence for physics beyond the Standard Model at the LHC so far, it is critical to extend our analysis methodologies to include more model agnostic approaches. It is likely that many techniques will be required to achieve broad sensitivity. We have explored a simulation-driven method in this paper for the four-lepton final state because of the existence of precise background models. Combining simulation-based AD with data-driven background estimation strategies may be able extend the scope of similar methods to a wide variety of final states at the LHC and beyond.