Learning Multivariate New Physics

We discuss a method that employs a multilayer perceptron to detect deviations from a reference model in large multivariate datasets. Our data analysis strategy does not rely on any prior assumption on the nature of the deviation. It is designed to be sensitive to small discrepancies that arise in datasets dominated by the reference model. The main conceptual building blocks were introduced in Ref. [1]. Here we make decisive progress in the algorithm implementation and we demonstrate its applicability to problems in high energy physics. We show that the method is sensitive to putative new physics signals in di-muon final states at the LHC. We also compare our performances on toy problems with the ones of alternative methods proposed in the literature.


Introduction
In the study of the fundamental laws of Nature we face a number of open questions. In the past decades the field of particle physics has produced a set of potential answers that seemed inevitable in their simplicity. The experimental effort inspired by these solutions is now mature and is slowly stripping them of their initial theoretical appeal. As more and more data are collected, the problems that confront us become sharper and harder to solve. We know that the theories that well describe current data are incomplete and should be extended, but our prior beliefs on how the extension should look like and on where to discover it experimentally become less concordant every day. In this paper we show how to interrogate experimental data in a new way, going beyond searches targeted at one specific theoretical model.
We consider the problem of having large multivariate datasets that are seemingly well described by a reference model. Departures from the reference model can be statistically significant, but are caused only by a very small fraction of events. The significance of the discrepancy might stem from the extreme rarity of the discrepant events in the reference model and in this case standard anomaly detection techniques might be employed. Or the discrepancy is due to a small excess (or even a deficit) of events in a region of the space of physical observables that is also populated in the reference model. Our goal is to determine if the experimental dataset does follow the reference model exactly or if it instead contains "small" departures as described above. In the latter case, we also want to know in which region of the space of observables the discrepancy is localized. This problem is relevant to Large Hadron Collider (LHC) datasets that are well described by the Standard Model of particle physics (SM) and CMB datasets that are well described by the standard cosmological model ΛCDM.
Our focus here will be physics cases relevant to the LHC. Many attempts at generalizing traditional new physics searches based on specific models have already been made in this context, developing what are called "model independent" search strategies. As customary in the literature [2][3][4][5][6][7][8][9][10][11][12][13][14][15] by model independent analysis we mean an analysis that does not target a specific new physics model. Typically, analysis techniques of this kind assume that a background model is at hand. This could be obtained from simulation or some simulation-based reweighting of a data control region. 1 Recent papers (e.g., Ref. [16]) started referring to data-driven background estimates as (background) model-independence, to be accompanied by (signal) model-independence for a truly model-independent approach. We do not employ this latter notion of model independence in this study, and we keep the estimate of the background predictions, with the associated uncertainties, as a separated aspect (see below) from the one of model-independence. Notice that model-independent analyses (no matter which notion of model-independence is adopted) do rely on the choice of the specific final state (reconstructed variables and acceptance cuts) to which they are applied. This obviously restricts the sensitivity only to new physics models that contribute to the specific final state, and reintroduces some amount of model-dependence.
Model-independent analyses typically follow the binned histogram technique, in which one selects a set of bins (i.e. search regions) in the space of observables and compares the amount of data observed in each bin with the reference model (i.e., the SM). The main problem with this approach comes from the fact that, as previously emphasized, the data distribution will be identical to the one predicted by the SM in the vast majority of the phase space. The observed countings in almost all bins will thus be in agreement with the SM expectation, but only up to the unavoidable Poisson fluctuations. Poisson fluctuations from non-discrepant bins do contribute to the binned likelihood, and if there are many non-discrepant bins their contribution will overwhelm the contribution the likelihood receives from the few genuinely discrepant bins. This is not an issue in a binned (or unbinned) analysis targeting a specific new physics model, because the bins where no discrepancies are expected do not contribute to the likelihood (which is the ratio between the new physics and the SM Poisson likelihoods). In the model-independent case instead, Poisson fluctuations from non-discrepant bins easily swamp any potential signal of new physics. Typically this can be mitigated only by paying a high price in flexibility [2,17].
In this work we apply a new methodology to the problem, expanding on the ideas presented in Ref. [1]. Our technique leverages the progress that the field of machine learning has experienced in the past few years. In particular we exploit the flexibility of neural networks as multidimensional function approximants [18][19][20][21][22][23][24][25]. Here we show that this idea addresses the challenge presented above for realistic multidimensional datasets and physically motivated putative signals. In particular we consider µ + µ − production at the LHC and we quantify the sensitivity of our method to a resonance Z → µ + µ − and to a non-resonant signal induced by a four-fermion contact interaction.
It should be stressed that the design of the algorithm is purely based on the knowledge of the reference model with the criteria described in Section 2. No optimization was performed based on the putative new physics signals, as appropriate for a model-independent search strategy. We always present the sensitivity of our method in comparison with the "ideal" sensitivity one might obtain with a standard model-dependent search strategy that is instead optimized for the specific model at hand. We will also discuss how the trained neural network can help identifying the physical origin of the observed discrepancy.
At this stage, we assume that the reference dataset provides a perfect representation of the background distribution in real data. In a typical analysis, this is true within the effect of systematic uncertainties, which are controlled by nuisance parameters. The effect of these nuisance parameters can be accounted for in our method, by a straightforward application of the profile likelihood ratio methodology. The practical implementation of this extension of the method will be the topic of a future publication [26]. In this study, we ignore this aspect and concentrate on the more pressing issue of generalizing Ref. [1] to a multivariate problem. This motivates the choice of relatively simple and clean experimental signature: that in fact allows introducing the new method and discussing its strength, knowing that the underlying assumptions (e.g., the possibility of accessing a trustable reference sample with larger statistics than the data sample) will be fulfilled. It is reasonable to expect that extending this method to other final states (e.g., dijet) might imply additional practical problems (e.g., the need of a large reference dataset).
Our results benefit from a crucial methodological advance that we make in this paper compared to Ref. [1]. This consists in an algorithmic procedure to select the regularization parameters of the neural network and the network architecture. We take the regularization parameter to be a hard upper bound (weight clipping) on the magnitude of the weights. While admittedly heuristic (even if based on robust results in statistics), we will see that this procedure uniquely selects the weight clipping and it also gives constraints on the viable neural network architectures 2 .
Machine learning techniques have recently been introduced to solve problems related to the one discussed above [27][28][29][30][31][32][33]. In this paper we also directly compare our sensitivity with that of two related ideas presented in the literature. One has the same goal, but is based on a nearest neighbors estimation of probability distributions [32,33]. The other targets only resonant signals, with the resonant feature occurring in a pre-specified variable, but leverages in a similar way the capability of multilayer perceptrons to identify correlations in multivariate datasets [17]. For the comparison we employ simple toy benchmark examples defined in the corresponding publications. We study these examples with our method and compare our performances with the published results. This is a first step towards an exhaustive comparison of the different proposals (that also include [27][28][29][30][31][32]), which we consider necessary at this stage given the practical difficulties involved in directly evaluating their respective strengths and weaknesses by just reading published work.
The paper is organized as follows. In Section 2, after a brief review of the basic ideas behind our approach (see Ref. [1] for a detailed exposition), we define its detailed implementation. We describe in particular the strategy we adopt to select the neural network architecture and the other hyperparameters. In Section 3 we compare our performances with Refs. [17,32,33] in the context of toy examples. The rest of the paper is devoted to µ + µ − production at the LHC. First, in Section 4, we introduce the new physics signals and the details of our simulated datasets. We also describe the dedicated analyses that we use to estimate the ideal sensitivity. In Section 5 we describe the application of our method and we extensively study its performances. We conclude and outline directions for future work in Section 6.
We first choose a set of variables that describe the data, a range for their values and the integrated luminosity of the dataset. This is the only physics choice that we have to make, which defines the "experiment" we want to analyze. For instance our input space can consist of the momenta of the two leading muons in events with at least two opposite-sign muons within acceptance.
Once we have selected an input space of interest we generate a large reference sample that represents the SM (or, "reference model") prediction. This simulated dataset has much larger statistics than the actual experimental data, we denote with N R the number of events in this sample, while the expected number of events is N (R) N R . We also generate N toy toy datasets that again follow the SM prediction, but have the same statistics as the actual experimental dataset. Namely, they contain a variable number of events, thrown randomly from a Poisson distribution with N (R) expected events. At this point we have prepared the required input for the neural network and we can choose a specific network architecture.
Our neural networks are fully connected, feedforward regressors, trained to learn a likelihood ratio. The training is carried on with a supervised procedure, taking as input the two datasets described above: the large reference dataset that follows the SM (reference) prediction R and a smaller dataset that represents the experimental data D. The training datasets are preprocessed. Input variables allowing negative values, as η, are normalized subtracting their mean and dividing by their standard deviation. The other variables, like p T , are simply divided by their mean. The loss L used to train the network is Here x is an element of the input space (for example 5 numbers describing the two muons p T , rapidity and azimuthal angular difference) and f is the output of the network as a function of the free parameters w (weights and biases) of the network. The values of these parameters after training will be denoted as w in what follows. The neural network defines a composite hypothesis for the distribution (denoted as n(x|w)) of the data, namely where n(x|R) is the distribution (see Table 1 for a summary of our notation) in the reference hypothesis. Our search strategy is constructed as an hypothesis test between the simple hypothesis n(x|R) and the composite (depending on the free parameters w) alternative hypothesis n(x|w).
The loss is constructed [1] to reproduce the maximum log-likelihood ratio (Neyman-Pearson) test statistic t(D) for composite alternative hypothesis [74]. Namely it is such that when N R is much larger than the number of events in D. The minimum of the loss at the end of training (or, more precisely, t(D)) is thus employed as the statistic of our hypothesis test. Notice that after training, the output of the network is an estimate of the log-ratio of the data distribution over the reference distribution: f (x; w) log [n(x| w)/n(x|R)], as a function of the input variables x. It can thus be used, in case of tension between the data and the reference model, to identify the most discrepant regions of the phase space. Armed with the input datasets, the network and loss described above we can analyze the data. The procedure is rather straightforward. The neural network is trained using the reference sample and a data sample collected by the experiment. The loss at the end of training produces a single value t obs for the test statistic. We then train the network again using, instead We also define a corresponding Z score as where Φ −1 is the quantile of a Normal distribution with zero mean and unitary variance, so that Z is conveniently expressed as a number of σ's. The presence of a new physics signal in the experimental dataset would manifest itself as a large value of Z. The discussion above concisely lays out our data analysis strategy (see Ref. [1] for a more complete exposition), to be put in place once the neural network architecture and the other hyperparameters have been selected. We now describe the criteria and the algorithmic procedure by which this selection is made, which constitute the major methodological advance of the present paper.

Hyperparameters Selection
Our goal is to design an effectively model independent search, so the construction of our method must not assume to know anything specific about the signal that we are looking for. Our selection strategy is thus purely based on the reference model (SM) prediction, and relies on two general criteria.
The first criterion that we adopt is flexibility. Namely we would like the neural network to have as many parameters as possible, free to vary in the largest possible range, in order to be sensitive to the largest possible variety of new physics. This has to be balanced against a second criterion, based on the following observation. Our method is mathematically equivalent to the Maximum Likelihood hypothesis test strategy where the set of alternative hypotheses is defined by the neural network. Hence we can rely on the classical results by Wilks and Wald [75,76] (see also [77] for a more recent exposition) according to which the maximum log-likelihood ratio test statistics is distributed in the Asymptotic Limit as a χ 2 with a number of degrees of freedom equal to the number of free parameters in the alternative probability model. From here we conclude that the distribution of our test statistic on reference-model toy datasets (i.e., P (t|R)) approaches in the Asymptotic Limit a χ 2 with a number of degrees of freedom given by the number of parameters of the neural network. Clearly we should not expect this result to hold for a finite dataset. However if it does apply, namely if the distribution does resemble the χ 2 , we can conclude heuristically that the dataset is sufficiently abundant for the network that is being fitted. If instead the distribution violates the asymptotic formula, it means that the test statistic is sensitive to low-statistics regions of the  dataset that are subject to large and uncontrolled fluctuations. We can define this behavior as "overfitting" in our context and restrict ourselves to hyperparameters configurations for which a good compatibility of P (t|R) with the appropriate χ 2 distribution is observed. We will see that combining the two criteria of flexibility and of χ 2 -compatibility dramatically restricts the space of viable options.
In order to illustrate how the optimization strategy works in practice we first need to specify our framework. We restrict ourselves to fully-connected neural networks with logistic sigmoid activation functions in the inner layers. The architecture is characterized by the dimensionality of the input of each of the "L" layers, i.e. by a set of integers a 0 -a 1 -. . .-a L−1 , plus the output dimensionality that is fixed to a L = 1 in our case. So for example a 1-3-1 (L = 2) network acts on a one-dimensional feature space (a 0 = 1), has one inner layer with three neurons (a 1 = 3) and one-dimensional output (a 2 = 1). In this notation the total number of parameters (weights and biases) in the network is N par ( a) = L n=1 a n (a n−1 + 1) .
We regularize the network by imposing an upper bound (Weight Clipping) on the absolute value of each weight. In the following we capitalize Weight Clipping when referring to this specific use of the parameter (i.e., an upper bound on each individual weight). The minimization of the loss function in eq. (2) is performed using ADAM [78] as implemented in Keras [79] (with the TensorFlow [80] backend) with parameters fixed to β 1 = 0.9, β 2 = 0.99, = 10 −7 and initial learning set to 10 −3 . The batch size is always fixed to cover the full training sample. The hyperparameters we want to determine are thus the number of layers and of neurons in each layer (i.e., the architecture of the network), the Weight Clipping parameter and the number of training epochs. The first step of the optimization procedure consists in choosing an initial network architecture. This can be done heuristically by considering the dimension of the input space and the number of events in the datasets of interest. Here we consider for illustration a one-dimensional slice (specifically, the momentum of the leading lepton in the x direction) of the SM di-muon dataset described in Section 4 with a relatively low expected number of data events N (R) = 2000. The number of events in the reference sample is N R = 20000. A small 1-3-1 network is a reasonable starting point in this case. According to the flexibility criterion, the Weight Clipping parameter should be taken as large as possible in order to maximize the expressive power of the network. However if we take it very large training does not converge even after hundreds of thousands of training epochs. This is not acceptable because reaching the absolute minimum of the loss function as in eq. (2) is conceptually essential for our strategy. We observe this behaviour in the upper left corner of Figure 1, where we plot the upper quantiles of P (t|R) as a function of training rounds. The phenomenon is avoided by lowering the Weight Clipping below a certain threshold W max , which we find to be W max 30 in the case at hand as shown in the figure.
The test statistic distribution P (t|R) can now be compared with the χ 2 Npar distribution, with a number of degrees of freedom equal to the number of parameters of the neural network as in eq. (5). We have N par = 10 for the 1-3-1 network. The left panel of Figure 2 displays the evolution with the training rounds of the χ 2 -compatibility, defined as a simple Pearson's χ 2 test statistic on the P (t|R) distribution sampled with 1000 toy experiments. We see that requiring an acceptable level of χ 2 -compatibility further restricts the allowed range for the Weight Clipping parameter. The maximum Weight Clipping for which compatibility is found is 7 in the case at hand. Since the Weight Clipping should be as large as possible to maximize flexibility, this is the value to be selected.
In summary, the strategy we adopt to select the Weight Clipping parameter is the following: 1. Starting from a large Weight Clipping, decrease it until the evolution of the 95% quantiles of P (t|R) achieve a plateau as a function of training epochs.
2. In the range of Weight Clippings below W max where the the plateau is reached, choose the largest Weight Clipping value that gives a good compatibility between P (t|R) and a χ 2 distribution whose degrees of freedom are equal the total number of trainable parameters in the network, as shown in Figure 2.
3. The total number of training epochs should also be fixed. To reduce the computational burden of our procedure this is chosen as the minimum value for which the evolution of the χ 2 -compatibility has reached its plateau.
We should now explore different neural network architectures. In particular we would like to consider more complex architectures than 1-3-1 to increase the expressive power of the network. Complexity can indeed be increased, but not indefinitely as shown in Figure 3 for a 1-10-1 network. A suitable W max can be identified below which the quantiles of P (t|R) converge, but P (t|R) fails to fulfil the χ 2 -compatibility criterion for any choice of the Weight Clipping parameter. The 1-10-1 network should thus be discarded and the optimal (largest) viable network of the 1-N-1 class sits in the range 3 ≤ N < 10. By studying the networks in this range we might uniquely select the architecture and all the other hyperparameters that are suited for the problem at hand. The behaviour described above for the toy one-dimensional dataset has been confirmed in other cases and it is believed to be of general validity. Namely it is generically true that χ 2compatibility places an upper bound on the network complexity, leaving us with a finite set of options to explore. On the other hand we cannot claim that the our selection strategy always singles out a unique hyperparameters configuration. Even in our one-dimensional example one might extend the complexity of the network by adding also new layers, obtaining several viable options with similar number of parameters. Selecting one of these options would require to introduce a strict notion of neural network "complexity", to be maximized. Furthermore one might consider departures from the general neural network framework that we are considering. For instance the weight clipping might be imposed on the norm of the weight vector at each layer rather than on each individual weight, and/or a different weight clipping threshold might be imposed on each layer. Even the choice of logistic sigmoid activations and of fully-connected networks might be reconsidered.
While this aspect should be further studied, it is probably unnecessary to consider this extended space of possibilities. This belief is supported by a number of tests that we performed with different activation functions, training methods and architectures. We find that the performances of our strategy in terms of sensitivity to putative new physics signals depend quite weakly on the detailed implementation of the algorithm. Performances are very similar for all the hyperparameters configurations that are reasonably flexible and obey the χ 2 -compatibility criterion. Even slight departures from compatibility typically do not change the sensitivity appreciably. Establishing this fact for a number of putative new physics signals and for several neural network configurations selected with our criteria would justify the choice of restricting to a single configuration. Or, alternatively, would allow to combine the p-values obtained from different strategies without loosing sensitivity by the look-elsewhere effect.
Before concluding this section it is worth to point out that the compatibility with the χ 2 Npar distribution can be leveraged to compute the p-value without generating a large number of toy experiments This considerably reduces the computational burden of evaluating the global significance. However one should keep in mind that P (t|R) χ 2 Npar (t) is an approximate statement, which we can only test at a statistical level given by the number of toys that we generated. However in cases where generating a sufficient number of toy samples is unfeasible, as for example the high significance models discussed in Section 3, we will be obliged to report estimates of the p-value obtained with this approximation.
We stress that our hyperparameter selection strategy is not based on an a priori assumed new physics model, hence it is not optimal to detect any specific signal. In particular, the network we select with our method might not be expressive enough to be fully sensitive to complex new physics signals. However we will see this is not a problem for the BSM scenarios studied in this paper.

Comparison with Related Work
In the previous section we have introduced all the ingredients needed to implement our data analysis strategy. In this section and in Section 5 we test its performances on a series of examples. This section is devoted to comparisons with other ideas that have related goals.
Machine learning has recently seen a surge of popularity following the latest developments in deep learning and computer vision. A number of works proposing anomaly detection strategies for LHC datasets has appeared in the literature in the past few years [27][28][29][30][31][32][33]. This effort is still relatively recent and the field has not fully matured yet. Ongoing efforts (LHC Olympics [81,82] and DarkMachines [83]) are establishing benchmarks for common comparison.
We take a step towards making the comparison between different strategies more transparent by testing our methodology on some toy examples present in the literature. We consider three examples: two incarnations of a method that has the same goal, but a very different estimation strategy for the test statistic and a third method that has a narrower scope, but a similar technical approach to the problem, based on multilayer perceptrons trained as classifiers.
The first strategy that we compare with is the nearest neighbors approach of Refs. [32,33]. This is a truly model-independent approach 4 that aims at reconstructing the true probability distributions for the data and the reference model, using a nearest neighbors technique [84][85][86]. A comparison to this method is instructive because it allows us to test a completely different approach to the estimation of the likelihood.
We first study the performance of our algorithm on a two-dimensional example, considered in Ref. [33], comprised of events extracted from Normal distributions. The reference model is a two-dimensional Gaussian with mean µ = (1, 1) and covariance matrix Σ = 1 2×2 . The number of expected events in the reference model is N (R) = 20000.
We consider two different putative new physics (NP) models: • NP 1 : the data have mean µ = (1.12, 1.12) and covariance matrix Σ = 1 2×2 . The number of events predicted is the same as in the reference model: N (NP 1 ) = 20000.
The results are shown in Figure 4 (top) for a 2-5-1 network with Weight Clipping 1.2 and 150000 epochs of training. We generated 1000 toy SM samples and 300 data samples distributed according to the new physics hypothesis. The lowest significances that we find, using the χ 2 approximation (6) for the reference model test statistic distribution, are Z χ 2 = 19σ for NP 1 and Z χ 2 = 24σ for NP 2 . The nearest neighbor approach of [33] finds Z = 2.2(3.5)σ for NP 1 (NP 2 ) for 5 nearest neighbors and 1000 permutations used to estimate the test statistic distribution in the reference hypothesis.
An alternative implementation of the nearest neighbors approach was proposed in Ref. [32]. The following two-dimensional problem is considered: • Reference model (R): the data have mean µ = (0, 0) and covariance matrix Σ = 1 2×2 . The number of expected events is N (R) = 10000.  Figure 4 (bottom). We generated 1000 toy SM samples and 300 NP samples. The median significance, obtained with a χ 2 approximation of the test statistic, is 20σ, while Ref. [32] quotes between 5 and 16σ for the nearest neighbors approach depending on the cut on their discriminating variable. We can conclude that both approaches are sensitive to the simple problem at hand. The other idea that we compare with is the bump hunter technique in Ref. [17,87]. This approach does not have the same goal as ours, as it requires prior knowledge of the signal showing   Figure 4: Top: Test statistic distribution for the signal and background models considered in [33], obtained with our analysis technique. The plot displays our sensitivity obtained using the toy samples (Z) and the χ 2 approximation of P (t|R) (Z 2 χ ). The significance quoted in Ref. [33] is 2.2σ for NP 1 and 3.5σ for NP 2 . Bottom: Test statistic distribution for the signal and background models considered in [32], obtained with our analysis technique.  Figure 5: Test statistic distribution for the signal and background models considered in [17,87], obtained with our analysis technique. The figure refers to the 3D toy example discussed in v1 and v2 of [17,87].
up as a peak in a pre-specified variable. It is further assumed that the background distribution of the other variables used in the analysis is the same in the peak and in the sideband regions.
Clearly we have a price to pay in sensitivity for signals that satisfy these assumptions, since we discard this knowledge. On the other hand the approach in [17,87] is much less effective on (or blind to) signals that do not satisfy them. Given these differences, it is instructive to check what is exactly the price that we are paying on resonant signals compared to this refined bump hunter.
We test our strategy on a three-dimensional toy example (see [17,87]) defined by: • Reference model: the three variables m, x and y are uniformly distributed in the ranges |m| < 2, |x| < 0.5 and |y| < 0.5. The number of expected events is N (R) = 10000. Our results are shown in Figure 5 for a 3-5-1 network with Weight Clipping 3.4 and 150000 epochs of training. As in the previous examples, we generated 1000 toy SM samples and 300 NP samples. We obtain a median significance of 8.1σ, to be compared with the 10.8σ claimed in Ref. [17,87] for the optimal choice of the neural network discriminant threshold. In the comparison it should be taken into account that 10.8σ is a local significance, based on prior knowledge of the peak position and width. The degradation due to the need of scanning over the peak position and width (inherent of the bump hunter approach) and over the neural network threshold (specific of this strategy, see Ref. [87]) should be quantified for a better comparison with our 8.1σ significance, which is instead global. However such a refined comparison is unnecessary in this benchmark example because no quantitative meaning should be attached to the asymptotic estimates of such high levels of significance. We can only conclude that our method is sensitive to this toy problem in spite of not being optimized for (and hence limited to) the detection of resonant signals.

Benchmark Examples
In the previous sections we have introduced our methodology and compared our data analysis strategy with two alternative ideas present in the literature. The comparisons involved toy examples that can not be directly mapped on cases of physical interest.
The natural next step is to study the performances of our strategy on more realistic datasets and new physics examples. We choose to study LHC di-muon production and to consider two well-known new physics scenarios. In this section we describe the signal and background samples used for the analysis. The results of our study are presented in the next section. We consider two distinct possibilities for how new physics can manifest itself a resonant signal, represented by a Z decaying to µ + µ − , and a smooth signal given by a contact interaction that we call "EFT". The samples used to study our performances are: SM di-muon The reference sample and the SM toy data are composed of SM Drell-Yan events: pp → µ + µ − . All events were generated with MadGraph5 [88], showered with Pythia6 [89], simulating proton-proton collision at √ s = 13 TeV with an average of 20 overlapping collisions per bunch crossing. The events were further processed with Delphes 3 [90]. We use the default CMS detector card. We run the Delphes particle-flow algorithm, which combines the information from different detector components to derive a list of reconstructed particles. Z to di-muon We study a new vector boson with the same couplings to SM fermions as the SM Z boson. We generate events for three different masses: m Z = 200, 300 and 600 GeV. The signal manifest itself as a narrow resonance at the LHC: Γ Z Γ Z m Z /m Z . The events are generated using the same software and detector cards as the reference model events described above. The number of events in the data sample and the signal to background ratio N (S)/N (R) are varied to study the performances of the algorithm and are discussed in Section 5. The input variables distribution for three representative signal points are shown in Figure 6 and Figure 7.
EFT We consider a non-resonant BSM effect due to the presence of a dimension-6 4-fermion contact interaction (see e.g. [91]) where J L µ a is the SU(2) L SM current and Λ is conventionally set to 1 TeV. We generate di-muon events with the same tools described above (supplemented with a MadGraph5 model for the EFT operator obtained by FeyRules  In Section 5 we will extensively study the sensitivity of our method to the BSM scenarios described above. For a meaningful presentation of the results, and in order to compare the performances on different scenarios, we need an absolute measure of how much a given BSM hypothesis is "easy" to detect with a given integrated luminosity. As in Ref. [1], this measure is introduced by the notion of "ideal significance", described below.

The ideal significance
The ideal significance is the highest possible median Z-score (Z id ) that any search specifically targeted to a given BSM scenario in a given experiment could ever obtain. By the Neyman- Table 2: Mass-window, signal and background fractions for the estimate of Z id in eq. (9).
Pearson lemma, it is obtained using the "ideal" test statistic The ideal significance can be reached only in a fully model-dependent search where the exact knowledge of both the new physics distribution n(x|NP) (see Table 1) and the reference distribution n(x|R) are available. This knowledge is available, in principle, for the BSM scenarios under examination. However computing n(x|NP)/n(x|R) is cumbersome. An estimate for it sufficiently accurate to serve as a reference of performance in the present paper (hence quoted as Z ref ) is readily obtained as follows.
Z to di-muon The signal shows up as a resonant peak in the di-muon invariant mass m ll around the Z mass m Z . A simple cut-and-count strategy in a suitably designed interval m ll ∈ [m min , m max ] around m Z should provide a reasonable estimate of the ideal reach. The ideal significance is thus estimated as In the equation, CDF[P b ] denotes the cumulative of the Poisson distribution with mean "b" while f sig and f bkg are respectively the signal and background fractions in the mass-window The signal fraction is estimated with a Monte Carlo sample consisting of 16000 signal-only events. The background is computed by fitting a Landau distribution to the tail of a SM sample with 1.6 million events. The boundaries of the mass-window [m min , m max ] are selected by optimizing the significance and reported in Table 2 together with the corresponding signal and background fractions. In order to validate eq. (9) as a reasonable estimate of Z id we compared it with the truly "ideal" significance obtained with the Neyman-Pearson test performed on the m ll variable. We fitted the background Monte Carlo data using two Landau distributions (one for 250 GeV ≤ m ll < 600 GeV, the other for m ll ≥ 600 GeV) and a Normal distribution for the Z peak. This allowed us to compute the test statistic in eq. (8) and in turn the ideal significance by toy experiments. Good agreement with eq. (9) was found. Notice however that the comparison was possible only in configurations with low enough Z ref . For cases with Z ref 4, which we do consider in Section 5, validation is unfeasible and we exclusively rely on eq. (9). (1.85 ± 0.01)10 −5 (0.5%) (2.30 ± 0.01)10 −5 (0.4%) (6 ± 1)10 −6 (17%) (2.2 ± 0.6)10 −6 (27%) (1.34 ± 0.09)10 −6 (1.5%) γ (1.432 ± 0.005)10 −5 (0.3%) (2.90 ± 0.01)10 −5 (0.3%) (1.6 ± 0.2)10 −5 (13%) (1.09 ± 0.05)10 −5 (4.6%) (6.769 ± 0.001)10 −6 (0.01%) Table 3: The coefficients of the polynomial fit in eq. (10).
EFT Also in this case, the di-muon invariant mass is the most relevant discriminant. Since the excess is spread over the entire spectrum, the ideal significance is estimated through a Likelihood Ratio (Neyman-Pearson) test on the binned m ll distribution. The number of expected events in each bin is quadratic in c W with coefficients determined from Monte Carlo simulations at varying c W , reported in Table 3.
The test statistic is the log-ratio for the Poisson distributed observed countings "o i " in each bin and the reference significance is evaluated from the distribution of t in the SM (c W = 0) extracted from toy experiments.

Results on Benchmark Examples
In this section we study the sensitivity of our data analysis strategy to the physics examples discussed in the previous section. Our main results are: 1.
In the examples we studied, in all cases where the estimate of the ideal significance exceeds 5σ, the probability of finding a 2σ tension for the SM using our approach is p(α = 2σ) 20% and grows to p(α = 2σ) 40% if we exclude the Z-boson peak from the input data by a cut on the invariant mass. The probability of finding a 3σ tension is p(α = 3σ) 7% and p(α = 3σ) 20% including or excluding the Z-peak, respectively. [ Figure 8] 2. For any given "experiment" (i.e., at fixed luminosity and input space), the observed significance mostly depends on the ideal significance of the putative signal, while it weakly depends on the type of signal. [ Figure 9] Properties "1" and "2" make our technique ideally suited to identify an unexpected new physics signal. Because of "3", if a tension is observed in the data the sensitivity to the signal can be increased with a dedicated analysis on new data, selected using the likelihood ratio learned by the network. As stated in "2" above, Z obs essentially depends only on the ideal significance (approximated by Z ref ) for a given experiment. However in a different experiment (e.g., if we change the luminosity) the relation between Z obs and Z ref changes. The relation becomes more favorable at high luminosity because of point "4".
Let us now turn to an extensive description of the items above, and of our findings on a few technical points relevant to the implementation of the algorithm. For all the results in this paper the minimization of the loss function is performed using ADAM [78] as implemented in Keras [79] (with the TensorFlow [80] backend) with parameters fixed to: β 1 = 0.9, β 2 = 0.99, = 10 −7 , initial learning rate = 10 −3 . The batch size is always fixed to cover the full training sample. Network architecture, size of the weight clipping and number of training rounds were selected following the procedure described in Section 2. Where not specified otherwise, the results were obtained with a 5-5-5-5-1 network and 3 × 10 5 training rounds, using 100 data samples and 100 toy reference samples. The median observed significance plotted in the Figures and its 68% C.L. error were obtained approximating P (t|R) with a χ 2 distribution with as many degrees of freedom as free parameters in the network as discussed in Section 2. We always consider a five-dimensional input space composed of the p T 's and η's of the two leptons and their ∆φ. The range of the input variables and their distribution for three representative signal points are shown in Figure 6 and Figure 7.
Sensitivity The first goal of our study is to show that our technique is sensitive to realistic signals. By realistic we mean having N (S)/N (R) 1, i.e. a small number of signal events compared to the total size of the sample, and ideal significances of order a few σ's. These choices reproduce signals that we might have missed at the LHC so far, if not targeted by a dedicated search. The best way to illustrate the performances of a model-independent strategy is to report the probability it has to identify a tension with respect to the SM if a putative new physics effect is present in the data. This measures the chances that the analysis has to : Sensitivity (Z obs ) to Z → µ + µ − for m Z = 300 GeV and the EFT signal. We show the sensitivity as a function of the reference significance Z ref .
produce an interesting result. In the left panel of Figure 8 we show the probability of finding evidence for new physics at the α = 2σ, 3σ and 5σ levels given a reference significance for the signal. We consider for illustration the Z signal model with m Z = 300 GeV described in the previous section, but similar or better performances are obtained for other masses and for the case of the EFT. The two plots here presented are obtained by fixing the luminosity while the ratio N (S)/N (R) is varied.
On the left panel of the figure, and in the results that follow if not specified otherwise, we applied our algorithm to the entire dataset which includes the SM Z-boson peak. This choice was made in order to challenge our analysis strategy in a situation where the dataset is dominated by the peak, where no new physics effect is present. On the other hand the peak would be excluded in a realistic application of our method to the di-muon final state because it is hard to imagine new physics appearing on the Z peak not excluded by LEP and because detailed analyses of the Z resonant production could be performed separately. If we exclude the Z-peak from the input data, with a cut m ll > 95 GeV (whose efficiency is 10%), the performances of our analysis improve as shown on the right panel of Figure 8.
Another way to quantify the sensitivity is to report the median significance obtained for different new physics scenarios, still as a function of the ideal significance. The result is shown in Figure 9, with the error bars representing the 68% C.L. spread of the observed significance distribution. The study was performed for a given experimental setup, namely by fixing N (R) = 2 × 10 4 (and N R = 5 N (R)), and varying the signal fraction or the EFT Wilson coefficient c W as shown in the legend. We observe, similarly to Ref. [1], a good level of correlation between our sensitivity and the ideal one and a weak dependence on the nature of the new physics. This correlation was sharper in the examples studied in Ref. [1], however it should be taken into account that the present study relies on approximate (see Section 4) estimates of Z id and that high values of Z obs are also approximate, being estimated with the Asymptotic χ 2 formula (see Section 2.1).
Likelihood Learning It is instructive to study directly what the network has learned during training. The network should learn approximately the log-ratio between the true distribution (n(x|T), see Table 1) of the data and the reference model distribution n(x|R). We should thus be able to get information on the nature of the discrepancy by inspecting the likelihood ratio  Figure 10: Comparison between the ideal invariant mass distribution for the Z and EFT signals and the distribution reconstructed by the network and realized in the toy sample taken as input. The probability distribution of the data sample is normalized to the reference one.
learned by the network as a function of the physical observables chosen as input or any of their combinations. In the case of a Z signal, for instance, we would like to see a bump in the invariant mass distribution as learned by the network.
In Figure 10 we plot the distribution ratio learned by the network as a function of the invariant mass of the dimuon system. In the Figure we also show the true likelihood ratio used for the generation of the events and its estimate based on the specific data sample used for training. The signals are the Z with a 300 GeV mass with N (S)/N (R) = 2 × 10 −3 , N (R) = 2 × 10 4 and N R = 10 5 and an EFT signal with the same N (R) and N R and c W = 10 −6 . Notice that m ll is not given to the network, the input variables being the muon p T 's, rapidities and ∆φ.
The ratios in the figure were obtained in the following way. The yellow "ideal" likelihoodratio was obtained by binning the invariant mass of a large data sample, containing one million events, and of the reference sample and taking the ratio. The red likelihood-ratio pertaining to a specific toy was obtained in the same way, replacing the large data sample with the relevant toy. Finally, the ratio as learned by the network was obtained by reweighting reference sample by e f (x, w) , where f is the neural network output after training, binning it and taking the ratio with the reference.
The network is doing a pretty good job in reproducing a peak or a smooth growth (for the Z and the EFT, respectively) in the invariant mass. Therefore if one had access to a new independent data set, distributed like the one used for training (i.e., following n(x|T)), one could employ the neural network f (x, w) (trained on the first dataset) as discriminant (for instance, by a simple lower cut), and boost the significance of the observed tension.
In the studies presented so far we have chosen as input to the network five independent kinematic variables that characterize the di-muon final state under examination, paying attention not to include the invariant mass m ll which is essentially the only relevant discriminant in the new physics scenarios under investigation. This choice was intended to maximize the difficulty of the network task, reproducing the realistic situation where, since the actual signal is unknown, the most discriminant variable cannot be identified and given to the network. However it is interesting to study the potential improvement of the performances that could be achieved with a judicious (but model-dependent) choice of the input variables. The first test we made was to present m ll to the network in addition to the five variables p T 1,2 , η 1,2 and ∆φ. This led to no substantial improvement of the performances suggesting that the neural network is already learning to reconstruct m ll sufficiently well from the five variables and does not need the sixth one. The second test was to trade the variable ∆φ for m ll , considering an alternative fivedimensional parametrization of the phase-space. Notice that ∆φ has no discriminating power whatsoever because the new physics scenarios under examination emerge in 2 → 2 scattering processes where the muons are back-to-back in the transverse plane up to showering and detector effects, as it is the case for the SM. The ∆φ distribution is thus (see Figure 7) strongly peaked at π and identical in the SM and in BSM. Replacing it with m ll , which is instead the most discriminant one, is thus the strongest test we can make of the robustness of our approach against change of input space parametrization. For the m Z = 300 GeV signal with N (S)/N (R) = 10 −3 and N (R) = 2 × 10 4 , whose significance was Z obs = (0.9 +1.3 −0.9 )σ, replacing ∆φ with m ll increases the observed significance to Z obs = (2.3 +1.4 −1.1 )σ.
Luminosity and Signal Fraction. In the left panel of Figure 11 we show our performances for the Z model with m Z = 300 GeV as a function of N (R), i.e. as a function of the integrated luminosity "L" of the dataset. The observed significance shown in the plot is the median over 100 data samples with its 68% C.L. error. The signal fraction is fixed to N (S)/N (R) = 10 −3 , the size of the reference sample is N R = 5N (R) and we increase N (R) from 10 4 to 10 5 . Interestingly, in the regime where these tests are performed, the observed significance increases linearly with the luminosity Z obs ∼ L, as opposed to the √ L growth of the reference significance. This can be explained by the fact that our analysis technique benefits from having enough statistics in the data to accurately reproduce the likelihood ratio. So increasing L does not only make the signal more abundant and easier to see as in standard model-dependent analyses, but it also helps the learning process to reconstruct the most powerful (likelihood ratio) discriminant to detect it. Note however that at some point this behaviour must change and match the usual √ L scaling; this is expected to happen for very large signals corresponding to very large Z ref , somewhat beyond the regimes typically relevant for new physics searches.
Increasing the signal fraction N (S)/N (R) at fixed luminosity has the only benefit of increasing the ideal significance and its estimate. So both Z obs and Z ref increase linearly with the signal fraction as show in the right panel of Figure 11. This study was performed on the m Z = 300 GeV sample with N (R) = 2 × 10 4 , N R = 10 5 as for the study of the luminosity in the same figure. (In)Sensitivity to data selection. As discussed in the Introduction, traditional modelindependent strategies based on countings in bins suffer from the presence of regions in the phase space that are insensitive to new physics, because of uncorrelated Poisson fluctuation in the corresponding bins. In our approach this effect is greatly reduced, because the smoothness of the neural network protects it from following the bin-by-bin statistical fluctuations [1]. One particular implication of this fact is that we expect our sensitivity to depend weakly on the presence or on the absence of selection cuts that eliminate signal-free regions of the phase space. This is illustrated by studying the dependence of the observed sensitivity on: 1) a cut on the p T of the leading muon and 2) a cut on the invariant mass of the di-muon system. The 300 GeV Z model, with N R = 10 5 and N (S)/N (R) = 1 × 10 −4 (before selection) is considered for this investigation.
We find that the p T cut does not alter our sensitivity. For instance the median Z obs remains at 1σ after a p T > 75 GeV selection, in spite of the fact that the cut rejects 96% of the background and only 5% of the signal. The selection on the invariant mass instead slightly increases our sensitivity. For example m ll > 95 GeV (that rejects 90% of the background and nothing of the signal) increases the median significance to Z obs = 1.4σ. 5 We have observed this phenomenon already in Figures 8 and 9.
Reference Size and Optimal Weight Clipping An accurate knowledge of known processes in the phase space of interest is crucial for the success of any new physics search. Therefore the size N R of the Reference Sample should be taken as large as possible, compatibly with the computational price for training. To give an idea of the needed reference sample size, we study the performances of our method as a function of N R /N (R). The result on the left panel of Figure 12 is reassuring: the sensitivity is very stable as a function of this ratio up to N R /N (R) ≈ 1. Below this value the statistical error on the reference sample in the signal region becomes sizable. If we define ε ≡ 1/ N R (278 ≤ m ≤ 322), i.e. counting only events in the invariant mass window populated by the signal (see Section 4) then the first point in the left panel of Figure 12, where Z is degraded, corresponds to ε ∼ 1/2. However this holds for a specific signal (m Z = 300 GeV N (S)/N (R) = 2 × 10 −3 and N (R) = 2 × 10 4 ), in general we expect that a degradation of the performances might be observed if N R is not well above N (R), because of the result shown on the right panel of Figure 12. The plot shows the evolution with N R /N (R) of the Weight Clipping parameter, selected with the criteria of Section 2. The Weight Clipping becomes stable for N R /N (R) 10, but it abruptly drops for smaller values of this ratio. Small Weight Clipping reduces the flexibility of the neural network, which is thus less suited to identify complex new physics signals. Employing Reference samples with N R /N (R) 10, slightly above the benchmark N R /N (R) = 5 we employed here, is thus recommended.

Conclusions and Outlook
We have discussed a new physics search strategy that is "model-independent" (i.e., not targeted to a given new physics model), with the alternative hypothesis needed for hypothesis testing being provided by a neural network. This approach was proposed in Ref. [1]. In this paper we made progress on its implementation and on the study of its performances.
The main methodological advance, described in Section 2.1, consists of a strategy to select the hyperparameters associated with the neural network and its training, prior to the experiment, and without relying on assumptions on the nature of the putative new physics signal. It is crucial to identify one such strategy in order to avoid the look-elsewhere effect from the ambiguities in the choice of the hyperparameters. The one we propose is heuristic, but convincing, and reduces the ensemble of hyperparameters choices to a manageable level. Progress might come on this aspect from a more sharp notion of neural network "flexibility" (or Capacity). Notice however that the concrete impact of the hyperparameters on the sensitivity to new physics signal has been observed to be extremely limited in all the examples we studied. Even if no systematic study has been performed, this suggests that residual ambiguities in the hyperparameters selection could be ignored.
It is not easy to quantify the performances of a model-independent search strategy. The assessment unavoidably relies on the selection of putative new physics models that are potentially present in the data, which we can try to make as broad and varied as possible. Once this choice is made, one way to proceed is to compare the sensitivity to other model-independent strategies. This is what we did in Section 3, finding that our approach compares favorably to other ideas recently proposed in the literature. This comparison is however highly incomplete because it is based on a few toy problems, which are not representative of realistic LHC datasets and where new physics is extremely easy to see with our method. This is a second direction in which further work is needed.
We also need to quantify the performances in absolute terms. To this end, the most indicative quantity is arguably the probability to observe a tension with the SM if the data follow the new physics distribution. That is, the probability for our analysis to produce an interesting result. This is shown in Figure 8 for different levels of observed tension and as a function of the ideal median significance of the putative new physics signal. The latter quantity, defined in Ref. [1] and in Section 4, serves as an objective measure of how "easy-to-detect" the new physics scenario is. Notice that the ideal significance is not the target of our method. The ideal significance can be reached, because of the Neyman-Pearson lemma, only in a fully model-dependent search where all the details of the new physics scenario are known. It cannot be obtained with any model-independent approach. With this in mind, one can still compare the observed and ideal significance directly as in Figure 9. The picture displays a good correlation between the ideal and observed significance in a given experiment and a weak dependence on the type of signal that is responsible for the discrepancy. This behavior might have a deep explanation, which is worth trying to identify. Yet another direction for future work is the assessment of the performances presented for more complex final states than dimuon and for more exotic putative signals.
All the items listed above are worth investigating. However the most pressing aspect to be explored in view of the application of our strategy to real data is the inclusion of the systematic uncertainties in the reference (SM) Monte Carlo. This is conceptually straightforward because our method is based on the Maximum Likelihood approach to hypothesis testing, and systematic uncertainties are easily included in that framework as nuisance parameters. All steps needed to turn likelihood maximization into a training problem are straightforwardly repeated in the presence of nuisance parameters, as mentioned in Ref. [1]. The final outcome is simply that training should be performed against a reference Monte Carlo sample where the nuisance parameters are set to their best-fit values for the dataset under consideration. The concrete implementation of the algorithm in the presence of nuisance parameters thus requires two steps. The first one is to fit the nuisance parameters under the SM hypothesis to the observed data, including auxiliary measurements. Since this first step is the same as in any other experimental analysis, it should not pose any specific issue. Implementing the second step is instead problematic because it would require running the Monte Carlo with the nuisance parameters set to the observed best-fit value. Doing so for many toy SM datasets would be computationally very demanding or unfeasible. Potential solutions are either to obtain the reference sample by reweighting (which will require fitting the dependence on the nuisance of the SM likelihood possibly with a neural network) or to employ a reference sample with benchmark nuisance and correct the test statistics by some approximation of the ratio between the best-fit and the benchmark SM likelihood. It is important to verify if and how these solutions work in practice.
Before concluding it is worth outlining that the problem we are addressing is of rather general relevance in Data Analysis. The methods we are developing could thus find applications outside the specific domain of new physics searches at collider. In abstract terms, the problem can be phrased in terms of two distinct datasets, each of which can be of natural or artificial origin. The first set of data, obeying the "Reference" probability model, must be more abundant than the "Data" because it has to provide both the Reference dataset used for training and the Referencedistributed toy data used to compute the test statistic distribution. In these conditions are met, ours is a strategy to tell if the two datasets are thrown from the same statistical distribution or not, which could be useful in different domains of science. Still remaining in the context of particles physics, other potential applications of our strategies are the comparison of different Monte Carlo generators and data validation.

Appendix: Effects of reference sample's mis-modelling
For this technique to be applied to a New Physics search on real data one needs to carefully match the reference sample to the data. Possible sources of mis-modelling need to be studied and corrected.
The method presented in this work aims at catching discrepancies between the data and the reference, i.e. the best possible description of the SM predictions. At this stage, no distinction is made upon the source of such discrepancies. Therefore a similar response is expected to effects which arise from systematic errors affecting the reference sample and those originated from New Physics phenomena. In the case systematics uncertainties are not properly assessed and coped with, the method would likely lead to type I errors, e.g. false positives.
As already mentioned in Section 6, the method can indeed be extended to include and treat systematic uncertainties as nuisance parameters; the details about this procedure are being worked out and will be properly documented in a future publication [26]. In this appendix we verify the aforementioned hypothesis about how a bias in the reference sample would impact the final result.
The benchmark examples addressed in sections 4 and 5 are again considered, introducing this time an artificial mis-modelling. In order to represent a realistic systematic uncertainty, a mis-calibration of the muon momentum scale is assumed; the relative error is typically of the order of 0.1%, see for instance [93]; still, larger values are also considered in the following, to encompass the cases with final state objects measured with worse accuracy than muons. The mis-calibration is applied separately to the central (|η| < 1.2) and the forward (1.2 < |η| < 2.4) pseudorapidity regions; while the effects are treated as fully correlated, the magnitude in the forward region is assumed three times larger than in the central part. Furthermore, as the decay of the Z boson to muons is usually exploited as a "candle" for in situ calibration, those events are excluded from the analysis, selecting the cases where the mass of the lepton pair is larger than 100 GeV.  Figure 13: Test statistic distributions for toys generated accordingly to the Standard Model, in the case of −0.1% (yellow), 0.5% (red) and −0.5% (green) mis-calibration of the momentum scale.
The response of our test statistic to artificially injected bias is checked both for toys generated according to the Standard Model and for toys containing New Physics. As far as the former are concerned, mis-calibrations of the momentum scale of −0.1% and ±0.5% have been tested (with those values referring to the central pseudorapidity region): as can be seen from the plot in Figure 13, a mis-modeling at the per mil level is not revealed by the algorithm, whereas for larger biases higher values of the test statistics are found and thus a smaller p-value; as expected, the method yields a false positive.
Testing toy datasets including New Physics effects against a mis-modelled reference sample results in the test statistic distributions shown in Figure 14. We consider a Z signal (plot on the left) similar to what used for previous tests, corresponding to a reference significance of about 10σ, yielding an observed significance of 4σ when tested against the unbiased reference sample. If a mis-calibration of ±0.5% on the momentum scale is introduced, the discrepancy between the reference sample and the New Physics toys increases, leading to larger values of the test statistics. The same happens when an EFT signal (with c W = 10 −6 TeV −2 ) is injected and compared to the reference sample: the mis-modelling of the latter enhances the significance of the signal.  Figure 14: Test statistic distributions for toys generated with New Physics signals: Z on the left, a EFT signal on the right. The colours of the histograms represent respectively the cases of no mis-modelling (yellow), 0.5% (red) and −0.5% (green) mis-calibration of the muon momentum scale.
In conclusion, systematic errors in the description of the SM reference sample lead to type I errors; in the case New Physics is present in the data, the corresponding signal is not hidden by the mis-modelling (i.e. we do not incur into false negatives), on the contrary its significance increases.