Learning new physics efficiently with nonparametric methods

We present a machine learning approach for model-independent new physics searches. The corresponding algorithm is powered by recent large-scale implementations of kernel methods, nonparametric learning algorithms that can approximate any continuous function given enough data. Based on the original proposal by D’Agnolo and Wulzer (Phys Rev D 99(1):015014, 2019, arXiv:1806.02350 [hep-ph]), the model evaluates the compatibility between experimental data and a reference model, by implementing a hypothesis testing procedure based on the likelihood ratio. Model-independence is enforced by avoiding any prior assumption about the presence or shape of new physics components in the measurements. We show that our approach has dramatic advantages compared to neural network implementations in terms of training times and computational resources, while maintaining comparable performances. In particular, we conduct our tests on higher dimensional datasets, a step forward with respect to previous studies.


Introduction
Experimental observations and convincing conceptual arguments indicate that the present understanding of fundamental physics is not complete.Our theoretical formulation of the fundamental laws of Nature, the Standard Model, has been predicting with extremely high precision an impressive amount of data collected at past and ongoing experiments.On the other hand, the Standard Model does not provide answer to a multitude of questions including the origin of the electroweak scale, the mass of neutrinos, the flavour structure in the quark, lepton and neutrino sectors, and is unable to account for observed phenomena like the origin and the composition of the dark matter of the baryon asymmetry in the Universe.Further, it does not provide a microscopic description of gravity.These considerations guarantee the existence of more fundamental laws of Nature waiting to be unveiled.In order to access these laws, we must search the experimental data for phenomena that depart from the Standard Model predictions.
Currently, the most common searching strategy is to test the data for the presence of specific new physics models, one at the time.Each search is then optimized to be sensitive to the features specific of the considered new physics scenario.This approach is in general insensitive to sources of discrepancy that differ from those considered.There is therefore a strong effort in developing analysis strategies that are agnostic about the nature of potential new physics and thus complementary to the model-dependent approaches described above [2][3][4][5][6][7][8][9][10][11][12][13].Ideally, this type of analysis should be sensitive to generic departures from a given reference model.In practice, this is a challenge given the complexity of the experimental data in modern experiments and the fact that the new physics signal is expected to be "small" and/or located in a region of the input features which is already populated by events predicted by the reference model.Recently, there has been a strong push towards developing solutions based on machine learning for (partial or full) model-independent searches in high energy physics [1, .
In this work we present a novel machine learning implementation of the analysis strategy proposed by D'Agnolo et al. [1,16].The aim of this strategy is to compute the log-likelihoodratio test statistics without specifying the alternative new physics hypothesis a priori.Towards this end, a neural network model was used in [1,16] to learn the alternative hypothesis directly from the data while the log-likelihood-ratio was maximized to get an optimal test statistics.The strategy assumes that a sample of events representing the Standard Model hypothesis ("reference" sample) is available and that its size is much larger than the one of the experimental data, so that the only relevant statistical uncertainties are those of the data themselves.In the new implementation presented here, neural networks are replaced by kernel methods.Kernel methods are nonparametric algorithms that can approximate any continuous function given enough data.Recent large-scale implementations [38] provides fast and efficient solvers even with very large data-sets.This is relevant since a key bottleneck of the neural network model used in Ref. [1,16] is the extremely long training time, even on low dimensional problems.The solution we propose solves this issue by delivering comparable performances with orders of magnitude gain in training times, see Table 1.We demonstrate the viability of the framework by testing on particle physics datasets of increasing dimensionality, a further step forward with respect to previous studies.
We note that the ideas recently proposed in Ref. [37] share some similarities to our approach.Indeed, the authors of Ref. [37] developed a model-independent strategy based on classifiers to perform hypothesis testing on Standard Model samples and experimental measurements.However, they implement a train-test split of the data for the reconstruction of the test statistics and for inference.This is a major difference with respect to our approach, where the distribution employed for the evaluation of the test statistics is the one that best fits the very same set of data on which the test has to be performed, in accordance with the maximum likelihood philosophy.Moreover, while their approach permits to estimate the distribution of the test statistics with a single training of a classifier, only half of the experimental data is used for new physics detection.A in-depth comparison of the two models will be explored in future works.
The rest of the paper is organized as follows.In Section 2 we introduce the main statistical framework at the basis of this work, elaborating on the discussion in Ref. [1].In Section 3, we discuss the different aspects of the proposed model, in particular the underlying machine learning algorithm.In Section 4, we test the algorithm on realistic simulated datasets in various dimensions and we explicitly compare our proposal with the neural network models in Ref. [1,16].Finally in Section 5, we lay out our conclusions and discuss future developments.In the appendices, we review some background material and present other complementary experiments.

Statistical foundations
In this section, we reprise and elaborate the main ideas in Refs.[1,16], tackling the problem of testing the data for the presence of new physics with tools from statistics and machine learning.
We start by assuming that an experiment is performed and its outcome can be described by a multivariate random variable x.A physical model corresponds to an ensemble of mathematical laws characterizing a distribution for x.In this view, we denote by p(x|0) the distribution of the measurements as described by the Standard Model and by p(x|1) the unknown true distribution of the data.Discovering new physics will be cast as the problem of testing whether the latter coincides with the former or not.
The distribution p(x|0) is essentially known.Although not analytically computable in most high energy physics applications, it can be sampled via Monte Carlo simulations or extracted using control regions with data driven techniques.In the following, we denote one such set of independent and identically distributed random variables (i.i.d.) by and the actual measured data by, It should be pointed out that in real applications one would also consider the uncertainties affecting the knowledge of the reference model.Similarly to Refs.[1, 16], we will assume N 0 N 1 so that the statistical uncertainties on the reference sample can be neglected.It should be possible to include systematic uncertainties as nuisance parameters, as shown in Ref. [39] for the neural network implementation.However, we assume that the systematic uncertainties are negligible in what follows and leave this aspect to future works.
The idea in Ref [1] is to translate the maximization of the log-likelihood-ratio test into a machine learning problem, where the null hypothesis characterising one of the likelihood terms is the reference hypothesis (namely the Standard Model) and the alternative hypothesis characterising the other likelihood term is unspecified a priori and learnt from the data themselves during the training.The test statistic obtained in this way is therefore a good approximation of the optimal test statistic according to the Neyman-Pearson lemma.We define the likelihood of the data S 1 under a generic hypothesis H as where is the data distribution normalized to the expected number of events As already said, p(x|0) is essentially known and well represented by the reference sample while p(x|1) is not and thus its exact form must be replaced by a family of distributions p w (x|1), parametrized by a set of trainable variables w.We can write the likelihood ratio test statistics as, and optmize it by maximizing over the set of parameters w.The original proposal in Ref.
[1] suggested to exploit the ability of neural networks as universal approximators to define a family of functions describing the log-ratio of the density distributions in Eq. ( 6) As discussed below the same approach can be taken replacing neural networks with other machine learning approaches, e.g.kernel methods.Following the above reasoning, the maximum of the test statistic could then be rewritten as the minimum of a loss function and the set of parameters ŵ which maximizes t w (S 1 ) provides also the best approximation of the true underlying data distribution and with it a first insight on the source and shape of the discrepancy, if present.Note that the loss defined by Eq. ( 8) is unbounded from below.In Ref.
[1] a regularization parameter is introduced as a hard upper bound (weight clipping) on the magnitude of the parameters w.

Designing a classifier for hypothesis testing
In this work we develop the above ideas considering a different loss function, namely a weighted cross-entropy (logistic) loss function.This was a possibility mentioned as a viable alternative in Ref.
[1] that we indeed show to yields several advantages.To estimate the ratio in Eq.( 9) we train a binary classifier on S = S 0 ∪ S 1 using a weighted cross-entropy loss (y, where y is the class label and takes value zero for S 0 and one for S 1 .The classifier is obtained minimizing an empirical criterion over a suitable class of machine learning models f w .If such a models class is sufficiently rich, in the large sample limit we would recover a minimizer of the expected risk where p(x, y) is the joint data distribution.By a standard computation (see Appendix A), the function minimizing the expected risk in Eq. ( 12) can be shown to be that, by Bayes theorem and Eq.( 4), we can rewrite as From the above expression and choosing the weights so that Eq.( 13) reduces to Eq.( 9), as desired.In practice, the above condition can be satisfied only approximately, since it depends on quantities we do not know.Hence, we first estimate the class priors using the empirical class frequencies, p(y) ≈ N y /N with N = N 0 + N 1 and obtain Then, we approximate the number of expected events in the alternative hypothesis with the actual number of experimental measurements N (1) ≈ N 1 . 1 The following expression of the weights can then be used in practice, To reconstruct the test statistics in Eq.( 6), the number of expected events in the alternative hypothesis needs to be computed.Using the density ratio in Eq.( 9), we have that Since the reference distribution n(x|0) is not known analytically, we can estimate the above expression using a Monte Carlo approximation considering Using Eq. ( 19), the test statistics in Eq.( 6) can be written as recovering the original result from Ref. [1].
The main conceptual difference with respect to the original solution in Ref.
[1] lays on the computation of the test statistic.When using the loss in Eq. ( 8) the test statistic can be directly obtained from the value of the loss function at the end of the training.When using the cross-entropy loss, each term of the log-likelihood-ratio test is calculated separately and then combined, see Eq. ( 20).This could be a problem if the optimality of the minimization procedure is not ensured.More precisely, in the first case the minimum found at the end of the training is by construction the one maximizing the log-likelihoodratio test, while this is guaranteed only in the asymptotic limit in the second case.On the other hand, as noted before, the loss function in Eq. ( 8) is less well behaved from a mathematical standpoint, making optimization during training less trivial.Interestingly, both loss functions are designed to estimate the same density ration, and in practice we show that they obtain comparable performances in terms of sensitivity to new physics.
We conclude noting that the value of the test statistic t w (S 1 ) is a random variable itself following a distribution p(t|H).The level of significance associated to a value of the test statistic is computed as a p-value of the test statistic with respect to its distribution under the null hypothesis This can be further rewritten as a Z-score where Φ −1 is the quantile of a Normal distribution.In this way Z obs is expressed in units of standard deviations.Following Ref.
[1], by leveraging the possibility to sample from the reference distribution, we choose to reconstruct p(t|0) by estimating the likelihood ratio test statistics on a number N toy of toy experiments run on pseudo datasets extracted from the reference sample.The latters have the same statistics of the actual data but do not have any genuine new physics component.
Class imbalance.To accurately represent the reference distribution, it is preferable to consider a large reference sample, while the number of experimental samples is determined by the parameters of the experiment, specifically its luminosity.This leads to an imbalanced classification problem and a natural approach is to re-weight the loss using the inverse class frequencies N y .The true number of expected events differ from the number of events in the reference hypothesis by the number of expected new physics events, i.e., N (1) = N (0) + N (S).Then, one has that N 1 ∼ Pois(N (0) + N (S)).From both the experimental and theoretical points of view, it is reasonable to assume that N (S) N (0).Therefore, one has that N 1 ≈ N (0).Hence, by using the weight in Eq.( 17), besides recovering the desired target function, we solve potential issues related with an imbalanced dataset, while keeping the statistical advantage of having a large reference sample.

Analysis strategy
The complete analysis strategy can be summarized in three steps: • the test statistic distribution is empirically built by running the training on N toy = O(100) toy experiments for which both the training sample S 1 and S 0 are generated according to the null hypothesis.
• One last training is performed on the dataset of interest S 1 for which the true underlying hypothesis is unknown and the test statistic value t(S 1 ) is evaluated.
• The p-value corresponding to t(S 1 ) is computed with respect to the test statistic distribution under the null hypothesis, studied at step 1.
If a statistically significant deviation from the reference data is found, the nature of the discrepancy can be further characterized by inspecting the learned density ratio in Eq.( 9).This quantity is expected to be approximately zero if no disagreement is found and it can be inspected as a function of the input features or their combinations.
Asymptotic formula Typically, for an accurate estimation of p(t|0), the empirical distribution of the test statistic under the reference hypothesis has to be reconstructed using a large number of toy experiments and this might be practically unfeasible.If the value of t(S 1 ) falls outside of the range of the empirical distribution the p-value cannot be computed and only a lower bound can be set.Inspired by the results by Wald and Wilks [40][41][42] characterizing the asymptotic behavior of the log-likelihood test statistics, we approximate the null distribution with a χ 2 distribution.We use the toy-based empirical estimate to determine the degrees of freedom of the χ 2 distribution and we test the compatibility of the empirical test statistic distribution with the χ 2 hypothesis using a Kolmogorov-Smirnov test.This approximation holds well in almost all instances of our model.We did not explore this aspect in details but we present a counterexample towards the end of Section 4. The same approximation is also used in the neural network model of [1,16].It is worth specifying that, in real-life scenarios, if the p-value computed in this way would imply a discovery, one would run additional toys to obtain an accurate empirical estimation by brute-force exploitation of the large-scale computing resources typically accessible by the LHC collaborations.

Scalable nonparametric learning with Kernels
As mentioned before, a rich model class is needed to effectively detect new physics clues in the data.In this work, we consider kernel methods [38,43] of the form Here k γ (x, x i ) is the kernel function and γ some hyper-parameter.In our experiments, we consider the Gaussian kernel so that f w corresponds to a linear combination of Gaussians of prescribed width γ, centered at the input points.Such an approach is called nonparametric because the number of parameters corresponds to the number of data points: the more the data, the more the parameters.Indeed, this makes kernel methods universal in the large sample limit, in the sense that they can recover any continuous function [44,45].The computational complexity to determine a function as in Eq. ( 23) is typically cubic in time and quadratic in space with respect to the number of points.These costs prevent the application of basic solvers in large-scale setting, and some approximation is needed.Towards this end we consider Falkon [38], which replaces Eq. ( 23) by where {x 1 , ..., xM } ⊂ {x 1 , ..., x N } are called Nyström centers and are sampled uniformly at random from the input data, with M an hyper-parameter to be chosem.Notably, the corresponding solution can be shown to be with high probability as accurate as the original exact one while computable with only a small fraction of computational resources [46][47][48][49][50][51].
We defer further details to the appendices.
Algorithm training The model's weights in Eq. ( 25) are computed to minimize the empirical error (11) defined by the weighted cross-entropy loss introduced before.Since, the kernel model can be very rich, the search of the best model is done considering where the first term is the empirical risk, while R(f ) is a regularization term constraining the complexity of the model [52].Problem (26) is then solved by an approximate Newton iteration [38].

Hyper-parameters tuning
The number of Nyström centers (M ), the bandwidth of the Gaussian kernel σ and the regularization parameter λ are the main hyper-parameters of the model.The number of centers M determines the number of Gaussians, hence it has an impact on the accuracy and on the computational cost; studies suggests that optimal statistical bounds can be achieved already with M = O( √ N ) [49,53].On the other hand, by varying the hyper-parameters σ and λ, more or less complex functions can be selected.For large λ or σ the model simplifies and tends to be linear, while for small values it tends to fit the statistical fluctuations in the data.
The values of M , σ and λ affect the distribution of the test statistic under the reference hypothesis.In particular we observe that the test statistic distribution obtained with different choices of the hyper-parameters always fits a χ 2 distribution with a number of degrees of freedom determined empirically as explained in Section 2.2.More complex functions cause the distribution of the test statistic to move to higher values (see Figure 10a).
On the M direction, a stable configuration is eventually reached and this information can be used to select a proper trade-off value for M (see for instance Figure 10); conversely there is not clear indication on how to choose the values of σ and λ.The bandwidth σ is related to the resolution of the model and its ability to fit statistical fluctuations in the data.To estimate the relevant scales of the problem and find a good trade-off between complexity and smoothness, we look at the distribution of the pairwise (Euclidean) distance in the reference data.We then fix σ approximately as the 90th percentile (see Appendix C and Figure 13 for further details).Finally, λ determines the weight of the penalty term in the loss function, which constraint the magnitude of the trainable weights, and avoid instabilities during the training.We take λ as small as possible so that the impact on the weight magnitude is minimum, while maintaining the algorithm numerically stable.
Summarizing, the hyper-parameter tuning protocol is composed by the following three steps: • We consider a number of centers greater or equal to √ N , with the criteria that more centers could improve accuracy but at the cost of losing efficiency.
• We then fix σ approximately as its 90th percentile of the pairwise distance distribution.
• We take λ as small as possible while maintaining a numerically stable algorithm.
Similarly to the tuning procedure introduced in Ref. [16] for the neural networks, the outlined directives for hyper-parameters selection rely on the reference data alone, preserving model-independence.
We tested this heuristic performing several experiments on the toy scenario presented in Appendix C. In particular, we verified that it gives rise to instances that demonstrate good performances, in terms of sensitivity to new physics clues, across different types of signal.We also verified that the results are robust against small variations of the chosen hyper-parameters.When applied to the final experiments presented in the following section, we followed the prescription given above without any fine tuning that might introduce a bias that favors the specific dataset considered.
Assessing the algorithm performances Following Ref. [1], in order to evaluate different models on benchmark cases it is useful to introduce the ideal significance Z id ., i.e., the value of the median Z-score that is obtained by using the exact (ideal) likelihood ratio test statistics: Typically, this quantity cannot be computed exactly since the likelihoods are not known analytically.We can however obtain an accurate estimate Ẑid using simulated data and model-dependent analyses that leverage what is known about the type of new physics in the data.We will report how Ẑid has been computed for every experiment.

Experiments
In this section, we apply the proposed approach to three realistic simulated high energy physics datasets with an increasing number of dimensions.Each dataset is made of two classes: a reference class, containing events following the Standard Model, and a data class, made of reference events with the injection of a new physics signal.Each case includes a set of features given by kinematical variables as measured by the particle detectors (plus additional quantities when available, such as reconstructed missing momenta and b-tagging information) that we call low-level features.From the knowledge of the intermediate physics processes, one can compute additional high-level features that are functions of low-level ones and posses a higher discriminative power. 2 The different features are used to test the flexibility of the model.The pipeline for training and tuning our method is that described in Section 3.

Datasets
Here, we briefly review some properties of the datasets, how Ẑid is computed and the parameters chosen for the experiments.We refer the reader to Ref. [16,54] for further details.
DIMUON This is a five dimensional simulated dataset that was introduced in Ref. [16] and it is composed of simulated LHC collision events producing two muons in the final state pp → µ + µ − , at a center-of-mass energy of 13 TeV.The low-level features are the transverse momenta and pseudorapidities of the two muons and their relative azimuthal angle, i.e., x = [p T 1 , p T 2 , η 1 , η 2 , ∆φ].We consider two types of new physics contributions: the first one is a new vector boson (Z ) for which we study different mass values (m Z = 200, 300 and 600 GeV); the second one is instead a non-resonant signal obtained by adding a four fermions contact interaction to the Standard Model lagrangian 2 We borrow this nomenclature from Ref. [54].
where J µ La is the SU(2) L Standard Model current, the energy scale Λ is fixed at 1 TeV and the Wilson coefficient c W determining the coupling strength can be chosen between three values (c W = 1, 1.2 and 1.5 TeV −2 ).For both types of signal the invariant mass of the two muons is the most discriminant non trivial combination of the kinematic variables describing the system so we consider it as a high-level feature.We fix N (0) = 2 × 10 4 expected events in the reference hypothesis and the size of the reference sample is N 0 = 10 5 , unless specified otherwise.We vary the number of expected signal events in the range N (S) ∈ [6,80].We selected the following hyper-parameters: (M, σ, λ) = (2 × 10 4 , 3, 10 −6 ).
The ideal significance is estimated via a cut-and-count strategy in the invariant mass m distribution around m Z for the resonant signal, while a likelihood ratio test on the binned m distribution is used for the non-resonant case.
SUSY The SUSY dataset [54] is composed of simulated LHC collision events in which the final state is made of two charged leptons and missing momentum.The latter is given, in the Standard Model, by two neutrinos coming from the fully leptonic decay of the two W bosons.The new physics scenario also includes the decay of a pair of electrically charged supersymmetric particles χ± in two neutral supersymmetric particles χ0 χ0 , undetectable and thus contributing to the missing transverse momentum, and two W bosons.The dataset has 8 raw features and 10 high-level features.
The ideal significance is estimated by training a supervised classifier to discriminate between background and signal with a total of 2M examples, following the approach in Ref. [54].The significance is then estimated by a cut-and-count analysis on the classifier output.
HIGGS The HIGGS dataset [54] is made of simulated events in which the signal is given by the production of heavy Higgs bosons H.The final state is given by a pair of vector bosons W ± W ∓ and two bottom quarks b b for both the reference and the signal components.The dataset has 21 raw feaures and 7 high-level feautures.
The ideal significance is estimated as in the previous case by using the output of a supervised classifier trained to separate signal from background.

Results
Sensitivity to new physics We discuss here the sensitivity of the model to the presence of new physics signals in the data.The test statistic distribution under the reference hypothesis is empirically reconstructed using 300 toy experiments while 100 toys are used to reconstruct the distribution of the test statistic under the alternative new physics scenarios.We show in Figures 2 the median observed significance against the estimated ideal significance with Falkon trained on low-level features only.These experiments were performed by varying the signal fraction N (S)/N (0) (at fixed luminosity) and the type of signal (the latter in the DIMOUN case only).The error bars represent the 68% confidence interval.As expected for a model-independent strategy, the observed significance is always lower than what obtainable with a model-dependent approach.The loss of sensitivity is more pronounced in higher dimensions.Nevertheless, we observe in all cases a correlation between the observed and the ideal significance.In the DIMUON case, the observe significance seems to depend weakly on the type of new physics signal.In Figure 3, we show explicitly, for the Z' new physics with m Z = 300 Gev, the estimated probabilities to find a discrepancy of at least α for a given value of Ẑid .Similar results are obtained with the other types of signal.To test the ability of the kernel-based approach to extract useful information from data, we show in Figure 4 that adding the high-level features does not significantly improve the results, especially in higher dimensions.The plot includes the observed significance, with the bar showing the 68% confidence interval and the grey area representing the region Z (all) obs = Z (low-level) obs ± σ.
Figure 2: Observed significance against estimated ideal significance with low-level input features.Comparison with neural networks To compare the kernel-based approach with the neural network implementation, we considered the results from Ref. [16] for the DIMUON dataset, while we trained the latter on the SUSY and the HIGGS datasets.The considered neural network has 2 hidden layer with 10 neurons each and a weight clipping of w clip = 0.87 for SUSY, while it has 5 layers with 6 neurons each layer and w clip = 0.65 for HIGGS.
Training is stopped after 3 × 10 5 epochs.The results are summarized in Figure 5.We see that, overall, the two approaches give similar results and the degradation of the sensitivity in high dimensions affects both.We notice that in the DIMUON case, the kernel approach is slightly less sensitive, as it can be seen from the results presented in Section 5 of Ref. [16] against Figures 2a and 3. On the other hand, by looking at Figure 5 we see that, for the HIGGS dataset, the kernel approach gives a higher observed significance while, for the SUSY dataset, the two methods give almost identical results.On the other hand, the average training times, summarized in Table 1, demonstrate an advantage in favor of the kernel approach of orders of magnitude.This also allows efficient training on single GPU machines and ensures high scalability for multi-GPU systems, as shown in Ref. [38].Learned density ratio As discussed in Section 2, the function approximated by using the weighted cross-entropy loss is the density ratio given in Eq.( 9).The latter can be directly inspected to characterize the nature of the "anomalies" in the experimental data, if found significant.We report in Figure 6 examples of the reconstructed density ratios as functions of certain high-level features (not given as inputs) together with estimates of the true ratios and extrapolations from the data used for training.The learned density ratio is constructed by re-weighting the relevant high-level feature of the reference sample by e f ŵ (x) (evaluated on the reference training data), binning it and taking the ratio with the same binned reference sample (unweighted).The toy density ratio is computed by replacing the numerator with the binned distribution of the high-level feature of the toy data sample.The ideal case is obtained in the same way but using a large (≥1M) data sample instead.Size of the reference sample A larger reference sample yields a better representation of the reference model, which is crucial for a model-independent search.In Figure 7a, we see that as long as N 0 /N (0) 1, the median observed significance is indeed stable.On the other hand, when the reference sample is too small (N 0 /N (0) < 1), we observe that the correspondence between the distribution of the test statistics and the χ 2 distribution breaks down, see Figure 7b.We observe this behavior for all the datasets.Then, it is in general a good approach to take a reference sample as large as possible keeping in consideration the computational cost of training on a possibly very large dataset.Resources The models based on Falkon have been trained on a server with the specifications reported in

Conclusions
In this work we have presented a machine learning approach for model-independent searches applying kernel-based machine learning models to the ideas introduced in Ref. [1,16].Our approach is powered at its core by Falkon, a recent library developed for large scale applications of kernel methods.The focus of our work is on computational efficiency.Indeed, the original neural network proposal suffers from long training times which, combined with a toy-based hypothesis testing framework, makes the use of the algorithm challenging in high dimensional cases.Our model delivers comparable performances with a dramatic reduction in training times, as shown in Table 1.As a consequence, the model can be efficiently trained on single GPU machines while possessing high scalability for multi-GPU systems [38].In contrast, the neural network implementation crucially relies on per toy parallelization, hence the need for large scale resources such as CPU/GPU clusters.
Similarly to Ref. [16], the applicability of the proposed method relies on a heuristic procedure to tune the algorithm hyper-parameters.A more in-depth understanding of the interplay between the expressibility of the model, its complexity and the topology of the input dataset could lead to more performant and better motivated alternatives to the current hyper-parameter selection.Further investigations are left for future work.One possibility would be to find a more principled way to relate Falkon hyper-parameters to physical quantities.This could also allow the introduction of explicit quantities to be optimized, opening to the possibility of applying modern optimization techniques for the selection of the hyper-parameters.
Besides the challenges related to the algorithm optimization and regularization, an essential development for the application to realistic data analysis concerns the treatment of systematic uncertainties which has not been considered in the present work.This aspect was successfully addressed on a recent work [39] in the context of the neural network implementation.
Finally, the boost in efficiency provided by the model developed in this work could extend the landscape of applicability of this analysis strategy to other use cases, beyond the search for new physics, and to other domains.In particular, the application to multivariate data quality monitoring in real time is currently under study.
where g k ∈ R n such that (g k ) i = l (f (x i ), y i ).With this strategy, the overall computational cost to achieve optimal statistical bounds is O(n √ n log n) in time, and in O(n) in memory, making it suitable for large scale problems.

C 1D example
We consider here a simple univariate toy scenario taken from Ref.
[1].We use this example to present explicitly all the steps discussed in Section 2 and Section 3.
Data.We know here explicitly both the reference and the true data generating distributions.The former is an exponential The latter is given by the reference distribution combined with a signal component and reads as where p(x|S) is the distribution of the signal alone and N (S) is the expected number of signal events.We consider three types of signals, two that are localized in a region of the input feature (resonant) and one that is not (non-resonant).They are given by the following expressions (see also Figure 8): • A Gaussian distribution centered in the tail of the exponential background • A Gaussian distribution centered in the bulk of the exponential background • A non-resonant signal given by  We select a large reference sample of size N 0 = 2 × 10 5 and an expected number of background events N (0) = 2 × 10 3 .The size of the data sample is then N 1 ∼ Pois(N (y)), with N (1) = N (0) + N (S) when the data sample is generated according to the true data distribution.

Model tuning
In Figure 9 we show the distribution of the pairwise distances from which we select the bandwidth as approximately the 90th percentile.In this case, it corresponds to σ ≈ 0.3.In Figure 10a we show that the test statistics averaged over twenty independent runs (with reference data only) reaches a plateau at M ≈ 500 ≈ √ N 0 .This suggests that the estimated distribution of the test statistics under the null hypothesis does not change if more centers are selected.On the other hand, larger values of M might increase the sensitivity of the model to new physics, at the expense of efficiency in training times and memory.By looking at the average training time, as reported for instance in Figure 10b, we fix M = 3000 .Figure 10a also shows that the value of the test statistics increases for more complex models (smaller σ and/or λ).Finally, we take λ = 10 −10 because the training would show occasional instabilities at smaller values.

Model training and results
We first reconstruct the distribution of the test statistics under the null hypothesis p(t|0).We train the algorithm on N toys = 300 toy reference samples (N (S) = 0).In this simple scenario, we use our complete knowledge of the problem to reconstruct the distribution of the test statistics under the alternative hypothesis p(t|1) by performing multiple experiments with N toys = 100 toy data samples with injection of new physics events.The reconstructed distribution for the non-resonant case is shown in Figure 11.We can see that the test statistics with reference data follows a χ 2 distribution with 9.58 degrees of freedom (determined with a Kolmogorov-Smironov test).The median observed significance for the three cases is Z obs = (2.We can also inspect the learned density ratio to characterize the potential new physics clues.In this 1D case, this amounts to simply look at where exp(f ŵ(x)) deviates significantly from one.We show some examples in Figure 12.They are obtained by showing the ideal (exact) likelihood ratio, the ratio between the (binned) toy and reference samples and the learned functions.
Finally, Figure 13 shows that the results are stable around the selected bandwidth σ across the different types of new physics signals.

Figure 1 :
Figure 1: Distribution of the test statistics under the null and alternative hypotheses for the DIMUON (left) and HIGGS (right) datasets.

Figure 3 :
Figure 3: Probability of finding a α = 2σ, 3σ, 5σ evidence for new physics as a function of the ideal significance.

Figure 4 :
Figure 4: Comparison of the observed significance obtained with Falkon using low level features only and all the features.

Figure 5 :
Figure 5: Observed significance with the Falkon implementation against neural networks.

Figure 6 :
Figure 6: Examples of reconstructed density ratios as a functions of high-level features (not given as inputs) for the DIMUON (left) and SUSY (right) datasets with new physics components in the data.Note that the SUSY dataset is normalized.

Figure 7 :
Figure 7: Observed significance as a function of the size of the reference sample(left).Example of distribution of the test statistics given a small reference sample (right).
Average test statistics as a function of the number of Nyström centers.
Average training time as a function of the number of Nyström centers.

Figure 11 :
Figure 11: Distribution of the test statistics under the null and alternative hypotheses (nonresonant new physics).

Table 2 .
The NN experiments have been performed on a CPU farm with 32 computing nodes of Intel 64 bit dual processors, for a total amount of 712 core.

Table 2 :
Specifications of the machine used to perform the experiments with Falkon. ) 43, 2.82, 3.04).The average training time for a single reference toy is t train ≈ 2.11 s.In this case we can compute the ideal testFigure 12: Reconstructed density ratios.Non-resonant signal (left) and Gaussian in the tail (right).This quantity is then evaluated on a large number (10M) of reference examples to accurately reconstruct p(t|0) and on 300 data samples for each type of signal.This was done in Ref.[1]and the resulting values are Ẑid = (4.7,4.1, 4.4).We lose approximately 1.6σ of sensitivity on average.
statistics exactly using the true distributions as follows t id (S) = −2 −N (S) +