Guiding New Physics Searches with Unsupervised Learning

We propose a new scientific application of unsupervised learning techniques to boost our ability to search for new phenomena in data, by detecting discrepancies between two datasets. These could be, for example, a simulated standard-model background, and an observed dataset containing a potential hidden signal of New Physics. We build a statistical test upon a test statistic which measures deviations between two samples, using a Nearest Neighbors approach to estimate the local ratio of the density of points. The test is model-independent and non-parametric, requiring no knowledge of the shape of the underlying distributions, and it does not bin the data, thus retaining full information from the multidimensional feature space. As a proof-of-concept, we apply our method to synthetic Gaussian data, and to a simulated dark matter signal at the Large Hadron Collider. Even in the case where the background can not be simulated accurately enough to claim discovery, the technique is a powerful tool to identify regions of interest for further study.


Introduction
The problem of comparing two independent data samples and looking for deviations is ubiquitous in statistical analyses.It is of particular interest in physics, when addressing the problem of searching for new phenomena in data, to compare observations with expectations to find discrepancies.In general, one would like to assess (in a statistically sound way) whether the observed experimental data are compatible with the expectations, or there are signals of the presence of new phenomena.
In high-energy physics, although the Standard Model (SM) of particle physics has proved to be extremely successful in predicting a huge variety of elementary particle processes with spectacular accuracy, it is widely accepted that it needs to be extended to account for unexplained phenomena, such as the dark matter of the Universe, the neutrino masses, and more.The search for New Physics (NP) beyond the SM is the primary goal of the Large Hadron Collider (LHC).The majority of NP searches at the LHC are performed to discover or constrain specific models, i.e. specific particle physics extensions of the SM.Relatively less effort has been devoted to design and carry out strategies for model-independent searches for NP [1][2][3][4][5][6][7][8].At the current stage of no evidence for NP in the LHC data, it is of paramount importance to increase the chances of observing the presence of NP in the data.It may even be already there, but it may have been missed by model-specific searches.
Recently, there has been growing interest in applying Machine Learning (ML) techniques to high-energy physics problems, especially using supervised learning (see e.g.Refs.[9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27] and in particular the recent work of Ref. [8] with which we share some ideas, although with a very different implementation).On the other hand, applications of unsupervised learning have been relatively unexplored [9,23].In unsupervised learning the data are not labeled, so the presence and the characteristics of new phenomena in the data are not known a priori.One disadvantage of unsupervised learning is that one cannot easily associate a performance metric to the algorithm.Nevertheless, unsupervised methods such as anomaly (or outlier) detection techniques, or clustering algorithms, provide powerful tools to inspect the global and local structures of high-dimensional datasets and discover 'never-seen-before' processes.
In this paper, we propose a new scientific application of unsupervised learning techniques to boost our ability to search for new phenomena in data, by measuring the degree of compatibility between two data samples (e.g.observations and predictions).In particular, we build a statistical test upon a test statistic which measures deviations between two datasets, relying on a Nearest-Neighbors technique to estimate the ratio of the local densities of points in the samples.
Generally speaking, there are three main difficulties one may face when trying to carry out a search for the presence of new processes in data: (1) a model for the physics describing the new process needs to be assumed, which limits the generality of the method; (2) it is impossible or computationally very expensive to evaluate directly the likelihood function, e.g.due to the complexity of the experimental apparatus; (3) a subset of relevant features needed to be extracted from the data, otherwise the histogram methods may fail due to the sparsity of points in high-dimensional bins.
A typical search for NP at LHC suffers from all such limitations: a model of NP (which will produce a signal, in the high-energy physics language) is assumed, the likelihood evaluation is highly impractical, and a few physically motivated variables (observables or functions of observables) are selected to maximize the presence of the signal with respect to the scenario without NP (the so-called background).
Our approach overcomes all of these problems at once, by having the following properties: 1. it is model-independent: it aims at assessing whether or not the observed data contain traces of new phenomena (e.g.due to NP), regardless of the specific physical model which may have generated them; 2. it is non-parametric: it does not make any assumptions about the probability distributions from which the data are drawn, so it is likelihood-free; 3. it is un-binned : it partitions the feature space of data without using fixed rectangular bins; so it allows one to retain and exploit the information from the full high-dimensional feature space, when single or few variables cannot.
The method we propose in this paper is particularly useful when dealing with situations where the distribution of data in feature space is almost indistinguishable from the distribution of the reference (background) model.Although our main focus will be on high-energy particle physics searches at the LHC, our method can be successfully applied in many other situations where one needs to detect incompatibilities between data samples.
The remainder of the paper is organized as follows.In Section 2 we describe the details of the construction of our method and its properties.In Section 3 we apply it to case studies with simulated data, both for synthetic Gaussian samples and for a more physics-motivated example related to LHC searches.We outline some directions for further improvements and extensions of our approach, in Section 4. Finally, we conclude in Section 5.

Statistical test of dataset compatibility
In general terms, we approach the problem of measuring the compatibility between datasets sampled from unknown probability densities, by first estimating the probability densities and then applying a notion of functional distance between them.The first task is worked out by performing density ratio estimation using Nearest Neighbors, while the distance between probability densities is chosen to be the Kullback-Leibler divergence [28].We now describe our statistical test in more detail.

Definition of the problem
Let us start by defining the problem more formally.Let i=1 be two independent and identically distributed D-dimensional samples drawn independently from the probability density functions (PDFs) p T and p B , respectively: We will refer to B as a 'benchmark' (or 'control' or 'reference') sample and to T as a 'trial' (or 'test') sample.The R D space where the sample points x i , x i live will be referred to as 'feature' space.The primary goal is to check whether the two samples are drawn from the same PDF, i.e. whether p B = p T .In other words, we aim at assessing whether (and to what significance level) the two samples are compatible with each other.More formally, we want to perform a statistical test of the null hypothesis {H 0 : p T = p B } versus the alternative hypothesis This problem is well-known in the statistics literature as a two-sample (or homogeneity) test, and many ways to handle it have been proposed.We want to construct a statistical hypothesis test of dataset compatibility satisfying the properties 1-3 outlined in the introduction.
First, the B, T samples are going to be analyzed without any particular assumptions about the underlying model that generated them (property 1); our hypothesis test does not try to infer or estimate the parameters of the parent distributions, but it simply outputs to what degree the two samples can be considered compatible.
Second, if one is only interested in a location test, such as determining whether the two samples have the same mean or variance, then a t-test is often adopted.However, we assume no knowledge about the original PDFs, and we want to check the equality or difference of the two PDFs as a whole; therefore, we will follow a non-parametric (distribution-free) approach (property 2).
Third, we want to retain the full multi-dimensional information of the data samples, but high-dimensional histograms may result in sparse bins of poor statistical use.The popular Kolmogorov-Smirnov method only works for one-dimensional data, and its multidimensional extensions are based on binning the data.Alternative non-parametric tests like the Cramér-von Mises-Anderson test or the Mann-Withney test require the possibility of ranking the data points in an ordinal way, which may be ill-defined or ambiguous in highdimensions.Thus, we will employ a different partition of feature space not based on fixed rectangular bins (property 3), which allows us to perform a non-parametric two-sample test in high dimensions.
So, in order to construct our hypothesis test satisfying properties 1-3, we need to build a new test statistic and construct its distribution, as described in the next sections.

Test statistic
Since we are interested in measuring the deviation between the two samples, it is convenient to define the ratio of probability densities to observe the points in the two samples, in the case p B = p T (numerator) relative to the case p B = p T (denominator) λ ≡ x j ∈B p B (x j ) x j ∈T p T (x j ) The above quantity may also be thought of as a likelihood ratio.However, as we are carrying out a non-parametric test, we prefer not to use this term to avoid confusion.Now, since the true PDFs p B,T are not known, we follow the approach of finding estimators pB,T for the PDFs and evaluate the ratio λ on them λ = We then define our test statistic TS over the trial sample as where |T | = N T is the size of the trial sample.This test statistic will take values close to zero when H 0 is true, and far from zero (positively or negatively) when H 0 is false.The test statistic defined in Eq. (2.5) is also equal to the estimated KL divergence DKL (p T ||p B ) between the estimated PDFs of trial and benchmark samples, with the expectation value replaced by the empirical average (see Eq. (A.2)).Of course, other choices for the test statistic are possible, based on an estimated divergence between distributions other than the KL divergence, e.g. the Pearson squared-error divergence.The exploration of other possibilities is beyond the scope of this paper and is left for future work.
Ultimately, we want to conclude whether or not the null hypothesis can be rejected, with a specified significance level α (e.g.α = 0.05), therefore we need to associate a p-value to the null hypothesis, to be compared with α.To this end, we first need to estimate the PDFs pB,T from the samples, then compute the test statistics TS obs observed on the two given samples.Next, in order to evaluate the probability associated with the observed value TS obs of the test statistic, we need to reconstruct its probability distribution f (TS|H 0 ) under the null hypothesis H 0 , and finally compute a two-sided p-value of the null hypothesis.
The distribution of the test statistic is expected to be symmetric around its mean (or median), which in general may not be exactly zero as a finite-sample effect.Therefore, the two-sided p-value is simply double the one-sided p-value.
A schematic summary of the method proposed in this paper is shown in Figure 1.In the remainder of this section we will describe this procedure in detail.

Probability density ratio estimator
We now turn to describing our approach to estimating the ratio of probability densities pB /p T needed for the test statistic.There exist many possible ways to obtain density ratio estimators, e.g. using kernels [29] (see Ref. [30] for a comprehensive review).We choose to adopt a Nearest-Neighbors (NN) approach [31][32][33][34][35][36][37].
Let us fix an integer K > 0. For each point x j ∈ T , one computes the Euclidean distance1 r j,T to the Kth nearest neighbor of x j in T \ {x j }, and the Euclidean distance r j,B to the Kth nearest neighbor of x j in B. Since the probability density is proportional to the density of points, the probability estimates are simply given by the number of points (K, by construction) within a sphere of radius r j,B or r j,T , divided by the volume of the sphere and the total number of available points.Therefore, the local nearest-neighbor estimates of the PDFs read pT (for any x j ∈ T ) where ω D = π D/2 /Γ(D/2 + 1) is the volume of the unit sphere in R D .So, the test statistic defined in Eq. (2.5) is simply given by The NN density ratio estimator described above has been proved to be consistent and asymptotically unbiased [33,34,36], i.e. the test statistic TS (2.8) built from the estimated probability densities converges almost surely to the KL divergence between the true probabilities in the large sample limit N B , N T → ∞.
Two advantages of the NN density ratio estimator are that it easily handles highdimensional data, and its calculation is relatively fast, especially if k-d trees are employed to find the nearest neighbors.As a disadvantage, for finite sample sizes, the estimator (2.8) retains a small bias, although several methods has been proposed to reduce it (see e.g.Refs.[33,38]).
The use of NN is also convenient as it allows the partition of the feature space not into rectangular bins, but into hyper-spheres of varying radii, making sure they are all populated by data points.
The test statistic TS in Eq. (2.8), being an estimator of the KL divergence between the two underlying (unknown) PDFs, provides a measure of dataset compatibility.In the construction of TS we have chosen a particular K as the number of nearest neighbors.Of course, there is not an a priori optimal value of K to choose.In the following analyses we will use a particular choice of K, and we will comment on the possibility of extending the algorithm with adaptive K in Section 4.2.
Now that we have a test statistic which correctly encodes the degree of compatibility between two data samples, and its asymptotic properties are ensured by theorems, we need to associate a probability with the value of the TS calculated on the given samples, as described in the next section.

Distribution of the test statistic and p-value
In order to perform a hypothesis test, we need to know the distribution of the test statistic f (TS|H 0 ) under the null hypothesis H 0 , to be used to compute the p-value.Classical statistical tests have well-known distributions of the test statistics, e.g.normal, χ 2 or Student-t.In our case, the distribution of TS is not theoretically known, for finite sample sizes.Therefore, it needs to be estimated from the data samples themselves.We employ the resampling method known as the permutation test [39,40] to construct the distribution f (TS|H 0 ) of the TS under the null hypothesis.It is a non-parametric (distribution-free) method based on the idea of sampling different relabellings of the data, under the assumption they are coming from the same parent PDF (null hypothesis).
In more detail, the permutation test is performed by first constructing a pool sample by merging the two samples: U = B∪T , then randomly shuffle (sampling without replacement) the elements of U and assign the first N B elements to B, and the remaining N T elements to T .Next, one computes the value of the test statistic on T .If one repeats this procedure for every possible permutation (relabelling) of the sample points, one collects a large set of test statistic values under the null hypothesis which provides an accurate estimation of its distribution (exact permutation test).However, it is often impractical to work out all possible permutations, so one typically resorts to perform a smaller number N perm of permutations, which is known as an approximate (or Monte-Carlo) permutation test.The TS distribution is then reconstructed from the N perm values of the test statistic obtained by the procedure outlined above.
The distribution of the test statistic under a permutation test is asymptotically normal with zero mean in the large sample limit N B , N T → ∞ [40], as a consequence of the Central Limit Theorem.Furthermore, when the number N perm is large, the distribution of the pvalue estimator approximately follows a normal distribution with mean p and variance p(1 − p)/N perm [39,41].For example, if we want to know the p-value in the neighborhood of the significance level α to better than α/3, we need N perm > 9(1 − α)/α, which is of the order of 1000 for α = 0.01.
Once the distribution of the test statistic is reconstructed, it is possible to define the critical region for rejecting the null hypothesis at a given significance α, defined by large enough values of TS obs such that the corresponding p-value is smaller than α.
As anticipated in Section 2.2, for finite samples the test statistic distribution is still approximately symmetric around the mean, but the latter may deviate from zero.In order to account for this general case, and give some intuitive meaning to the size of the test statistic, it is convenient to standardize (or 'studentize') the TS to have zero mean and unit variance.Let μ, σ be the mean and the variance of test statistic under the distribution f (TS|H 0 ).We then transform the test statistic as which is distributed according to ) Output: p-value of the null hypothesis.U n ← randomly reshuffle B ∪ T T ← remaining N T elements of U n 10: end for 14:  with zero mean and unit variance.With this redefinition, the two-sided p-value can be easily computed as (2.11)

Summary of the algorithm
The pseudo-code of the algorithm presented in this paper is summarized in Table 1.We implemented it in Python and an open-source package is available on GitHub2 .3 Applications to simulated data

Case study: Gaussian samples
As a first case study of our method let us suppose we know the original distributions from which the benchmark and trial samples are randomly drawn.For instance, let us consider the two-dimensional case (D = 2) with two-variate Gaussian distributions defined by mean vectors µ B,T and covariance matrices Σ B,T : In general, in the case of multi-variate Gaussian distributions of dimension D, the KL divergence can be computed analytically (see Eq. (A.4)).
In the large sample limit, we recover that the test statistic converges to the true KL divergence between the PDFs (see Figure 2).Of course, the comparison is possible because we knew the parent PDFs p B , p T .
For our numerical experiments we fix the benchmark B sample by the parameters µ B = (1.0,1.0) T , Σ B = ((1.0,0.0), (0.0, 1.0)), and we construct 4 different trial samples T G0 , T G1 , T G2 , T G3 drawn by Gaussian distributions whose parameters are defined in Table 2.Each sample consists of 20 000 points randomly drawn from the Gaussian distributions defined above.Notice that the first trial sample T G0 is drawn from the same distribution as the benchmark sample.To help the reader to visualize the situation, the benchmark and T G3 trial samples are shown in Figure 3.
The running time to compute the p-value on a standard laptop for two 2-dimensional samples of 20 000 points each, and for 1000 permutations, was about 2 minutes.The running time scales linearly with the number of permutations.The number of sample points (N B,T ) plays an important role.As an example, we sampled the same datasets B, T G0 , T G1 , T G2 , T G3 with N B = N T = 2000 points, i.e. ten times less points than for the cases shown in Table 2.The results for the equivalent significance for T G0 , T G1 , T G2 , T G3 are Z = 1.4σ, 1.9σ, 1.9σ, 2.3σ, respectively.Clearly, the test is not able to reject the null hypothesis at more than 99%CL (p-value is never below 0.01, which would correspond to Z > 2.6σ) in none of the cases.As another illustration of this point, we run the statistical test for B = T G0 vs T = T G3 for different sample sizes N B = N T and show the resulting Z significance in Figure 4. We find that for N B ≤ 10 4 , the test is not able to reject the null hypothesis at more than 99%CL.
Therefore, the power of our statistical test increases for larger sample sizes, as expected since bigger samples lead to more accurate approximations of the original PDFs.

Case study: Monojet searches at LHC
A model-independent search at the LHC for physics Beyond the Standard Model (BSM), such as Dark Matter (DM), has been elusive [3][4][5][6].Typically it is necessary to simulate the theoretical signal in a specific model, and compare with data to test whether the model is excluded.The signal-space for DM and BSM physics in general is enormous, and despite thorough efforts, the possibility exists that a signal has been overlooked.The compatibility test described in Section 2 is a promising technique to overcome this challenge, as it can search for deviations between the expected simulated Standard Model signal and the true data, without any knowledge of the nature of the new physics.
In a real application of our technique by experimental collaborations, the benchmark dataset B will be a simulation of the SM background, while the trial dataset T will consist of real measured data, potentially containing an unknown mix of SM and BSM events.As a proof-of-principle, we test whether our method would be sensitive to a DM signature in the monojet channel.For our study, both B and T will consist of simulated SM events ('background'), however T will additionally contain injected DM events ('signal').The goal is to determine whether the algorithm is sensitive to differences in B and T caused by this signal.

Model and simulations
The signal comes from a standard simplified DM model (see e.g.Ref. [42] for a review) with Fermion DM χ and an s-channel vector Z mediator [43,44].Our benchmark parameters are g χ = 1, g q = 0.1, g = 0.01, in order to match the simplified model constraints from the ATLAS summary plots [45].We use a DM mass of 100 GeV, and mediator masses of (1200, 2000, 3000) GeV, in order to choose points that are not yet excluded but could potentially be in the future [45].
Signal and background events are first simulated using MG5 aMC@NLO v2.6.1 [46] at center-of-mass energy √ s = 13 TeV, with a minimal cut of E miss T > 90 GeV, to emulate trigger rather than analysis cuts.We use Pythia 8.230 [47] for hadronization and Delphes 3.4.1 [48] for detector simulation.The so-called 'monojet' signal consists of events with missing energy from DM and at least one high-p T jet.The resulting signal cross-section is σ signal = (20.4,3.8, 0.6) pb for M med = (1200, 2000, 3000) GeV respectively.For the background samples, we simulate 40 000 events of the leading background, Z → ν ν + nj where n is 1 or 2, resulting in a cross section of σ background = 202.6 pb.
The Delphes ROOT file is converted to LHCO and a feature vector is extracted with Python for each event, consisting of p T and η for the two leading jets; the number of jets; missing energy E miss T ; Hadronic energy H T ; and ∆φ between the leading jet and the missing energy.Together this gives an 8-dimensional feature vector (D = 8), which is scaled to zero-mean unit-variance based on the mean and variance of the background simulations.This feature vector is chosen to capture sufficient information about each event while keep running time of the algorithm reasonable.Other choices of the feature vector could be chosen to capture different aspects of the physical processes, including higher-or lowerlevel features, such as raw particle 4-vectors.Application of high-performance computing resources would allow the feature vector to be enlarged, potentially strengthening results.A full study of the choice of feature vector is left to future work.Our simulation technique is simple and designed only as a proof of principle; we do not include sub-leading SM backgrounds, nor full detector effects, adopting a generic Delphes profile.

Test Statistic distribution under null hypothesis
Following the technique described in Section 2, for each of the 3 considered points in signal model parameter space, we first construct an empirical distribution of the test statistic under the null hypothesis, f (TS|H 0 ), and we then measure TS obs and compute the pvalue to determine the compatibility of the datasets.We choose K = 5 and f (TS|H 0 ) is constructed over N perm = 3000.
The pool sample B ∪ T consists of the 40 000 background events, along with a number of signal events proportional to the signal cross-section.We define B and T as having an equal number of background events, so that N signal = 20 000 × σ signal /σ background , N T = 20 000 + N signal .The resulting distribution of TS under the null hypothesis is shown in Fig. 5.The simulations are relatively fast, taking approximately an hour per 1000 permutations on a standard laptop, although computation time grows as a power-law with the number of events, such that further optimization and high-performance computing resources will be a necessity for application to real LHC data with many thousands of events.The statistics of f (TS|H 0 ) converge quickly, as shown in Fig. 6, consistent with the discussion of N perm in Section 2.4, and showing that N perm is more than sufficient.
Note that since B, T are chosen from permutations of B ∪ T , it is not necessary to specify how the 40 000 background events are divided between B and T ; It is only necessary to specify N B and N T at this point.

Observed Test Statistic
To test whether the null hypothesis would be excluded in the event of an (otherwise unobserved) DM signal hiding in the data, we calculate TS obs using B containing only background, and T containing background plus a number of signal events proportional to the relative cross section.In a practical application of this technique by the experimental collaborations, B would instead correspond to background simulations, while T would be the real-world observation; therefore only one measurement of TS obs would be performed.
However, in our case the distribution of TS under the null hypothesis is insensitive to the way the 40 000 background events are divided between B and T .Therefore we can simulate multiple real-world measurements of TS obs by dividing the 40 000 background events between B and T in different permutations (always keeping 20 000 background events in each sample).This allows us to be more robust: since TS obs is itself a random variable, multiple measurements of TS obs allows us to avoid the claim of a small p-value, when in reality the algorithm may not be sensitive to a small signal.
The calculation of TS obs is performed for 100 random divisions.The p-value and significance Z of each TS obs are calculated with respect to the empirical distribution f (TS|H 0 ) where possible.In many cases, TS obs is so extreme that it falls outside the measured range of f (TS|H 0 ), in which case p and Z are determined from a Gaussian distribution with mean μ and variance σ2 .This is equivalent to assuming that f (TS|H 0 ) is well-approximated by a Gaussian, which is true to a good approximation, as seen in Fig. 5.To be conservative, the technique is only considered sensitive to the signal if all simulated observations of TS exclude the null hypothesis, i.e. we show the minimum Z significance (and maximum p-value).These results are shown in Table 3.2, where we see that the background-only hypothesis is strongly excluded for T 1 and T 2 , even though these points are not yet excluded by traditional LHC searches.Bear in mind that this is a proof-of-concept, and real-world results are unlikely to be as clean, as discussed in Section 3.3.

Discussion
To study the threshold to which this technique is sensitive, we can construct T by adding an arbitrary number of signal events to the background, without reference to the relative signal cross-section.The result is shown in Figure 7 (left panel), using the signal dataset Overlayed is a Gaussian distribution with the same mean and standard deviation as the data.

Sample
M with M med = 1200 GeV.For each value of N sig , the distribution f (TS|H 0 ) is constructed over 1000 permutations, and the Z significance is determined through taking the minimum value of Z over 100 measurements of T obs for different background permutations.There is a clear threshold, below which the significance is negligible and constant, and above which the significance grows as a power-law.The number of signal events in T 2 crosses this threshold while T 3 does not, explaining the rapid drop in the significance.The strength of the technique is also sensitive to the number of samples.Figure 7 (right panel) demonstrates this, again using the signal dataset with M med = 1200 GeV, N perm = 1000, and taking the minimum Z over 100 measurements of T obs .It shows an approximately power-law growth in the significance, consistent with the same growth in the significance with number of signal events.Clearly, the more data the better.

Future application to real data
In a practical application of this technique by experimental collaborations, B would correspond to simulations of the SM background, while T would be the real-world observation, consisting of an unknown mix of signal and background events.Both B and T could be constructed under the same set of minimal cuts, imposed based on trigger requirements rather than as a guide to finding new physics.While the technique itself is model-independent, there is freedom to apply physical knowledge in the choice of minimal cuts to keep the background simulation and data load manageable, and in the choice of feature vector, which can either be low-level (raw 4-vectors of reconstructed objects, or even pixel hits) or high-level (missing energy, hadronic energy etc.).
Even though we have only applied our method to a generic monojet signal, the strength  of the algorithm is that it is sensitive to unspecified signals, and is limited only by the accuracy of the background simulation.We emphasize that our case study in Section 3.2 is a proof of concept with a generic signal and a naïve estimation of the background.
Accurately estimating SM backgrounds at the LHC is a significant challenge in the field and must be considered carefully in any future application of this technique.Currently used techniques of matching simulations to data in control regions still allow the use of our method, although this introduces some model-dependent assumptions.Alternatively, one may apply our statistical test in the context of data-driven background calculation, as a validation tool to measure the compatibility of Monte-Carlo simulations with data in control regions.
Our statistical test alone is not sufficient to claim discovery in cases where background simulations are not sufficiently accurate, but this does not weaken the value of the method.It remains valuable as a tool to identify regions of excess in a model-independent way, allowing follow-up hand-crafted analyses of potential signal regions.

Directions for extensions
In this section we summarize three main directions to extend and improve the method proposed in this paper.We limit ourselves to just outlining some ideas, leaving a more complete analysis of each of these issues to future work.

Inclusion of sample uncertainties
So far we have considered that both B and T samples are precisely known.However, in several situations of physical interest this may not be the case, as the features may be known only with some uncertainty.There can be several factors affecting the precision with which each sample point x j is known, for instance systematic uncertainties (e.g. the smearing effects of the detector response) and the limited accuracy of the background (Monte-Carlo simulation), which may be particularly poor in some regions of the feature space.
Of course, once such uncertainties are properly taken into account, we expect a degradation of the results of the statistical test described in this paper, leading to weaker conclusions about the rejection of the null hypothesis.
In light of these effects, which we have neglected so far, we reiterate that the results of Section 3.2 should be taken with a grain of salt: they show the strengths of the statistical test we are proposing and prove it is worthwhile to investigate it further, but they will very likely be weakened in a real-world situation.

Adaptive choice of the number of nearest neighbors
The procedure for the density ratio estimator described in Section 2.3 relies on choosing the number K of NN.As mentioned earlier, it is also possible to make the algorithm completely unsupervised by letting it choose the optimal value of K.
One approach is to proceed by model selection as in Refs.[29,37,49].We define the loss function as a mean-squared error between the true (unknown) density ratio r(x) = p T (x)/p B (x) and the estimated density ratio r(x) = pT (x)/p B (x) over the benchmark PDF p B (x), where the last term is constant and can be dropped, thus making the loss function independent of the unknown ratio r(x).The estimated loss function is obtained by replacing the expectations over the unknown PDF p B with the empirical averages So, one can perform model selection by minimizing the estimated loss function (4.3) with respect to the parameter K and choosing this value of K as the optimal one.However, this procedure may be computationally intensive as it requires running the full algorithm several times (one for each different value of K).
Another approach is to implement the Point-Adaptive k-NN density estimator (PAk) [50][51][52], which is an algorithm to automatically find a compromise between large variance of the k-NN estimator (for small K), and large bias (for large K) due to variations of the density of points.

Identifying the discrepant regions
Suppose that after running the statistical test described in this paper one finds a p-value leading to a rejection of the null hypothesis, or at least for evidence of incompatibility between the original PDFs.This means that the absolute value of the test statistic on the actual samples |TS obs | is large enough to deviate from zero significantly (to simplify the discussion, we assume in this subsection that TS obs > 0 and the distribution of TS has zero mean and unit variance: μ = 0, σ = 1).Then, our algorithm has a straightforward by-product: it allows to characterize the regions in feature space which contribute the most to a large TS obs .
From the expression of the test statistic in Eq. (2.8) we see that we may associate a density field (x j ) to each point x j ∈ T as such that the test statistic is simply given by the expectation value (arithmetic average) of u(x j ) over the whole trial sample T It is then convenient to define a z-score field over the trial sample, by standard normalization of u(x j ) as One can then use this score field to identify those points in T which are significantly larger than TS obs , and they can be interpreted as the regions (or clusters) where the two samples manifest larger discrepancies.This way, the z-score field provides a guidance for characterizing the regions in feature space where the discrepancy is more relevant, similar in spirit to regions of large signal-tobackground ratio.For instance, the points x j with z(x j ) larger than a given threshold, e.g.z(x j ) > 3, are the points where one expects most of the "anomaly" to occur.An example of this is shown in Figure 8, where a circular B sample is compared with a cross-like T sample.As expected, the z-field has higher density in correspondence of the corners of the cross.Such regions of highest incompatibility between trial and benchmark samples may even be clustered using standard clustering algorithms, thus extending the method studied in this paper with another unsupervised learning technique.
Once they have been characterized and isolated, these high-discrepancy regions in feature space can provide a guidance for further investigation, in order to identify what causes the deviations.For example, they can be used to place data selection cuts.

Conclusions
Many searches for new phenomena in physics (such as searches for New Physics at the LHC) rely on testing specific models and parameters.Given the unknown nature of the physical phenomenon we are searching for, it is becoming increasingly important to find model-independent methods that are sensitive to an unknown signal hiding in the data.
The presence of a new phenomenon in data manifests itself as deviations from the expected distribution of data points in absence of the phenomenon.So, we propose a general statistical test for assessing the degree of compatibility between two datasets.Our method is model-independent and non-parametric, requiring no information about the parameters or signal spectrum of the new physics being tested; it is also un-binned, taking advantage of the full multi-dimensional feature space.
The test statistic we employ to measure the 'distance' between two datasets is built upon a nearest-neighbors estimation of their relative local densities.This is compared with the distribution of the test statistic under the null hypothesis.Observations of the test statistic at extreme tails of its distribution indicate that the two datasets come from different underlying probability densities.
Alongside an indication of the presence of anomalous events, our method can be applied to characterize the regions of discrepancy, providing a guidance for further analyses even in the case where one of the two samples (e.g. the background) is not known with enough accuracy to claim discovery.
The statistical test proposed in this paper has a wide range of scientific and engineering applications, e.g. to decide whether two datasets can be analyzed jointly, to find outliers in data, to detect changes of the underlying distributions over time, to detect anomalous events in time-series data, etc.
In particular, its relevance for particle physics searches at LHC is clear.In this case the observed data can be compared with simulations of the Standard Model in order to detect the presence of New Physics events in the data.Our method is highly sensitive even to a small number of these events, showing the strong potential of this technique.

Figure 1 .
Figure 1.Schematic view of the proposed method to compute the p-value of the null hypothesis that the two samples are drawn from the same probability density.

Figure 3 .
Figure 3. Benchmark (magenta crosses) and Trial T G3 (blue squares) samples of 20 000 2dimensional points randomly drawn from gaussians with unit covariance matrices and mean vectors: µ B = (1.0,1.0) T , µ T = (1.15,1.15) T .Although the two samples are very similar to each other (the means of the original PDFs are shifted by 15% along each axis) our statistical test can distinguish them at 4.9σ significance.

Figure 4 .
Figure 4.The Z significance of the test for different sample sizes.The B,T samples have the same size N B = N T .We compare B = T G0 and T = T G3 , as defined in Table 2, using K = 5 and N perm = 1000.

3 Figure 5 .
Figure 5. Distribution of the test statistic under the null hypothesis for our 3 signal points.Overlayed is a Gaussian distribution with the same mean and standard deviation as the data.

Figure 6 .
Figure 6.Effect of N perm on the null-hypothesis test statistic for the monojet study with T 2 .

Figure 7 .
Figure 7.The effect of N sig (left) and N B (right) on the ability of the algorithm to distinguish B and T .For the left figure, N B = 20 000 background events and N T = N B + N sig .(Based on the actual simulated signal and background cross-sections, the true value is N sig = 2010.)In the right figure, N T = N B + N sig , where N sig varies in proportion to N B and the relative signal/background cross-sections.In both cases, we use the trial sample T 1 corresponding to the signal with M med = 1.2 TeV.

Figure 8 .
Figure 8. Upper panel: benchmark (magenta crosses, left) and trial (blue squares, right) samples.Lower panel: points of trial sample with z > 3.0; this condition isolates the regions where most of the discrepancy between samples occurs.

Table 1 .
Pseudo-code for the two-sample test algorithm, using nearest neighbors density ratio estimation.

Table 3 .
Summary of monojet results comparing B (background only) with T (background plus DM signal).The cross section corresponding to the trial sample is simply given by σ T = σ background + σ signal .The p-value and Z statistic show the compatibility between B and T ; Large Z indicates that T is not consistent with the background-only hypothesis.