Machine-Learned Exclusion Limits without Binning

Machine-Learned Likelihoods (MLL) combines machine-learning classification techniques with likelihood-based inference tests to estimate the experimental sensitivity of high-dimensional data sets. We extend the MLL method by including Kernel Density Estimators (KDE), which avoids binning the classifier output when extracting the resulting one-dimensional signal and background probability density functions. We first test our method on toy models generated with multivariate Gaussian distributions, where the true probability density functions are known. We then apply the method to two cases of interest at the LHC: a search for exotic Higgs bosons, and a $Z'$ boson decaying into lepton pairs. In contrast to physics-based quantities, the typical fluctuations of the ML outputs give non-smooth probability distributions for pure-signal and pure-background samples. The non-smoothness is propagated into the density estimation because of the flexibility of the KDE method. We study its impact on the final significance computation, and we compare the results with those obtained using the average of several independent ML output realizations, which yields smoother distributions. We conclude that the significance estimation is not sensitive to this issue.

Recently, a simplification of [21] has been proposed in [36], the so-called Machine-Learned Likelihoods (MLL), which computes the expected experimental sensitivity through the use of ML classifiers, utilizing the entire discriminant output. A single ML classifier estimates the individual probability densities, and subsequently one can calculate the statistical significance for a given number of signal and background events (S and B, respectively) with traditional hypothesis tests. By construction, the output of the classifier is always one-dimensional, so we reduce the hypothesis test to a single parameter of interest, the signal strength µ. On the one hand, the method is simple and reliably applicable to any high-dimensional problem. On the other hand, using all the information available from the ML classifier does not require defining working points, unlike traditional cut-based analyses. The ATLAS and CMS Collaborations incorporate similar methods in their experimental analyses, but they treat the classifier output as a variable to be binned and fitted with the binned likelihood formula (see, for instance, Refs. [37-44]).
The MLL code [45] developed in [36] only includes the calculation of the discovery hypothesis test, although the expressions needed to calculate the exclusion limits were provided. In [46] we extended the MLL method by adding the exclusion hypothesis test. It is well known that unbinned methods can perform better than binned ones, since the loss of information is minimized. In that sense, in this work we improve the MLL method with the use of Kernel Density Estimators (KDE) [47,48], in order to avoid binning the ML classifier output when extracting the resulting one-dimensional signal and background probability density functions (PDFs), instead of using the binned PDFs of [36,46]. Applying unbinned methods to the ML output space has intrinsic difficulties that are usually absent when one considers physics-based features. Specifically, the stochasticity of the machine-learning training introduces fluctuations, even when the classifier approaches its optimal limit. These fluctuations translate into non-smooth distribution functions that, in turn, are propagated by the KDE into the density estimation, given the flexibility of this consistent non-parametric method [49]. Therefore, it is necessary to analyze the impact of the lack of smoothness on the statistical analysis. We propose to tackle this issue by working with a variable built from the average of several independent machine-learning realizations, which gives smoother PDFs.
We would like to highlight that binned methods are commonly used because one can usually optimize the binning to extract nearly all of the benefits of the unbinned approach, but this optimization can be a highly non-trivial and scenario-dependent task. Incorporating KDE within our framework removes the need for any binning optimization and outperforms some of the most common binning schemes. For illustration, we compare the results of our unbinned MLL method with those obtained with linear and non-linear binnings in the toy examples used to validate our setup, where the true PDFs are known.
The structure of the paper is the following. Section 2 summarizes the main features of the MLL method, with the relevant expressions for the calculation of exclusion limits and the implementation of KDE in it. In Section 3 we show the performance of the MLL method with KDE and analyze the application of this unbinned method to the ML output space in different examples. In Section 3.1 we study a case where the true probability density functions (PDFs) are known, through a toy model generated with multivariate Gaussian distributions. In Section 3.2 we present an LHC analysis for the search for new heavy neutral Higgs bosons at $\sqrt{s} = 8$ TeV and a luminosity of 20 fb$^{-1}$, estimating exclusion limits and also comparing our results with those reported in [12]. In Section 3.3 we present an HL-LHC study for Sequential Standard Model (SSM) [50] $Z'$ bosons decaying into lepton pairs, comparing the MLL+KDE performance for estimating 95% CL exclusion limits with the results obtained by applying a binned likelihood to the machine learning classifier output, and also with the projections reported by the ATLAS Collaboration for an LHC center-of-mass energy of $\sqrt{s} = 14$ TeV with a total integrated luminosity of L = 3 ab$^{-1}$ [51]. Finally, Section 4 summarizes our most important results and conclusions.

Method
In this section, we present the corresponding formulae for the estimation of exclusion sensitivities with the MLL method, first introduced in [36,46]. We summarize the main features of the method, which allows dealing with data of arbitrarily high dimension through a simple ML classifier while using the traditional inference tests to compare a null hypothesis (the signal-plus-background one) against an alternative one (the background-only one). We also present the details of the implementation of KDE to obtain the unbinned posterior probability distributions from the classifier output, which are needed to compute the corresponding likelihood functions.
Following the statistical model in [52], we can define the likelihood $\mathcal{L}$ of N independent measurements with an arbitrarily high-dimensional set of observables x as

$$\mathcal{L}(\mu, s, b) = \mathrm{Poiss}\left(N \,|\, \mu S + B\right) \prod_{i=1}^{N} p(x_i | \mu, s, b), \tag{1}$$

where S (B) is the expected total signal (background) yield, Poiss stands for a Poisson probability mass function, and $p(x|\mu, s, b)$ is the probability density for a single measurement x, where µ defines the hypothesis we are testing for. We can model the probability density containing the event-by-event information as a mixture of signal and background densities,

$$p(x | \mu, s, b) = \frac{\mu S}{\mu S + B}\, p_s(x) + \frac{B}{\mu S + B}\, p_b(x), \tag{2}$$

where $p_s(x) = p(x|s)$ and $p_b(x) = p(x|b)$ are, respectively, the signal and background probability density functions (PDFs) for a single measurement x, and $\frac{\mu S}{\mu S + B}$ and $\frac{B}{\mu S + B}$ are the probabilities of an event being sampled from the corresponding probability distributions.
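As an intermediate step that may help in following the derivation, the logarithm of the likelihood in Eq. (1) can be written out with the mixture model inserted. This is only a sketch, assuming the standard Poisson form $\mathrm{Poiss}(N|\lambda) = \lambda^N e^{-\lambda}/N!$, so that the factor $(\mu S + B)^N$ cancels against the normalization of the mixture:

```latex
\ln \mathcal{L}(\mu, s, b)
  = -(\mu S + B) - \ln N!
    + \sum_{i=1}^{N} \ln\left[\mu S\, p_s(x_i) + B\, p_b(x_i)\right],
\qquad
\frac{\partial \ln \mathcal{L}}{\partial \mu}
  = -S + \sum_{i=1}^{N} \frac{S\, p_s(x_i)}{\mu S\, p_s(x_i) + B\, p_b(x_i)} .
```

Setting the derivative to zero gives the condition that the maximum-likelihood estimator of the signal strength has to satisfy, and differences of $\ln \mathcal{L}$ at two values of µ give the likelihood ratios used in the hypothesis tests.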
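To derive upper limits on µ, and in particular considering additive new physics scenarios (µ ≥ 0), we need to consider the following test statistic for exclusion limits [53]:

$$\tilde{q}_\mu = \begin{cases} 0 & \text{if } \hat{\mu} > \mu, \\ -2 \ln \dfrac{\mathcal{L}(\mu, s, b)}{\mathcal{L}(\hat{\mu}, s, b)} & \text{if } 0 \le \hat{\mu} \le \mu, \\ -2 \ln \dfrac{\mathcal{L}(\mu, s, b)}{\mathcal{L}(0, s, b)} & \text{if } \hat{\mu} < 0, \end{cases} \tag{3}$$

where $\hat{\mu}$ is the parameter that maximizes the likelihood in Eq. (1),

$$\sum_{i=1}^{N} \frac{p_s(x_i)}{\hat{\mu} S\, p_s(x_i) + B\, p_b(x_i)} = 1. \tag{4}$$

Considering our choice for the statistical model in Eq. (1), $\tilde{q}_\mu$ turns out to be

$$\tilde{q}_\mu = \begin{cases} 0 & \text{if } \hat{\mu} > \mu, \\ 2(\mu - \hat{\mu}) S - 2 \displaystyle\sum_{i=1}^{N} \ln \frac{B\, p_b(x_i) + \mu S\, p_s(x_i)}{B\, p_b(x_i) + \hat{\mu} S\, p_s(x_i)} & \text{if } 0 \le \hat{\mu} \le \mu, \\ 2 \mu S - 2 \displaystyle\sum_{i=1}^{N} \ln \left(1 + \frac{\mu S\, p_s(x_i)}{B\, p_b(x_i)}\right) & \text{if } \hat{\mu} < 0. \end{cases} \tag{5}$$

Since $p_{s,b}(x)$ are typically not known, the basic idea of our method in [36] is to replace these densities with the one-dimensional manifolds that can be obtained for signal and background from a machine-learning classifier. After training the classifier with a large and balanced data set of signal and background events, one obtains the classification score o(x) that minimizes the binary cross-entropy (BCE) loss and thus approaches [21,54]

$$o(x) \to \frac{p_s(x)}{p_s(x) + p_b(x)} \tag{6}$$

as the classifier approaches its optimal performance. The dimensionality reduction can be done by dealing with o(x) instead of x, using

$$p_s(x) \to \tilde{p}_s(o(x)), \qquad p_b(x) \to \tilde{p}_b(o(x)), \tag{7}$$

where $\tilde{p}_{s,b}(o(x))$ are the distributions of o(x) for signal and background, obtained by evaluating the classifier on a set of pure signal or background events, respectively. Notice that this allows us to approximate both signal and background distributions individually, retaining the full information contained in both densities, without introducing any working point. These distributions are one-dimensional, and therefore can always be easily handled and incorporated into the test statistic in Eq. (5),

$$\tilde{q}_\mu = \begin{cases} 0 & \text{if } \hat{\mu} > \mu, \\ 2(\mu - \hat{\mu}) S - 2 \displaystyle\sum_{i=1}^{N} \ln \frac{B\, \tilde{p}_b(o(x_i)) + \mu S\, \tilde{p}_s(o(x_i))}{B\, \tilde{p}_b(o(x_i)) + \hat{\mu} S\, \tilde{p}_s(o(x_i))} & \text{if } 0 \le \hat{\mu} \le \mu, \\ 2 \mu S - 2 \displaystyle\sum_{i=1}^{N} \ln \left(1 + \frac{\mu S\, \tilde{p}_s(o(x_i))}{B\, \tilde{p}_b(o(x_i))}\right) & \text{if } \hat{\mu} < 0, \end{cases} \tag{8}$$

as well as into the condition on $\hat{\mu}$ from Eq. (4),

$$\sum_{i=1}^{N} \frac{\tilde{p}_s(o(x_i))}{\hat{\mu} S\, \tilde{p}_s(o(x_i)) + B\, \tilde{p}_b(o(x_i))} = 1. \tag{9}$$

The test statistic in Eq. (8) is estimated through a finite data set of N events and thus has a probability distribution conditioned on the true unknown signal strength µ′. For a given hypothesis described by the µ′ value, we can estimate the $\tilde{q}_\mu$ distribution numerically. When the true hypothesis is assumed to be the background-only one (µ′ = 0), the median expected exclusion significance is defined as

$$\mathrm{med}[Z_\mu | 0] = \sqrt{\mathrm{med}[\tilde{q}_\mu | 0]}, \tag{10}$$

where we estimate the $\tilde{q}_\mu$ distribution by generating a set of pseudo-experiments with background-only events. Then, to set upper limits at a certain confidence level, we select the lowest µ that achieves the required median expected significance.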
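The procedure just described can be illustrated with a minimal sketch. It is not the MLL code of [45]: the function names, the bisection solver for the estimator of the signal strength, and the toy one-dimensional densities (p_s = 2o and p_b = 2(1 − o) on [0, 1]) are all assumptions made here for illustration.

```python
import numpy as np

def q_tilde(mu, S, B, ps, pb, tol=1e-8):
    """Exclusion test statistic for one pseudo-experiment.

    ps, pb: signal and background densities evaluated at the classifier
    output of every event in the pseudo-experiment (pb must be > 0).
    """
    def score(m):
        # d(lnL)/d(mu) divided by S; decreases monotonically with m for m >= 0
        return np.sum(ps / (m * S * ps + B * pb)) - 1.0

    def minus2_ln_ratio(m_num, m_den):
        # -2 ln [ L(m_num) / L(m_den) ] for the unbinned mixture model
        num = m_num * S * ps + B * pb
        den = m_den * S * ps + B * pb
        return 2.0 * (m_num - m_den) * S - 2.0 * np.sum(np.log(num / den))

    if score(0.0) < 0.0:            # best-fit signal strength is negative
        return minus2_ln_ratio(mu, 0.0)
    if score(mu) > 0.0:             # best-fit signal strength exceeds mu
        return 0.0
    lo, hi = 0.0, mu                # otherwise bisect for the root in [0, mu]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return minus2_ln_ratio(mu, 0.5 * (lo + hi))

def median_exclusion_z(mu, S, B, ps_fn, pb_fn, sample_b, n_pseudo=200, seed=0):
    """Median significance from background-only pseudo-experiments."""
    rng = np.random.default_rng(seed)
    qs = []
    for _ in range(n_pseudo):
        n_events = rng.poisson(B)           # fluctuate the total yield
        o = sample_b(n_events, rng)         # background-only classifier outputs
        qs.append(q_tilde(mu, S, B, ps_fn(o), pb_fn(o)))
    return np.sqrt(np.median(qs))

# Toy densities on [0, 1], assumed only for this illustration.
def ps_toy(o):
    return 2.0 * o

def pb_toy(o):
    return 2.0 * (1.0 - o)

def sample_b_toy(n, rng):
    # inverse-CDF sampling of pb_toy; values lie in [0, 1), so pb_toy > 0
    return 1.0 - np.sqrt(1.0 - rng.random(n))
```

For example, `median_exclusion_z(1.0, 50.0, 1000.0, ps_toy, pb_toy, sample_b_toy)` returns the median expected significance for excluding µ = 1 with S = 50 and B = 1000 in this toy setup. In an actual analysis, `ps_fn` and `pb_fn` would be the densities extracted from the classifier output.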
It is worth remarking that the output of the machine learning classifier, for a given set of events, gives us a sample of the desired PDFs $\tilde{p}_{s,b}(o(x))$. Hence, to apply Eq. (8) we first need to extract the classifier posteriors. As these samples are one-dimensional, we can always compute binned PDFs, as was done in [36]. Binning the output variable is a typical procedure when using ML tools. Nevertheless, it is also possible to compute the PDFs through other parametric methods (such as Mixture Models [55]) or non-parametric methods (such as Kernel Density Estimation (KDE) [47,48] or Neural Density Estimation [49]). Compared with other density-estimation methods, KDE has the advantage of not assuming any functional form for the PDF, in contrast with Gaussian mixture methods, while keeping the computation and the interpretation simple, as opposed to neural density estimation methods. For this reason, in this work we make extensive use of the KDE method, through its scikit-learn implementation [57].
Given a set of N events that were previously classified by the machine learning as signal (background) events, the PDF estimated by the KDE method is defined as

$$\tilde{p}_{s,b}(o(x)) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\epsilon\left[o(x) - o(x_i)\right],$$

where $\kappa_\epsilon$ is a kernel function that depends on the "smoothing" scale, or bandwidth parameter ϵ. There are several different options for the kernel function. In this work, we used the Epanechnikov kernel [58], as it is known to be the most efficient kernel [49]. This kernel is defined as

$$\kappa_\epsilon(u) = \begin{cases} \dfrac{3}{4\epsilon}\left(1 - \dfrac{u^2}{\epsilon^2}\right) & \text{if } |u| \le \epsilon, \\ 0 & \text{otherwise.} \end{cases}$$

It is important to remark that the bandwidth parameter ϵ controls the degree of smoothness. Hence, a very low ϵ will overfit the data, whereas a very high ϵ will underfit it. In all our examples, ϵ was selected through a grid search using the GridSearchCV function of the sklearn.model_selection Python module. Given a value for ϵ, this function estimates the log-likelihood of the data using a 5-fold cross-validation strategy, i.e., the data set is split into 5 smaller sets, 4 of which are used to fit the KDE, which is then validated on the remaining one. Finally, the function returns the ϵ that maximizes the data likelihood. It is also worth remarking that, although the KDE method suffers from the curse of dimensionality, we avoid this problem because we apply it to the one-dimensional output of the machine learning classifier.
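The bandwidth selection described above can be sketched with scikit-learn as follows. This is an illustrative reconstruction rather than the authors' script: the helper names `fit_kde` and `kde_pdf` and any grid of candidate bandwidths passed to them are assumptions. `KernelDensity.score` returns the total log-likelihood of the held-out fold, which is the quantity `GridSearchCV` maximizes by default.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def fit_kde(scores, bandwidths, cv=5):
    """Fit an Epanechnikov KDE to 1D classifier outputs.

    The bandwidth is chosen by k-fold cross-validation of the
    log-likelihood over the candidate values in `bandwidths`.
    """
    X = np.asarray(scores, dtype=float).reshape(-1, 1)
    search = GridSearchCV(
        KernelDensity(kernel="epanechnikov"),
        {"bandwidth": list(bandwidths)},
        cv=cv,
    )
    search.fit(X)
    return search.best_estimator_, search.best_params_["bandwidth"]

def kde_pdf(kde, o):
    """Evaluate the fitted density at the points o."""
    o = np.asarray(o, dtype=float).reshape(-1, 1)
    # score_samples returns the log-density, so exponentiate it
    return np.exp(kde.score_samples(o))
```

One caveat of this sketch: the Epanechnikov kernel has finite support, so a held-out point farther than ϵ from every training point gets zero density and a log-likelihood of minus infinity. Very small candidate bandwidths can therefore be rejected for that reason alone, and scikit-learn may warn about non-finite scores.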
Notice that the machine learning training (and hence the machine learning predictions) is a stochastic process that introduces small fluctuations around the optimal limit. These in turn could translate into non-smooth PDFs. To tackle this issue, the same procedure described above can be applied to an ensemble of N base classifiers trained on random subsets of the original data set, whose individual predictions are averaged to form a final prediction. In this case, o(x) can simply be replaced by

$$\langle o(x) \rangle = \frac{1}{N} \sum_{i=1}^{N} o_i(x).$$

For completeness, we also introduce here the median exclusion significance estimation for the traditional Binned Likelihood (BL) method with the use of Asimov data sets [53], which will be compared with our technique:

$$\mathrm{med}[Z_\mu | 0] = \sqrt{\tilde{q}_\mu} = \left[ 2 \sum_{d} \left( B_d \ln \frac{B_d}{S_d + B_d} + S_d \right) \right]^{1/2}, \tag{12}$$

where $S_d$ and $B_d$ are the expected number of signal and background events in each bin d. This approximation is very effective but runs into trouble when the dimension of the data grows. This is known as the curse of dimensionality: the number of data points required to reliably populate the bins scales exponentially with the dimension of the observables x. The problem does not arise in our method, which always reduces the original dimension to one as stated in Eq. (7). This also allows the application of the BL method to the classifier output, as done by experimental collaborations when using ML methods, as mentioned in Section 1.
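Both ingredients of this paragraph are simple enough to sketch in a few lines. The function names are hypothetical, and the binned expression is the standard Asimov approximation for a background-only data set tested against the signal-plus-background hypothesis, which requires every bin to have a non-zero expected background.

```python
import numpy as np

def ensemble_average(outputs):
    """Average the outputs of several classifiers.

    outputs: array-like of shape (n_classifiers, n_events).
    Returns one averaged score per event.
    """
    return np.mean(np.asarray(outputs, dtype=float), axis=0)

def binned_exclusion_z(S_d, B_d):
    """Median expected exclusion significance of the binned likelihood.

    S_d, B_d: expected signal and background yields per bin (B_d > 0).
    """
    S_d = np.asarray(S_d, dtype=float)
    B_d = np.asarray(B_d, dtype=float)
    q = 2.0 * np.sum(B_d * np.log(B_d / (S_d + B_d)) + S_d)
    return np.sqrt(q)
```

A useful sanity check is that splitting one bin into several bins with the same signal-to-background ratio leaves the significance unchanged, since the gain from binning comes only from separating regions with different signal purity.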
Application examples

Known true PDFs: multivariate Gaussian distributions
To show the performance of the MLL method with KDE we first analyze toy models generated with multivariate Gaussian distributions of different dimensions, with mean m, and covariance matrix Σ.
We start with the simplest case, consisting of an abstract two-dimensional space $(x_1, x_2)$. Events are generated by Gaussian distributions $\mathcal{N}_2(m, \Sigma)$, with m = +0.3 (−0.3) for S (B) and no correlation, i.e., covariance matrices $\Sigma = I_{2\times 2}$. We trained supervised per-event XGBoost classifiers with 1M events per class (balanced data set) to distinguish S from B. The PDFs obtained from the classifier output, o(x), can be found in the top left panel of Figure 1, for two new independent data sets of pure signal (blue) and pure background (red) events.
Since in this example we know the true underlying distributions in the original multidimensional space, we can test Eq. (6). In the right panel of Figure 1 we show, in green dots, the output of one machine learning realization vs. the right-hand side of Eq. (6) estimated with the real signal and background probability functions. We can observe that the classifier approaches the optimal limit, although there are some small fluctuations around the 1-to-1 line. These fluctuations are independent of the sampling of the data and come from the stochasticity inherent to any machine learning training process. In turn, these fluctuations translate into non-smooth PDFs for the machine learning output of background and signal events, as can be seen in the red and blue shaded histograms in the top left panel of Figure 1.

[Figure 1 caption, partially recovered: ... $\langle o(x) \rangle = \frac{1}{10} \sum_{i=1}^{10} o_i(x)$. Right panel: Comparison between our trained classifier output and the mathematically optimal performance defined in Eq. (6).]
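For this Gaussian toy model the right-hand side of Eq. (6) has a closed form, which makes the comparison easy to reproduce. The sketch below assumes that every component of the signal (background) mean equals +m (−m) and that the covariance is the identity. In that case ln(p_s/p_b) = 2m Σ_j x_j, so the optimal output is a logistic function of the sum of the coordinates. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean):
    """Density of N_n(mean * 1, I) evaluated at each row of x."""
    n = x.shape[1]
    norm = (2.0 * np.pi) ** (-0.5 * n)
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2, axis=1))

def optimal_output(x, m):
    """Closed form of p_s / (p_s + p_b) for means +m (S) and -m (B)."""
    return 1.0 / (1.0 + np.exp(-2.0 * m * np.sum(x, axis=1)))

def max_deviation(o_ml, x, m):
    """Largest absolute departure of a classifier output from the optimum."""
    return np.max(np.abs(np.asarray(o_ml, dtype=float) - optimal_output(x, m)))
```

Passing the scores of a trained classifier as `o_ml` gives a single number that quantifies the scatter around the 1-to-1 line discussed above.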
As explained before, to solve this issue we can take advantage of ensembles and build a variable from the average output of ten independent machine learning realizations, defined as $\langle o(x) \rangle = \frac{1}{10} \sum_{i=1}^{10} o_i(x)$. It can be seen in the red and blue shaded histograms of the bottom left panel of Figure 1 that, with this definition, the small fluctuations are washed out, resulting in smoother PDFs. For completeness, on both left panels we also present the estimations of $\tilde{p}_{s,b}(o(x))$ using the true PDFs (orange and purple solid lines), the KDE over the machine learning output o(x) (red and blue dashed curves of the top left panel), and the KDE over the average variable $\langle o(x) \rangle$ (red and blue dashed lines of the bottom left panel). On the one hand, due to the flexibility of the KDE method, when fitting the machine learning output o(x) the resulting distributions follow the fluctuations around the true PDFs. On the other hand, the KDE distributions obtained when fitting the average variable are smooth and closely approach the true PDFs.
In Figure 2 we show the results for the MLL exclusion significance with KDE considering an example with a fixed background of ⟨B⟩ = 50k and different signal strengths. We also include the significance calculated using the true probability density functions in Eq. (3), and the results employing a binned Poisson log-likelihood of the original two-dimensional space $(x_1, x_2)$ with Eq. (12), which is possible to compute in this simple scenario. For completeness, we also include the results obtained by binning the one-dimensional ML output variable to obtain the PDFs, as in [36,46]. As can be seen, since we are analyzing a simple example, the significances estimated with all the methods are indistinguishable from the ones estimated with the true PDFs, which is expected given the low dimensionality of the space.

[Figure 2 caption, partially recovered: ... for S (B) and no correlation, for fixed ⟨B⟩ = 50k, and different signal strengths ⟨S⟩. The red curves show the result of implementing the MLL+KDE method, while the blue and magenta curves represent the results obtained by applying the BL method to the classifier's one-dimensional output and the original two-dimensional space, respectively. Dashed curves use the output of a single classifier, while solid lines use the averaged output of 10 classifiers. For comparison, we include the green solid curve with the results obtained using the true PDFs.]
We would like to highlight that the significance does not change appreciably whether we employ o(x) computed with a single ML classifier or the averaged variable $\langle o(x) \rangle$ computed from an ensemble of several ML trainings. In addition, for both the MLL+KDE and the true PDF methods, the significance is estimated by generating a set of pseudo-experiments with a finite number of events. This introduces a small statistical fluctuation due to the randomness of the sample.
The advantage of the MLL+KDE method over traditional approaches appears when dealing with dim = n, with n > 2. In Figure 3 we present the exclusion significance for higher dimensional data generated with $\mathcal{N}_n(m, \Sigma)$, no correlation $\Sigma = I_{n\times n}$, and m = +0.[...] the original high-dimensional space. Also, it is interesting to note that the results with the MLL+KDE method approach the ones with the true generative functions for all the analyzed dimensions. It is important to highlight that the ML output is always one-dimensional regardless of the dimension of the input data and, hence, can always be easily binned. For completeness, we show in Figure 3 the significances obtained by applying a BL method to the machine learning output with two different types of binning: a linear binning, where all bins have the same size (in the one-dimensional output space), and a standard non-linear approach, where all bins have the same number of background events (a binning strategy typically used by experimental collaborations since it avoids the presence of low-statistics bins in the background estimation, which in turn constrains systematic uncertainties). As can be seen, binning the output of the machine learning results in a non-negligible drop in significance. This can be understood because the binning introduces a loss of information due to a resolution effect. For this example, the linear binning turns out to be more effective for the BL method.
In addition, and as in the n = 2 example, for MLL+KDE the use of an ensemble of machine learning realizations to obtain smoother PDFs does not change the results obtained with a single classifier. The same holds when using the BL method over the average variable, although this behavior is expected since this method builds histograms from the distributions.
In the left and right panels of Figure 4 we show the impact, in the previous example, of increasing the number of bins when applying BL to the classifier output, for linear and non-linear bins, respectively. As stated before, linear binning proves to be a better choice, since its results approach the ones obtained with the MLL+KDE method and with the true PDFs when increasing the number of bins. Regarding the bins with the same number of background events, even though performance improves with more bins, the results are worse than those of the linear binning. This example shows the difficulties that arise when trying to find an optimal binning that is not known a priori, and highlights the advantage of using MLL+KDE, which, although computationally expensive (when tuning the bandwidth parameter), sets an upper limit on the significance that can be achieved. It is also possible to automatically choose the optimal number of bins for histograms, as well as to tune the width of each bin, in a similar fashion as done for the ϵ parameter in the KDE method.
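The two binning schemes compared here can be sketched as follows. This is an illustration under the assumption that the classifier output lies in [0, 1]; the helper names are hypothetical, and the resulting per-bin yields are the inputs a binned-likelihood significance would take.

```python
import numpy as np

def linear_edges(n_bins):
    """Equal-width bins over the classifier output range [0, 1]."""
    return np.linspace(0.0, 1.0, n_bins + 1)

def equal_background_edges(o_bkg, n_bins):
    """Bins holding (approximately) the same number of background events."""
    edges = np.quantile(o_bkg, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0   # cover the full output range
    return edges

def expected_yields(o_sig, o_bkg, edges, S, B):
    """Expected signal and background yields per bin.

    The fraction of each pure sample falling in a bin is scaled by the
    total expected yields S and B.
    """
    s_frac = np.histogram(o_sig, bins=edges)[0] / len(o_sig)
    b_frac = np.histogram(o_bkg, bins=edges)[0] / len(o_bkg)
    return S * s_frac, B * b_frac
```

With equal-background bins, the signal-rich region near o(x) = 1 is covered by a few wide bins, which is one plausible reading of why this scheme loses more information than linear binning in the toy examples.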
The results of this analysis can be found in Appendix A, where we show that the significances obtained by optimizing the bin widths are similar to the ones assuming equal-sized bins. Hence, the MLL+KDE method still offers the best significances when compared to different binned multivariate approaches.

Finally, in the right panel of Figure 5 we show a case with correlation, $\mathcal{N}_n(m, \Sigma)$, with m = +0.7 (−0.7) for S (B), and $\Sigma_{ij} = 1$ if i = j and 0.5 if i ≠ j. Compared with the same example without correlation in the left panel of Figure 5, the correlation makes the signal and background more difficult to distinguish. Hence we obtain lower significance values, with MLL+KDE still offering the best performance.
Although these are toy models, they allow us to understand the performance of the MLL+KDE method on problems of different complexity and demonstrate its improvement with respect to the BL method applied to the classifier output. In particular, MLL+KDE behaves stably when increasing the dimensionality of the input space, as well as when increasing the separation of the signal and background distributions in the original abstract variables. On the other hand, the BL method applied to the classifier output departs from the results obtained with the true PDFs as the number of dimensions and the separation of signal and background samples increase. The number of bins to use is another limitation, which does not exist in our method since it uses a non-parametric technique for PDF extraction. We also tested that, although the KDE method is sensitive to the fluctuations inherent to the machine learning classifier output, the lack of smoothness of the extracted PDFs does not affect the estimation of the significance within our framework.

New exotic Higgs bosons at the LHC
In this section, we apply our method to the search for an exotic electrically neutral heavy Higgs boson ($H^0$) at the LHC, which subsequently decays to a W boson and a heavy electrically charged Higgs boson ($H^\pm$). This example was first analyzed with machine learning methods in Ref. [12]. The exotic $H^\pm$ decays to another W boson and the SM Higgs boson (h). Taking into account only the dominant decay of the Higgs boson, the signal process is defined as

$$gg \to H^0 \to W^\mp H^\pm \to W^\mp W^\pm h \to W^\mp W^\pm b\bar{b}.$$

The background is therefore dominated by top-pair production, which also gives $W^\mp W^\pm b\bar{b}$.
For our analysis, we use the same data presented in [12], which are publicly available at [59] and focus on the semi-leptonic decay mode of both background and signal events (one W boson decaying leptonically and the other one decaying hadronically), giving $\ell\nu jj b\bar{b}$ as the final state. The data set consists of low-level variables (twenty-one in total, considering the momentum of each visible particle, the b-tagging information of all jets, and the reconstruction of the missing energy) and seven high-level variables ($m_{jj}$, $m_{jjj}$, $m_{\ell\nu}$, $m_{j\ell\nu}$, $m_{b\bar{b}}$, $m_{Wb\bar{b}}$ and $m_{WWb\bar{b}}$), expected to have higher discrimination power between signal and background (see [12] for more details). The signal benchmark case corresponds to $m_{H^0} = 425$ GeV and $m_{H^\pm} = 325$ GeV.
For this example, we have trained three XGBoost classifiers with three different data representations: only low-level variables, only high-level variables, and both low- and high-level features combined. For completeness, we also add the result obtained when using an average variable obtained after ensembling 10 ML classifiers with all the input variables. In the left panel of Figure 6 we show the ROC curves for the analysis. As expected, the best performance was achieved using both low- and high-level features (for both the averaged and non-averaged variable). These results are in agreement with the analysis performed in [12], obtained with different ML algorithms. In the following, we will work with the latter data representation to estimate the expected significance for the search for heavy Higgs bosons.

Table 1: Expected cross-section upper limit at 95% C.L., considering the ATLAS detector, $\sqrt{s} = 8$ TeV and a luminosity of 20 fb$^{-1}$ (B = 86k). Background process and cuts as discussed in the main text.
To compute the expected background yield at the ATLAS detector at $\sqrt{s} = 8$ TeV and a luminosity of 20 fb$^{-1}$, B ≃ 86k, we simulated background events with MadGraph5_aMC@NLO 2.6 [60], using PYTHIA 8 [61] for showering and hadronization, and Delphes 3 [62] for fast detector simulation. We applied all the selection cuts in [12], and checked that the different kinematic distributions from our simulation are in agreement with the ones from the public data set. With the expected background prediction, we scan over the expected signal yield, S, to be agnostic regarding the coupling values of the model.
The exclusion significances for different $S/\sqrt{B}$ ratios are shown in the right panel of Figure 6. The two MLL+KDE results do not differ significantly; they are shown as the red solid curve for the averaged variable of 10 ML classifiers and as the dotted orange curve for 1 ML classifier. We also present as dashed curves the significance obtained by binning the one-dimensional ML output with different numbers of bins. We would like to remark that binning the original feature space is not possible due to its high dimensionality (twenty-one low-level and seven high-level variables). Also for this collider example, it can be seen that the MLL+KDE method outperforms the results obtained with the binned likelihood procedure.
Since no excess has been found, we can compute the expected cross-section upper limit at 95% C.L. for the new exotic Higgs bosons search, which corresponds to the value of $S/\sqrt{B}$ that gives Z = 1.64. The results are presented in Table 1.
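Extracting the limit from a scan over S amounts to finding where the significance curve crosses Z = 1.64. A minimal sketch using linear interpolation between scan points is given below. The function names are hypothetical, and the conversion from a yield to a cross section assumes the usual relation S = σ × luminosity × efficiency, with the efficiency (acceptance times selection efficiency) supplied by the user.

```python
import numpy as np

def yield_upper_limit(s_values, z_values, z_target=1.64):
    """Smallest scanned signal yield reaching the target significance.

    s_values must be increasing; the crossing point is obtained by linear
    interpolation. Returns None if the target is never reached.
    """
    s = np.asarray(s_values, dtype=float)
    z = np.asarray(z_values, dtype=float)
    above = np.nonzero(z >= z_target)[0]
    if len(above) == 0:
        return None
    i = above[0]
    if i == 0:
        return s[0]
    slope = (s[i] - s[i - 1]) / (z[i] - z[i - 1])
    return s[i - 1] + (z_target - z[i - 1]) * slope

def cross_section_limit(s_limit, luminosity, efficiency):
    """Convert a signal-yield limit into a cross-section limit."""
    return s_limit / (luminosity * efficiency)
```

The units of the returned cross section are the inverse of the luminosity units, for example fb if the luminosity is given in fb$^{-1}$.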
For completeness, and to compare with the results of [12], we show in Table 2 the discovery significance for the MLL+KDE and BL methods. Notice that for this calculation we artificially set B = 1000 and S = 100 to directly compare our results with the ones in [12]. The significant improvement in this case is due to the use of the full ML output in both the MLL+KDE and BL methods, while in Ref. [12] only a fraction of o(x) is used to define a signal-enriched region with a working point.

Method                  Z
NN [12]                 3.7σ
DN [12]                 5.0σ
MLL+KDE                 6.61σ
MLL+KDE (10ML)          6.65σ
o(x) BL (100 bins)      6.53σ
o(x) BL (50 bins)       6.52σ
o(x) BL (25 bins)       6.43σ
o(x) BL (10 bins)       6.14σ

Table 2: Discovery significances assuming B = 1000 and S = 100. For comparison, we also include the results for the same case shown in [12] using a shallow neural network (NN) and a deep neural network (DN).

SSM Z ′ boson decaying into lepton pairs at the HL-LHC
In this section, we analyze the performance of our method on a simple collider example, namely the search for an SSM $Z'$ boson decaying into lepton pairs at the HL-LHC. We generated sample events for signal and background with MadGraph5_aMC@NLO 2.6 [60]; the showering and clustering were performed with PYTHIA 8 [61], and the detector simulation was done with Delphes 3 [62]. For the SM background, we considered the Drell-Yan production of $Z/\gamma^* \to \ell\ell$, with ℓ = e, µ, as in [51]. As in the previous examples, we trained an XGBoost classifier, with 1M events per class, to distinguish S from B, for each $Z'$ mass value, $m_{Z'}$ = [2.5, 3.5, 4.5, 5, 5.5, 6.5, 7.5, 8.5] TeV, and final state (dielectron and dimuon).
We use as input parameters the transverse momentum $|p_T|$, the azimuthal angle ϕ, and the pseudorapidity η of the final state leptons in each channel, the kinematic variables that can be extracted directly from the Delphes 3 output file. Considering the expected background prediction, for each parameter point we scan over S to obtain the expected signal yield upper limit at 95% C.L., corresponding to the value that gives Z = 1.64. Finally, we convert this yield to a cross-section upper limit that can be compared with the theoretical prediction. We employ the same setup and detector-level cuts as in the work presented by the ATLAS Collaboration at $\sqrt{s} = 14$ TeV and 3 ab$^{-1}$ [51], but we only generated signal and background events with dielectron and dimuon invariant masses above 1.8 TeV. Since we are dealing with a signal-enriched region and not the entire spectrum, the direct comparison with ATLAS projections for 95% CL exclusion limits is not strictly fair. This may enhance the performance of our classifier, since a $Z'$ signal would appear as an excess at high dilepton invariant masses. However, the power of our method can be seen in the left (right) panel of Figure 7 for the dielectron (dimuon) channel when compared to the BL fit of the ML classifier output, which is on an equal footing with the results of our method since it uses the same ML classifier. Unbinning the signal and background posteriors provides more constraining exclusion limits for both final states and, as in the previous examples, there is no significant difference between MLL+KDE using the output of 1 ML classifier or the average of 10 ML classifiers.

[Figure 7 caption, partially recovered; horizontal axis: $m_{Z'}$ (TeV): ... SSM with MLL+KDE method (red solid curve corresponds to the averaged variable of 10 ML, and dotted cyan curve to 1 ML), and with the BL fit of the ML output using 15 linear bins (blue curve). The shaded area in each case includes a naive estimation of the significance uncertainty caused by the mass variation on each point, according to the systematic uncertainty for the invariant mass estimated by ATLAS in [51]. Left panel: Dielectron channel. Right panel: Dimuon channel.]

Conclusions
The Machine-Learned Likelihoods method can be used to obtain discovery significances and exclusion limits for additive new physics scenarios. It uses a single classifier and its full one-dimensional output, which allows the estimation of the signal and background PDFs needed for statistical inference. In this paper, we extend the MLL method to obtain exclusion significances and improve its performance by using the KDE method to extract the corresponding PDFs from the ML output. We found that the small fluctuations of the machine learning output around the optimal value translate into non-smooth PDFs. We verified that this problem can be handled by averaging the output of several independent machine-learning realizations. Most importantly, we showed that these small fluctuations do not have a major impact on the final significance.
Although binning the classifier output is always possible, irrespective of the dimensionality of the original variables, we verified that computing the PDFs with a non-parametric method such as KDE, which avoids the binning, enhances the performance. By analyzing toy models generated with Gaussian distributions of different dimensions (with and without correlations among the input variables), we showed that MLL with KDE outperforms the BL method (with both linear and non-linear bins) when dealing with high-dimensional data, while for low-dimensional data all the methods converge to the results obtained with the true PDFs. Although it is a well-known fact that almost all the benefits of unbinned approaches can be obtained with optimal binning, avoiding such a (usually cumbersome) process is one of the main advantages of our work, which provides an automatic way of estimating the probability density distributions through the KDE implementation.
Finally, we tested the MLL framework in two physical examples. We found that, as expected, MLL also improves the exclusion limits obtained in a realistic $Z'$ analysis as well as in the search for exotic Higgs bosons at the LHC, surpassing the ones computed with the simple BL fit of the one-dimensional ML output.
Last but not least, we remark that this new version of MLL with KDE does not include systematic uncertainties in the likelihood fit, which are necessary for any realistic search. As this is a highly non-trivial issue for unbinned methods, we leave the inclusion of nuisance parameters in the MLL framework for future work. Nevertheless, we highlight that even though likelihoods without uncertainties cannot be used in most experimental setups, they could be useful in specific scenarios where the nuisance parameters can be considered small, and in phenomenological analyses as proofs of concept.

possible ML output as each one is optimized taking into account the functional form of the data (size, shape, etc.), which in our case is the output of our classifier that depends on the specific physical scenario. For example, the Sturges method is robust for Gaussian data, while the FD method is good for large data samples.
In Table 3 we show the optimal number of bins found by three different methods for $N_{\rm dim}(m, \Sigma)$, with $m = +0.3\,(-0.3)$ for S (B), no correlation, and fixed $\langle B \rangle = 50$k and $\langle S \rangle = 500$, corresponding to the Gaussian example introduced in Section 3.1. As stated before, since there is no general method to choose $N$, we compare three methods whose assumptions fit some properties of our data sets: FD because of our large sample, Sturges because at low dimensions the ML output resembles a normal distribution, and Doane to account for the skewness of the data at high dimensions. Additionally, in Figure 8 we show the significances obtained with these three methods for Gaussian distributions of different dimensions. It is important to highlight that all these methods assume equal-width bins, and hence they show the same tendency already presented in the left panel of Figure 4: the significance increases with the number of bins.

Table 3: Estimation of the optimal number of bins to describe the background data of the example shown in Figure 8, using the FD, Doane and Sturges methods.
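For reference, the three equal-width binning rules can be written in a few lines. The sketch below uses the standard textbook definitions (Sturges: $N = \lceil \log_2 n \rceil + 1$; Doane: Sturges plus a skewness correction; FD: bin width $2\,\mathrm{IQR}/n^{1/3}$). The function names and the toy input are hypothetical, and the quartile convention is that of Python's `statistics.quantiles`, which may differ slightly from the one used to produce Table 3.

```python
import math
import random
import statistics


def sturges_bins(data):
    """Sturges rule: N = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(len(data))) + 1


def doane_bins(data):
    """Doane rule: Sturges with a correction for the sample skewness g1."""
    n = len(data)
    mean = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    g1 = math.fsum(((x - mean) / sigma) ** 3 for x in data) / n
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))


def freedman_diaconis_bins(data):
    """FD rule: width h = 2 * IQR / n^(1/3), N = ceil(range / h)."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2.0 * (q3 - q1) / n ** (1.0 / 3.0)
    return math.ceil((max(data) - min(data)) / width)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for the classifier output on background events.
    background = [rng.betavariate(2.0, 5.0) for _ in range(50_000)]
    for rule in (sturges_bins, doane_bins, freedman_diaconis_bins):
        print(rule.__name__, rule(background))
```

Because the Doane correction term is non-negative, it never returns fewer bins than Sturges for the same sample, while FD grows with the sample size as $n^{1/3}$.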
If we do not assume equal-size bins, we must tune $(N - 1)$ parameters (i.e. the width of each bin), where $N$ is the number of bins. Unlike in the KDE method, we are now dealing with a high-dimensional space. A full exploration of this space is computationally expensive; we therefore performed a data-driven procedure to cross-validate the selection of the number of bins and of each bin size, as follows:

1. We set the number of bins $N$ to the value given by one of the three methods (FD, Doane, and Sturges) applied to the background data set.
2. We randomly select the width of each bin.
3. We divide the background data set into 5 k-folds (as done in the KDE optimization). For each bin $d$, we compute the mean number of events per bin, $\mu_d = \frac{1}{4}\sum_{k=1}^{4} N_d^{(k)}$.
6. Finally, we select the binning that provides the minimum value of $q_{\rm poiss}$ from among the 5000 iterations and compute the significance using the signal and an independent background sample.
This procedure provides a good trade-off between optimization and computational cost. In Figure 8 we show the obtained exclusion significance ($Z$). For each dimension, the significance found by optimizing the bin widths is similar to the result assuming equal-sized bins (for the same number of bins). It is worth remarking that the MLL-KDE method consistently outperforms the binned multivariate analysis in terms of significance. For completeness, we have also checked that similar conclusions can be drawn for alternative likelihood assumptions (for example, assuming a Gaussian distribution for each bin).
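The random bin-width search can be sketched as follows: draw random bin edges, average the background counts over four folds, score the held-out fifth fold with a Poisson negative log-likelihood, and keep the best binning. This is an illustrative sketch under stated assumptions, not the code used for Figure 8. The function names are hypothetical, the classifier output is assumed to lie in [0, 1], $\log N!$ is computed exactly with `math.lgamma` instead of the Stirling approximation, and a small floor on $\mu_d$ is added to protect against empty bins.

```python
import bisect
import math
import random


def random_edges(n_bins, rng):
    """Draw N-1 random interior edges on [0, 1], i.e. N random bin widths."""
    interior = sorted(rng.random() for _ in range(n_bins - 1))
    return [0.0] + interior + [1.0]


def counts(values, edges):
    """Histogram values (assumed to lie in [0, 1]) with the given edges."""
    hist = [0] * (len(edges) - 1)
    for v in values:
        idx = min(bisect.bisect_right(edges, v) - 1, len(hist) - 1)
        hist[idx] += 1
    return hist


def poisson_nll(mu, observed, floor=1e-9):
    """-log L for independent Poisson bins; lgamma(n + 1) equals log(n!)."""
    total = 0.0
    for m, n in zip(mu, observed):
        m = max(m, floor)  # assumption: guard against log(0) in empty bins
        total += m - n * math.log(m) + math.lgamma(n + 1)
    return total


def search_binning(background, n_bins, n_iter=5000, n_folds=5, seed=0):
    """Return the random binning with the smallest held-out Poisson -log L."""
    rng = random.Random(seed)
    shuffled = list(background)
    rng.shuffle(shuffled)
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    train, held_out = folds[:-1], folds[-1]
    best_q, best_edges = math.inf, None
    for _ in range(n_iter):
        edges = random_edges(n_bins, rng)
        fold_counts = [counts(fold, edges) for fold in train]
        # Mean number of events per bin over the first n_folds - 1 folds.
        mu = [
            sum(c[d] for c in fold_counts) / len(train) for d in range(n_bins)
        ]
        q = poisson_nll(mu, counts(held_out, edges))
        if q < best_q:
            best_q, best_edges = q, edges
    return best_edges, best_q


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical stand-in for the classifier output on background events.
    toy_background = [rng.betavariate(2.0, 5.0) for _ in range(2000)]
    edges, q = search_binning(toy_background, n_bins=8, n_iter=200)
    print("best -log L:", q)
    print("bin edges:", [round(e, 3) for e in edges])
```

The cost grows linearly with the number of iterations and with the sample size, which is why a random search over a fixed number of trials is a pragmatic alternative to a full scan of the $(N - 1)$-dimensional space of bin widths.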

Figure 1: Results for the $N_{2}(m, \Sigma)$ case. Top left panel: Output of a single XGBoost classifier. Bottom left panel: Averaged output of 10 XGBoost classifiers, defined as $o(x) = \frac{1}{10}\sum_{i=1}^{10} o_i(x)$.

Figure 2: Exclusion-limit significance for $N_{2}(m, \Sigma)$ with $m = +0.3\,(-0.3)$ for S (B) and no correlation, for fixed $\langle B \rangle = 50$k, and different signal strengths $\langle S \rangle$. The red curves show the result of implementing the MLL+KDE method, while the blue and magenta curves represent the results obtained by applying the BL method to the classifier's one-dimensional output and the original two-dimensional space, respectively. Dashed curves use the output of a single classifier, while solid lines use the averaged output of 10 classifiers. For comparison, we include the green solid curve with the results obtained using the true PDFs.

Figure 3: Exclusion-limit significance for $N_{\rm dim}(m, \Sigma)$ with $m = +0.3\,(-0.3)$ for S (B) and no correlation, as a function of dim, for fixed $\langle B \rangle = 50$k and $\langle S \rangle = 500$. The red curves show the result of implementing the MLL+KDE method, while the blue and brown curves represent the results obtained by applying the BL method to the classifier's one-dimensional output for 10 linear and non-linear bins, respectively. Dashed curves use the output of a single classifier, while solid lines use the averaged output of 10 classifiers. For comparison, we include the green solid curve with the results obtained using the true PDFs.

Figure 4: Exclusion-limit significance for $N_{\rm dim}(m, \Sigma)$ with $m = +0.3\,(-0.3)$ for S (B) and no correlation, as a function of dim, for fixed $\langle B \rangle = 50$k and $\langle S \rangle = 500$. The red solid curve shows the result of implementing the MLL+KDE method, while the green curve shows the results obtained using the true PDFs. Dashed colored curves represent the results obtained by applying the BL method to the classifier's one-dimensional output for different bin numbers. Left panel: linear binning. Right panel: non-linear binning (same number of B events per bin).

Figure 5: Exclusion-limit significance for $N_{\rm dim}(m, \Sigma)$ with $m = +0.7\,(-0.7)$ for S (B), as a function of dim, for fixed $\langle B \rangle = 50$k and $\langle S \rangle = 500$. The red solid curve shows the result of implementing the MLL+KDE method, while the green curve shows the results obtained using the true PDFs. Dashed colored curves represent the results obtained by applying the BL method (with linear bins) to the classifier's one-dimensional output for different bin numbers. Left panel: covariance matrices $\Sigma = I_{2\times 2}$ (no correlation). Right panel: covariance matrices $\Sigma_{ij} = 1$ if $i = j$ and $0.5$ if $i \neq j$.

Figure 6: Left panel: ROC curves for the XGBoost classifiers associated with each possible data representation, trained to discriminate between $H^0$ and $t\bar{t}$ production. Right panel: Exclusion limits for the search for a heavy neutral $H^0$ with the MLL+KDE method (the red solid line corresponds to the averaged variable of 10 ML classifiers, and the dotted orange line to a single ML classifier), and with the BL fit for different numbers of linear bins (dashed curves), for fixed $\langle B \rangle = 86$k, and different signal strengths $\langle S \rangle$.

Figure 7: Exclusion limits for the $Z'_{\rm SSM}$ with the MLL+KDE method (the red solid curve corresponds to the averaged variable of 10 ML classifiers, and the dotted cyan curve to a single ML classifier), and with the BL fit of the ML output using 15 linear bins (blue curve). The shaded area in each case includes a naive estimation of the significance uncertainty caused by the mass variation at each point, according to the systematic uncertainty for the invariant mass estimated by ATLAS in [51]. Left panel: Dielectron channel. Right panel: Dimuon channel.

Figure 8: Exclusion-limit significance for $N_{\rm dim}(m, \Sigma)$ with $m = +0.3\,(-0.3)$ for S (B) and no correlation, as a function of dim, for fixed $\langle B \rangle = 50$k and $\langle S \rangle = 500$. The red solid curve shows the result of implementing the MLL+KDE method, while the green curve shows the results obtained using the true PDFs. The orange dashed curve represents the results obtained by applying the BL method to the classifier's one-dimensional output for 100 equal-width bins, and the blue dashed curve for 10 equal-width bins. The black dot-dashed curve represents the BL method applied to the classifier's one-dimensional output for $N$ equal-width bins, with $N$ determined for each dimension by FD (top left panel), Doane (top right panel), and Sturges (bottom panel); see Table 3. The gray dotted curve shows the significance using the BL method with $N$ bins, assuming non-equal bin widths.
Here $N_d^{(k)}$ denotes the number of events in each bin for the sample $k$. Notice that we used 4 k-folds.
4. For the last k-fold, $k = 5$, assuming a Poissonian distribution for each bin and the Stirling approximation for $\log(N_d!)$, we calculate $q_{\rm poiss} = -\log L = \sum_{d}^{N_{\rm bin}} \left( \mu_d - N_d^{(k=5)} \log \mu_d + \log N_d^{(k=5)}! \right)$.