1 Introduction

Modern machine learning (ML) has become a fundamental tool in experimental and phenomenological analyses of high-energy physics (for reviews see, for instance, [1,2,3,4,5,6,7,8,9], and for pioneering papers see [10,11,12]). ML algorithms can be applied not only to event-by-event collider analyses but also at the event-ensemble level [13,14,15,16,17,18,19,20]. To estimate the experimental sensitivity to potential new-physics signals at colliders, several studies have recently appeared that combine ML classifiers with traditional statistical tests [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36]. More specifically, it was shown in [21] that calibrating classifiers trained to distinguish signal and background samples under the relevant hypotheses allows a proper estimate of the likelihood ratio, which can then be used to compute a statistical significance.

Recently, a simplification of [21] has been proposed in [36], the so-called machine-learned likelihoods (MLL), which computes the expected experimental sensitivity through ML classifiers, using their entire discriminant output. A single ML classifier estimates the individual probability densities, and one can subsequently calculate the statistical significance for a given number of signal and background events (S and B, respectively) with traditional hypothesis tests. By construction, the output of the classifier is always one-dimensional, so the hypothesis test is reduced to a single parameter of interest, the signal strength \(\mu \). On the one hand, the method is simply and reliably applicable to any high-dimensional problem. On the other hand, using all the information available from the ML classifier does not require defining working points, unlike traditional cut-based analyses. The ATLAS and CMS Collaborations incorporate similar methods in their experimental analyses, but treat the classifier output only as a convenient variable to bin and fit with the binned likelihood formula (see, for instance, Refs. [37,38,39,40,41,42,43,44]).

The MLL code [45] developed in [36] only includes the calculation of the discovery hypothesis test, although the expressions needed to calculate exclusion limits were provided. In [46] we extended the MLL method by adding the exclusion hypothesis test. It is well known that unbinned methods can outperform binned ones, since the loss of information is minimized. In that spirit, in this work we improve the MLL method with kernel density estimators (KDE) [47, 48], in order to avoid binning the ML classifier output when extracting the resulting one-dimensional signal and background probability density functions (PDFs), as proposed in [36, 46]. Applying unbinned methods to the ML output space has intrinsic difficulties that are usually absent when working with physics-based features: the stochasticity of the machine-learning training introduces fluctuations, even when the classifier approaches its optimal limit. These fluctuations translate into non-smooth distribution functions which, in turn, are propagated by the KDE into the density estimate, given the plasticity of this consistent non-parametric method [49]. It is therefore necessary to analyze the impact of this lack of smoothness on the statistical analysis. We propose to tackle this issue by working with a variable built from the average of several independent machine-learning realizations, which yields smoother PDFs.

We would like to highlight that binned methods are commonly used because the binning can usually be optimized to extract nearly all of the benefits of the unbinned approach, but this optimization can be a highly non-trivial and scenario-dependent task. The incorporation of KDE within our framework allows us to bypass any binning optimization automatically and to outperform some of the most common binning schemes. For illustration, we compare the results of our unbinned MLL method with those obtained with linear and non-linear binnings in the toy examples used to validate our setup, where the true PDFs are known.

The structure of the paper is the following: Sect. 2 is devoted to summarizing the main features of the MLL method, with the relevant expressions for the calculation of exclusion limits and the implementation of KDE within it. In Sect. 3 we show the performance of the MLL method with KDE and analyze the application of this unbinned method to the ML output space in different examples: in Sect. 3.1, a case where the true probability density functions (PDFs) are known, through a toy model generated with multivariate Gaussian distributions; in Sect. 3.2 we present an LHC analysis of the search for new heavy neutral Higgs bosons at \(\sqrt{s}\) = 8 TeV and a luminosity of 20 fb\(^{-1}\), estimating exclusion limits and also comparing our results with those reported in [12]; and in Sect. 3.3 we present an HL-LHC study for Sequential Standard Model (SSM) [50] \(Z^\prime \) bosons decaying into lepton pairs, comparing the MLL+KDE performance for estimating 95% CL exclusion limits with the results obtained by applying a binned likelihood to the machine-learning classifier output, and also with the projections reported by the ATLAS Collaboration for an LHC center-of-mass energy of \(\sqrt{s} =\) 14 TeV with a total integrated luminosity of \(\mathcal{L} =\) 3 ab\(^{-1}\) [51]. Finally, Sect. 4 summarizes our most important results and conclusions.

2 Method

In this section, we present the corresponding formulae for the estimation of exclusion sensitivities with the MLL method, first introduced in [36, 46]. We summarize the main features of the method which allows dealing with data of arbitrarily high dimension through a simple ML classifier while using the traditional inference tests to compare a null hypothesis (the signal-plus-background one) against an alternative one (the background-only one). We also present the details of the implementation of KDE to obtain the unbinned posterior probability distributions from the classifier output, needed to compute the corresponding likelihood functions.

Following the statistical model in [52], we can define the likelihood \(\mathcal {L}\) of N independent measurements with an arbitrarily high-dimensional set of observables x as

$$\begin{aligned} \mathcal {L}(\mu ,s,b) = \text {Poiss}\big (N|\mu S + B\big )\,\prod _{i=1}^{N}p(x_{i}|\mu ,s,b) , \end{aligned}$$
(1)

where S (B) is the expected total signal (background) yield, Poiss stands for a Poisson probability mass function, and \(p(x|\mu ,s,b)\) is the probability density for a single measurement x, with \(\mu \) defining the hypothesis we are testing.

We can model the probability density containing the event-by-event information as a mixture of signal and background densities

$$\begin{aligned} p(x|\mu ,s,b) = \frac{B}{\mu S + B}\,p_{b}(x)+\frac{\mu S}{\mu S + B}\,p_{s}(x), \end{aligned}$$
(2)

where \(p_{s}(x)=p(x|s)\) and \(p_{b}(x)=p(x|b)\) are, respectively, the signal and background probability density functions (PDFs) for a single measurement x, and \(\frac{\mu S}{\mu S + B}\) and \(\frac{B}{\mu S + B}\) are the probabilities of an event being sampled from the corresponding probability distributions.

To derive upper limits on \(\mu \), in particular for additive new-physics scenarios (\(\mu \ge 0\)), we employ the following test statistic for exclusion limits [53]:

$$\begin{aligned} \tilde{q}_{\mu } = {\left\{ \begin{array}{ll} 0 &{} \quad \text {if } \hat{\mu } > \mu ,\\ -2\text { Ln }\frac{\mathcal {L}(\mu ,s,b)}{\mathcal {L}(\hat{\mu },s,b)} &{} \quad \text {if } 0 \le \hat{\mu } \le \mu ,\\ -2\text { Ln }\frac{\mathcal {L}(\mu ,s,b)}{\mathcal {L}(0,s,b)} &{} \quad \text {if } \hat{\mu } < 0, \end{array}\right. } \end{aligned}$$
(3)

where \(\hat{\mu }\) is the value that maximizes the likelihood in Eq. (1), i.e., it satisfies

$$\begin{aligned} \sum _{i=1}^{N}\frac{p_{s}(x_{i})}{\hat{\mu }S\, p_{s}(x_{i}) +B\, p_{b}(x_{i})} = 1 . \end{aligned}$$
(4)

Considering our choice for the statistical model in Eq. (1), \(\tilde{q}_{\mu }\) turns out to be

$$\begin{aligned} \tilde{q}_{\mu } = {\left\{ \begin{array}{ll} 0 &{} \quad \text {if } \hat{\mu } > \mu ,\\ 2(\mu -\hat{\mu }) S - 2 \sum _{i=1}^{N} \text { Ln } \left( \frac{B\, p_b(x_i)+\mu S\, p_s(x_i)}{B\, p_b(x_i)+\hat{\mu } S\, p_s(x_i)}\right) &{} \quad \text {if } 0 \le \hat{\mu } \le \mu ,\\ 2\mu S - 2 \sum _{i=1}^{N} \text { Ln } \left( 1 + \frac{\mu S\, p_s(x_i)}{B\, p_b(x_i)}\right) &{} \quad \text {if } \hat{\mu } < 0. \end{array}\right. } \end{aligned}$$
(5)

Since \(p_{s,b}(x)\) are typically not known, the basic idea of our method in [36] is to replace these densities with the one-dimensional counterparts that can be obtained for signal and background from a machine-learning classifier. After training the classifier with a large and balanced data set of signal and background events, one obtains the classification score o(x) that minimizes the binary cross-entropy (BCE) and thus approaches [21, 54]

$$\begin{aligned} o(x) = \frac{p_{s}(x)}{p_{s}(x)+p_{b}(x)}, \end{aligned}$$
(6)

as the classifier approaches its optimal performance. The dimensionality reduction can be done by dealing with o(x) instead of x, using

$$\begin{aligned} p_{s}(x) \rightarrow \tilde{p}_{s}(o(x)), \quad \text {and}\quad p_{b}(x) \rightarrow \tilde{p}_{b}(o(x)), \end{aligned}$$
(7)

where \(\tilde{p}_{s,b}(o(x))\) are the distributions of o(x) for signal and background, obtained by evaluating the classifier on a set of pure signal or background events, respectively. Notice that this allows us to approximate both signal and background distributions individually, retaining the full information contained in both densities, without introducing any working point. These distributions are one-dimensional, and therefore can always be easily handled and incorporated into the test statistic of Eq. (5), which becomes

$$\begin{aligned} \tilde{q}_{\mu } = {\left\{ \begin{array}{ll} 0 &{} \quad \text {if } \hat{\mu } > \mu ,\\ 2(\mu -\hat{\mu }) S - 2 \sum _{i=1}^{N} \text { Ln } \left( \frac{B\, \tilde{p}_{b}(o(x_{i}))+\mu S\, \tilde{p}_{s}(o(x_{i}))}{B\, \tilde{p}_{b}(o(x_{i}))+\hat{\mu } S\, \tilde{p}_{s}(o(x_{i}))}\right) &{} \quad \text {if } 0 \le \hat{\mu } \le \mu ,\\ 2\mu S - 2 \sum _{i=1}^{N} \text { Ln } \left( 1 + \frac{\mu S\, \tilde{p}_{s}(o(x_{i}))}{B\, \tilde{p}_{b}(o(x_{i}))}\right) &{} \quad \text {if } \hat{\mu } < 0, \end{array}\right. } \end{aligned}$$
(8)

as well as into the condition on \(\hat{\mu }\) from Eq. (4)

$$\begin{aligned} \sum _{i=1}^{N}\frac{\tilde{p}_{s}(o(x_{i}))}{\hat{\mu }S\, \tilde{p}_{s}(o(x_{i})) +B\, \tilde{p}_{b}(o(x_{i}))} = 1 . \end{aligned}$$
(9)
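
To make the procedure concrete, the following minimal sketch (not part of the MLL code [45]; the function and variable names are ours) solves Eq. (9) for \(\hat{\mu }\) by root finding and then evaluates the test statistic of Eq. (8), assuming the per-event densities \(\tilde{p}_{s,b}(o(x_i))\) are already available as NumPy arrays:

```python
import numpy as np
from scipy.optimize import brentq

def mu_hat(ps, pb, S, B):
    """Solve Eq. (9) for mu-hat: sum_i ps_i / (mu*S*ps_i + B*pb_i) = 1."""
    def condition(mu):
        return np.sum(ps / (mu * S * ps + B * pb)) - 1.0
    # The left-hand side decreases monotonically in mu; start just above the
    # nearest pole on the negative side and expand the upper bracket as needed.
    lo = -0.99 * (B / S) * float(np.min(pb / ps))
    hi = 10.0
    while condition(hi) > 0 and hi < 1e6:
        hi *= 10.0
    return brentq(condition, lo, hi)

def q_mu_tilde(mu, ps, pb, S, B):
    """Test statistic of Eq. (8) for one pseudo-experiment (arrays ps, pb)."""
    mh = mu_hat(ps, pb, S, B)
    if mh > mu:
        return 0.0
    if mh < 0.0:
        return 2.0 * mu * S - 2.0 * np.sum(np.log1p(mu * S * ps / (B * pb)))
    ratio = (B * pb + mu * S * ps) / (B * pb + mh * S * ps)
    return 2.0 * (mu - mh) * S - 2.0 * np.sum(np.log(ratio))
```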

The test statistic in Eq. (8) is estimated through a finite data set of N events and thus has a probability distribution conditioned on the true unknown signal strength \(\mu '\). For a given hypothesis described by the \(\mu '\) value, we can estimate numerically the \(\tilde{q}_{\mu }\) distribution. When the true hypothesis is assumed to be the background-only one (\(\mu '=0\)), the median expected exclusion significance \(\text {med }[Z_{\mu }| 0]\) is defined as

$$\begin{aligned} \text {med }\left[ Z_{\mu }|0\right] = \sqrt{\text {med }\left[ \tilde{q}_{\mu }|0\right] }, \end{aligned}$$
(10)

where we estimate the \(\tilde{q}_{\mu }\) distribution by generating a set of pseudo-experiments with background-only events. Then, to set upper limits at a given confidence level, we select the lowest \(\mu \) that achieves the required median expected significance.
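
As an illustration of this procedure, the sketch below (our own naming, building on q_mu_tilde from the previous snippet) generates background-only pseudo-experiments to estimate \(\text {med}[Z_{\mu }|0]\) of Eq. (10) and scans the signal yield for a 95% CL upper limit. Here sample_bkg_scores is an assumed helper that draws classifier scores from the background sample, and ps_of_o, pb_of_o are the density estimates discussed in the following:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def median_exclusion_Z(mu, S, B, sample_bkg_scores, ps_of_o, pb_of_o, n_toys=200):
    """Eq. (10): median of q_mu over background-only (mu' = 0) pseudo-experiments."""
    q_values = []
    for _ in range(n_toys):
        n_events = rng.poisson(B)           # Poisson-fluctuated background yield
        o = sample_bkg_scores(n_events)     # classifier scores of the toy events
        q_values.append(q_mu_tilde(mu, ps_of_o(o), pb_of_o(o), S, B))
    return float(np.sqrt(np.median(q_values)))

def signal_upper_limit(S_grid, B, sample_bkg_scores, ps_of_o, pb_of_o, target=1.64):
    """Smallest signal yield on the grid excluded at 95% CL (mu = 1 tested)."""
    for S in np.sort(S_grid):
        if median_exclusion_Z(1.0, S, B, sample_bkg_scores, ps_of_o, pb_of_o) >= target:
            return S
    return None
```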

It is worth remarking that the output of the machine-learning classifier, for a given set of events, gives us a sample of the desired PDFs \(\tilde{p}_{s,b}(o(x))\). Hence, to apply Eq. (8) we first need to extract the classifier posteriors. As these samples are one-dimensional, we can always compute binned PDFs, as was done in [36]. Binning the output variable is a typical procedure when using ML tools. Nevertheless, it is also possible to compute the PDFs through other parametric methods (such as mixture models [55]) or non-parametric methods (such as kernel density estimation (KDE) [47, 48] or neural density estimation [49]). In comparison with other density-estimation methods, KDE has the advantage of not assuming any functional form for the PDF, in contrast with Gaussian mixture methods, while keeping the computation and the interpretation simple, as opposed to neural density estimation methods. For this reason, in this work we made extensive use of the KDE method, through its scikit-learn implementation [57].

Given a set of N pure signal (background) events evaluated with the machine-learning classifier, the PDF estimated by the KDE method is defined as

$$\begin{aligned} \tilde{p}_{s,b}(o(x)) = \frac{1}{N}\sum _{i=1}^{N}\kappa _{\epsilon } \left[ o(x) - o(x_{i}) \right] , \end{aligned}$$
(11)

where \(\kappa _{\epsilon }\) is a kernel function that depends on the “smoothing” scale, or bandwidth parameter \(\epsilon \). There are several different options for the kernel function. In this work, we used the Epanechnikov kernel [58] as it is known to be the most efficient kernel [49]. This kernel is defined as

$$\begin{aligned} \kappa _{\epsilon }(u)= {\left\{ \begin{array}{ll} \frac{1}{\epsilon }\frac{3}{4}\left( 1 - (u/\epsilon )^{2} \right) ,&{} \text {if } |u|\le \epsilon \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

It is important to remark that the bandwidth parameter \(\epsilon \) controls the degree of smoothness: a very low \(\epsilon \) will overfit the data, whereas a very high \(\epsilon \) will underfit it. In all our examples, \(\epsilon \) was selected through a grid search performed with the GridSearchCV function of the sklearn.model_selection python package. For each candidate value of \(\epsilon \), this function estimates the log-likelihood of the data using a 5-fold cross-validation strategy, i.e., the data set is split into 5 smaller sets, 4 of which are used to fit the KDE, which is then validated on the remaining one. Finally, the function returns the \(\epsilon \) that maximizes the data likelihood. It is also worth remarking that, although the KDE method suffers from the curse of dimensionality, we apply it to the one-dimensional output of the machine-learning classifier, which avoids this problem.
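
A minimal version of this bandwidth tuning, using the scikit-learn classes mentioned above (the search range for \(\epsilon \) below is purely illustrative), could read:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

def fit_kde(scores, bandwidths=np.logspace(-3, 0, 30)):
    """Fit an Epanechnikov KDE to 1D classifier scores, selecting epsilon by 5-fold CV."""
    grid = GridSearchCV(
        KernelDensity(kernel="epanechnikov"),
        {"bandwidth": bandwidths},
        cv=5,  # maximizes the held-out data log-likelihood (KernelDensity.score)
    )
    grid.fit(np.asarray(scores).reshape(-1, 1))
    return grid.best_estimator_

# Example: wrap the fitted KDEs as the densities used in Eqs. (8)-(9),
# with o_s, o_b the classifier scores of pure signal / background samples.
# kde_s, kde_b = fit_kde(o_s), fit_kde(o_b)
# ps_of_o = lambda o: np.exp(kde_s.score_samples(np.asarray(o).reshape(-1, 1)))
# pb_of_o = lambda o: np.exp(kde_b.score_samples(np.asarray(o).reshape(-1, 1)))
```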

Notice that the machine-learning training (and hence the machine-learning predictions) is a stochastic process that introduces small fluctuations around the optimal limit. These, in turn, can translate into non-smooth PDFs. To tackle this issue, the same procedure described above can be applied to an ensemble of base classifiers trained on random subsets of the original data set, whose individual predictions are averaged to form a final prediction. In this case, o(x) is simply replaced by \(\langle o(x)\rangle = \frac{1}{n}\sum _{i=1}^{n}o_{i}(x)\), with n the number of classifiers in the ensemble, which in turn gives smoother PDFs \(\tilde{p}_{s,b}(\langle o(x)\rangle )\).
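
A possible realization of this averaging, sketched here with XGBoost (the hyperparameters and the subset size are illustrative and not prescribed by the method), is:

```python
import numpy as np
from xgboost import XGBClassifier

def averaged_score(X_train, y_train, X_eval, n_models=10, seed=0):
    """<o(x)>: mean signal probability of n_models classifiers, each trained
    on a random subset of the original balanced training set (NumPy arrays)."""
    rng = np.random.default_rng(seed)
    total = np.zeros(len(X_eval))
    for i in range(n_models):
        idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
        clf = XGBClassifier(n_estimators=200, max_depth=4, random_state=i)
        clf.fit(X_train[idx], y_train[idx])
        total += clf.predict_proba(X_eval)[:, 1]   # column 1 = signal class
    return total / n_models
```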

For completeness, we also introduce here the median exclusion significance for the traditional binned-likelihood (BL) method with Asimov data sets [53], against which we will compare our technique,

$$\begin{aligned} \text {med}\left[ Z_\mu |0\right] = \sqrt{\tilde{q}_{\mu }} = \left[ 2\sum _{d=1}^{D}\left( B_d\text { Ln }\left( \frac{B_d}{S_d+B_d}\right) +S_d\right) \right] ^{1/2}, \end{aligned}$$
(12)

where \(S_{d}\) and \(B_{d}\) are the expected numbers of signal and background events in each bin d. This approximation is very effective but runs into trouble when the dimension of the data grows: the number of data points required to reliably populate the bins scales exponentially with the dimension of the observables x, which is known as the curse of dimensionality. This problem is absent in our method, which always reduces the original dimension to one, as stated in Eq. (7), allowing the application of the BL method to the classifier output, as also done by experimental collaborations when using ML methods, as mentioned in Sect. 1.
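
For reference, Eq. (12) amounts to the following one-liner over the per-bin yields (a sketch with our own function name):

```python
import numpy as np

def binned_exclusion_Z(S_d, B_d):
    """Median exclusion significance of Eq. (12) from per-bin S and B yields."""
    S_d, B_d = np.asarray(S_d, float), np.asarray(B_d, float)
    return float(np.sqrt(2.0 * np.sum(B_d * np.log(B_d / (S_d + B_d)) + S_d)))
```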

3 Application examples

3.1 Known true PDFs: multivariate Gaussian distributions

To show the performance of the MLL method with KDE we first analyze toy models generated with multivariate Gaussian distributions of different dimensions,

$$\begin{aligned} \mathcal {N}_{dim}({\varvec{m}},\varvec{\Sigma }) = \frac{1}{(2\pi )^{n/2} |\varvec{\Sigma }|^{1/2}} \exp \left( -\frac{1}{2}(x-{\varvec{m}})^{T} \varvec{\Sigma }^{-1} (x-{\varvec{m}}) \right) , \end{aligned}$$
(13)

with mean \({\varvec{m}}\), and covariance matrix \(\varvec{\Sigma }\).
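
The toy data can be generated, and the optimal classifier output of Eq. (6) computed in closed form, along the following lines (a sketch; the sample size and random seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def make_gaussian_toy(dim=2, n_per_class=1_000_000, shift=0.3, rho=0.0, seed=0):
    """Signal centered at +shift, background at -shift, common covariance."""
    rng = np.random.default_rng(seed)
    cov = np.full((dim, dim), rho) + (1.0 - rho) * np.eye(dim)  # 1 on diagonal, rho off-diagonal
    x_s = rng.multivariate_normal(+shift * np.ones(dim), cov, size=n_per_class)
    x_b = rng.multivariate_normal(-shift * np.ones(dim), cov, size=n_per_class)
    pdf_s = multivariate_normal(+shift * np.ones(dim), cov).pdf
    pdf_b = multivariate_normal(-shift * np.ones(dim), cov).pdf
    o_true = lambda x: pdf_s(x) / (pdf_s(x) + pdf_b(x))   # optimal output, Eq. (6)
    return x_s, x_b, o_true
```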

Fig. 1

Results for the \(\mathcal {N}_{2}({\varvec{m}},\varvec{\Sigma })\) case. Top left panel: output of a single XGBoost classifier. Bottom left panel: averaged output of 10 XGBoost classifiers, defined as \(\langle o(x)\rangle = \frac{1}{10}\sum _{i=1}^{10}o_{i}(x)\). Right panel: comparison between our trained classifier output and the mathematically optimal performance defined in Eq. (6)

We start with the simplest case, consisting of an abstract two-dimensional space \((x_1,x_2)\). Events are generated with Gaussian distributions \(\mathcal {N}_{2}({\varvec{m}},\varvec{\Sigma })\), with \({\varvec{m}} = +0.3\, (-0.3)\) for S (B) and no correlation, i.e., covariance matrices \(\varvec{\Sigma } = \mathbb {I}_{2\times 2}\). We trained supervised per-event XGBoost classifiers with 1 M events per class (a balanced data set) to distinguish S from B. The PDFs obtained from the classifier output, o(x), can be found in the top left panel of Fig. 1, for two new independent data sets of pure signal (blue) and pure background (red) events.

Since in this example we know the true underlying distributions in the original multidimensional space, we can test Eq. (6). In the right panel of Fig. 1 we show, as green dots, the output of one machine-learning realization vs. the right-hand side of Eq. (6) estimated with the true signal and background probability functions. We can observe that the classifier approaches the optimal limit, although there are some small fluctuations around the 1-to-1 line. These fluctuations are independent of the sampling of the data and come from the stochasticity inherent to any machine-learning training process. In turn, they translate into non-smooth PDFs for the machine-learning output of background and signal events, as can be seen in the red and blue shaded histograms in the top left panel of Fig. 1.

As explained before, to solve this issue we can take advantage of ensembles and build a variable from the average output of ten independent machine-learning realizations, defined as \(\langle o(x)\rangle = \frac{1}{10}\sum _{i=1}^{10}o_{i}(x)\). It can be seen in the red and blue shaded histograms of the bottom left panel of Fig. 1 that, with this definition, the small fluctuations are washed out, resulting in smoother PDFs. For completeness, on both left panels we also present the estimations of \(\tilde{p}_{s,b}(o(x))\) using the true PDFs (orange and purple solid lines), the KDE over the machine-learning output o(x) (red and blue dashed curves of the top left panel), and the KDE over the averaged variable \(\langle o(x)\rangle \) (red and blue dashed lines of the bottom left panel). On the one hand, due to the flexibility of the KDE method, when fitting the machine-learning output o(x) the resulting distributions follow the fluctuations around the true PDFs. On the other hand, the KDE distributions obtained when fitting the averaged variable are smooth and closely approach the true PDFs.

Fig. 2

Exclusion-limit significance for \(\mathcal {N}_{2}({\varvec{m}},\varvec{\Sigma })\) with \( {\varvec{m}}= +0.3 (-0.3)\) for S (B) and no correlation, for fixed \(\langle B \rangle =50\) k, and different signal strengths \(\langle S \rangle \). The red curves show the result of implementing the MLL+KDE method, while the blue and magenta curves represent the results obtained by applying the BL method to the classifier’s one-dimensional output and the original two-dimensional space, respectively. Dashed curves use the output of a single classifier, while solid lines use the averaged output of 10 classifiers. For comparison, we include the green solid curve with the results obtained using the true PDFs

In Fig. 2 we show the results for the MLL exclusion significance with KDE, considering an example with a fixed background of \(\langle B \rangle =50\) k and different signal strengths. We also include the significance calculated using the true probability density functions in Eq. (3), and the results employing a binned Poisson log-likelihood of the original two-dimensional space \((x_1,x_2)\) with Eq. (12), which is possible to compute in this simple scenario. For completeness, we also include the results obtained by binning the one-dimensional ML output variable to extract the PDFs, as in [36, 46]. As can be seen, since we are analyzing a simple example, the significances estimated with all the methods are indistinguishable from the ones estimated with the true PDFs, which is expected given the low dimensionality of the space.

We would like to highlight that the significance is essentially unchanged whether we employ o(x) computed with a single ML classifier or the averaged variable \(\langle o(x)\rangle \) calculated by ensembling several ML trainings. In addition, for both the MLL+KDE and the true-PDF methods, the significance is estimated by generating a set of pseudo-experiments with a finite number of events. This introduces a small statistical fluctuation due to the randomness of the sample.

Fig. 3

Exclusion-limit significance for \(\mathcal {N}_{dim}({\varvec{m}},\varvec{\Sigma })\) with \( {\varvec{m}}= +0.3 (-0.3)\) for S (B) and no correlation, as a function of the dim, for fixed \(\langle B \rangle =50\) k and \(\langle S \rangle =500\). The red curves show the result of implementing the MLL+KDE method, while the blue and brown curves represent the results obtained by applying the BL method to the classifier's one-dimensional output for 10 linear and non-linear bins, respectively. Dashed curves use the output of a single classifier, while solid lines use the averaged output of 10 classifiers. For comparison, we include the green solid curve with the results obtained using the true PDFs

The advantage of the MLL+KDE method over traditional approaches appears when dealing with \(dim=n\), with \(n>2\). In Fig. 3 we present the exclusion significance for higher-dimensional data generated with \(\mathcal {N}_{n}({\varvec{m}},\varvec{\Sigma })\), no correlation, \(\varvec{\Sigma } = \mathbb {I}_{n\times n}\), and \({\varvec{m}} = +0.3\, (-0.3)\) for S (B).

It is worth recalling that the binned Poisson likelihood method becomes intractable in the original high-dimensional space. It is also interesting to note that the results with the MLL+KDE method approach the ones obtained with the true generative functions for all the analyzed dimensions. It is important to highlight that the ML output is always one-dimensional, regardless of the dimension of the input data, and hence can always be easily binned. For completeness, we show in Fig. 3 the significances obtained by applying a BL method to the machine-learning output with two different types of binning: a linear binning where all bins have the same size (in the one-dimensional output space), and a standard non-linear approach where all bins contain the same number of background events (a binning strategy typically used by experimental collaborations, since it avoids the presence of low-statistics bins in the background estimation, which in turn constrains systematic uncertainties). As can be seen, binning the output of the machine learning results in a non-negligible drop in significance. This can be understood since the binning introduces a loss of information through a resolution effect. For this example, the linear binning turns out to be more effective for the BL method. In addition, and as in the \(n=2\) example, for MLL+KDE the use of an ensemble of machine-learning realizations to obtain smoother PDFs does not change the results obtained with one single classifier. The same is verified when using the BL method over the averaged variable, although this behavior is expected since this method builds histograms from the distributions.
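
The two binning schemes compared here can be constructed from the background scores alone, for instance as in the following sketch (our own helper; the per-bin yields then feed Eq. (12)):

```python
import numpy as np

def output_bin_edges(o_background, n_bins=10, scheme="linear"):
    """Edges in the 1D classifier-output space: equal width ('linear') or
    equal expected background yield per bin ('nonlinear', quantile-based)."""
    if scheme == "linear":
        return np.linspace(o_background.min(), o_background.max(), n_bins + 1)
    return np.quantile(o_background, np.linspace(0.0, 1.0, n_bins + 1))

# Per-bin yields from classifier scores of pure signal (o_s) and background (o_b) samples:
# edges = output_bin_edges(o_b, n_bins=10, scheme="nonlinear")
# B_d = B * np.histogram(o_b, bins=edges)[0] / len(o_b)
# S_d = S * np.histogram(o_s, bins=edges)[0] / len(o_s)
```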

Fig. 4

Exclusion-limit significance for \(\mathcal {N}_{dim}({\varvec{m}},\varvec{\Sigma })\) with \( {\varvec{m}}= +0.3 (-0.3)\) for S (B) and no correlation, as a function of the dim, for fixed \(\langle B \rangle =50\) k and \(\langle S \rangle =500\). The red solid curve shows the result of implementing the MLL+KDE method, while the green curve shows the results obtained using the true PDFs. Dashed color curves represent the results obtained by applying the BL method to the classifier’s one-dimensional output for different bin numbers. Left panel: linear binning. Right panel: non-linear binning (same number of B events per bin)

In the left and right panels of Fig. 4 we show, for the previous example, the impact of increasing the number of bins when applying the BL method to the classifier output, for linear and non-linear bins, respectively. As stated before, linear binning proves to be the better sampling choice, since its results approach the ones obtained with the MLL+KDE method and with the true PDFs as the number of bins increases. Regarding the bins with the same number of background events, even though the performance improves with more bins, the results are worse than their linear-binning counterparts. This example shows the difficulty of finding an optimal binning, which is not known a priori, and highlights the advantage of using MLL+KDE, which, although computationally expensive (when tuning the bandwidth parameter), sets an upper limit on the significance that can be achieved. It is also possible to automatically choose the optimal number of bins for the histograms, as well as to tune the width of each bin, in a similar fashion as done for the \(\epsilon \) parameter in the KDE method. The results of this analysis can be found in Appendix A, where we show that the significances obtained by optimizing the bin widths are similar to the ones assuming equal-sized bins, and hence the MLL+KDE method still offers the best significances when compared to different binned multivariate approaches.

Fig. 5

Exclusion-limit significance for \(\mathcal {N}_{dim}({\varvec{m}},\varvec{\Sigma })\) with \( {\varvec{m}}= +0.7 (-0.7)\) for S (B), as a function of the dim, for fixed \(\langle B \rangle =50\) k and \(\langle S \rangle =500\). The red solid curve shows the result of implementing the MLL+KDE method, while the green curve shows the results obtained using the true PDFs. Dashed color curves represent the results obtained by applying the BL method (with linear bins) to the classifier one-dimensional output for different bin numbers. Left panel: covariance matrices \(\varvec{\Sigma } = \mathbb {I}_{2\times 2}\) (no correlation). Right panel: covariance matrices \(\varvec{\Sigma }_{ij}=1\) if \(i=j\) and 0.5 if \(i\ne j\)

Fig. 6

Left panel: ROC curves for the XGBoost classifiers associated with each possible data representation, trained to discriminate between \(H^{0}\) and \(t\bar{t}\) production. Right panel: exclusion limits for the search for a heavy neutral \(H^{0}\) with the MLL+KDE method (the red solid line corresponds to the averaged variable of 10 ML, and the dotted orange line to 1 ML), and with the BL fit for different numbers of linear bins (dashed curves), for fixed \(\langle B \rangle =86\) k and different signal strengths \(\langle S \rangle \)

Finally, in the right panel of Fig. 5 we show a case with correlation, \(\mathcal {N}_{n}({\varvec{m}},\varvec{\Sigma })\), with \({\varvec{m}} = +0.7\, (-0.7)\) for S (B), and \(\varvec{\Sigma }_{ij}=1\) if \(i=j\) and 0.5 if \(i\ne j\). Compared with the same example without correlation, shown in the left panel of Fig. 5, the correlation makes the signal and background harder to distinguish; hence we obtain lower significance values, with MLL+KDE still offering the best performance.

Although these are toy models, they allow us to understand the performance of the MLL+KDE method on problems of different complexity and demonstrate its improvement with respect to the BL method applied to the classifier output. In particular, MLL+KDE shows a stable behavior when increasing the dimensionality of the input space, as well as when increasing the separation between the signal and background distributions in the original abstract variables. On the other hand, the BL method applied to the classifier output departs from the results obtained with the true PDFs as the number of dimensions and the separation of the signal and background samples increase. The number of bins to use is another limitation, which is absent in our method, since it relies on a non-parametric technique for PDF extraction. We also verified that, although the KDE method is sensitive to the fluctuations inherent to the machine-learning classifier output, the lack of smoothness of the extracted PDFs does not affect the estimation of the significance within our framework.

3.2 New exotic Higgs bosons at the LHC

In this section, we apply our method to the search for an exotic electrically-neutral heavy Higgs boson (\(H^{0}\)) at the LHC, which subsequently decays to a W boson and a heavy electrically-charged Higgs boson (\(H^{\pm }\)). This example was first analyzed with machine-learning methods in Ref. [12]. The exotic \(H^{\pm }\) decays to another W boson and the SM Higgs boson (h). Taking into account only the dominant decay of the Higgs boson, the signal process is defined as

$$\begin{aligned} gg\rightarrow H^{0}\rightarrow W^{\mp }H^{\pm }\rightarrow W^{\mp }W^{\pm }h\rightarrow W^{\mp }W^{\pm }b\bar{b}. \end{aligned}$$
(14)

The background is therefore dominated by top-pair production, which also gives \(W^{\mp }W^{\pm }b\bar{b}\).

For our analysis, we use the same data presented in [12], publicly available at [59], which focuses on the semi-leptonic decay mode of both background and signal events (one W boson decaying leptonically and the other one decaying hadronically), giving the final state \(\ell \nu jj b\bar{b}\). The data set consists of low-level variables (twenty-one in total, comprising the momentum of each visible particle, the b-tagging information of all jets, and the reconstructed missing energy) and seven high-level variables (\(m_{jj}\), \(m_{jjj}\), \(m_{\ell \nu }\), \(m_{j\ell \nu }\), \(m_{b\bar{b}}\), \(m_{Wb\bar{b}}\) and \(m_{WWb\bar{b}}\)), expected to have higher discrimination power between signal and background (see [12] for more details). The signal benchmark case corresponds to \(m_{H^{0}}=425\) GeV and \(m_{H^{\pm }}=325\) GeV.

For this example, we have trained three XGBoost classifiers with three different data representations: only low-level variables, only high-level variables, and the combination of both low- and high-level features. For completeness, we also add the result obtained with the averaged variable built from an ensemble of 10 ML classifiers using all the input variables. In the left panel of Fig. 6 we show the ROC curves for this analysis; as expected, the best performance is achieved using both low- and high-level features (for both the averaged and non-averaged variables). These results are in agreement with the analysis performed in [12], obtained with different ML algorithms. In the following, we work with the latter data representation to estimate the expected significance of the heavy-Higgs search.

To compute the expected background yield at the ATLAS detector at \(\sqrt{s}\) = 8 TeV and a luminosity of 20 fb\(^{-1}\), \(B\simeq 86\) k, we simulated background events with MadGraph5_aMC@NLO 2.6 [60], using PYTHIA 8 [61] for showering and hadronization, and Delphes 3 [62] for fast detector simulation. We applied all the selection cuts of [12] and checked that the kinematic distributions from our simulation are in agreement with the ones from the public data set. With the expected background prediction at hand, we scan over the expected signal yield, S, in order to remain agnostic about the coupling values of the model.

The exclusion significance for different \(\frac{S}{\sqrt{B}}\) ratios is shown in the right panel of Fig. 6. The MLL+KDE results show no significant differences between the two variables and are displayed as a red solid curve for the averaged variable of 10 ML and as a dotted orange curve for 1 ML. We also present, as dashed curves, the significance obtained by binning the one-dimensional ML output for different numbers of bins. We would like to remark that binning the original feature space is not possible due to its high dimensionality (twenty-one low-level and seven high-level variables). Also for this collider example, the MLL+KDE method outperforms the results obtained with the binned-likelihood procedure.

Table 1 Expected cross-section upper limit at 95\(\%\) C.L. for the ATLAS detector, \(\sqrt{s}\) = 8 TeV and a luminosity of 20 fb\(^{-1}\) (\(B=86\) k). Background process and cuts as discussed in the main text

Since no excess has been found, we can compute the expected cross-section upper limit at 95\(\%\) C.L. for the new exotic Higgs-boson search, which corresponds to the value of \(\frac{S}{\sqrt{B}}\) that gives \(Z=1.64\). The results are presented in Table 1.

Table 2 Discovery significances assuming \(B=1000\) and \(S=100\). For comparison, we also include the results for the same case shown in [12] using a shallow neural network (NN) and a deep neural network (DN)

For completeness, and to compare with the results of [12], we show in Table 2 the discovery significance for the MLL+KDE and BL methods. Notice that for this calculation we artificially set \(B=1000\) and \(S=100\) to directly compare our results with the ones in [12]. The significant improvement in this case is due to the use of the full ML output in both the MLL+KDE and BL methods, while in Ref. [12] only a fraction of o(x) is used to define a signal-enriched region with a working point.

3.3 SSM \(Z^{\prime }\) boson decaying into lepton pairs at the HL-LHC

In this section, we analyze the performance of our method on a simple collider example, namely the search for an SSM \(Z^{\prime }\) boson decaying into lepton pairs at the HL-LHC. We generated sample events for signal and background with MadGraph5_aMC@NLO 2.6 [60], the showering and clustering were performed with PYTHIA 8 [61], and the detector simulation was done with Delphes 3 [62]. For the SM background, we considered the Drell-Yan production of \(Z/\gamma ^{*}\) \(\rightarrow \) \(\ell \ell \), with \(\ell \) = e, \(\mu \), as in [51]. As in the previous examples, we trained an XGBoost classifier, with 1 M events per class, to distinguish S from B, for each \(Z^{\prime }\) mass value, \(m_{Z^{\prime }}\) = [2.5, 3.5, 4.5, 5, 5.5, 6.5, 7.5, 8.5] TeV, and final state (dielectron and dimuon). We use as input features the transverse momentum \(|p_T|\), the azimuthal angle \(\phi \), and the pseudorapidity \(\eta \) of the final-state leptons in each channel, i.e., the kinematic variables that can be extracted directly from the Delphes 3 output file. Given the expected background prediction, for each parameter point we scan over S to obtain the expected signal-yield upper limit at 95\(\%\) C.L., corresponding to the value that gives \(Z=1.64\). Finally, we convert this yield into a cross-section upper limit that can be compared with the theoretical prediction.

We employ the same setup and detector-level cuts as in the ATLAS Collaboration study at \(\sqrt{s}=14\) TeV and 3 ab\(^{-1}\) [51], but we only generated signal and background events with dielectron and dimuon invariant masses above 1.8 TeV; since we are dealing with a signal-enriched region and not the entire spectrum, the direct comparison with the ATLAS projections for 95% CL exclusion limits is not strictly fair. This restriction may enhance the performance of our classifier, since a \(Z^{\prime }\) signal would appear as an excess at high dilepton invariant masses. However, the power of our method is shown in the left (right) panel of Fig. 7 for the dielectron (dimuon) channel when compared to the BL fit of the ML classifier output, which is on an equal footing with our method since it uses the same ML classifier. Unbinning the signal and background posteriors provides more constraining exclusion limits for both final states and, as in the previous examples, there is no significant difference between MLL+KDE using the output of 1 ML or the averaged 10 ML.

Fig. 7

Exclusion limits for the \(Z'_{SSM}\) with MLL+KDE method (red solid curve corresponds to the averaged variable of 10 ML, and dotted cyan curve to 1 ML), and with the BL fit of the ML output using 15 linear bins (blue curve). The shaded area in each case includes a naive estimation of the significance uncertainty caused by the mass variation on each point, according to the systematic uncertainty for the invariant mass estimated by ATLAS in [51]. Left panel: dielectron channel. Right panel: dimuon channel

4 Conclusions

The Machine-Learned Likelihoods (MLL) method can be used to obtain discovery significances and exclusion limits for additive new-physics scenarios. It uses a single classifier and its full one-dimensional output, which allows the estimation of the signal and background PDFs needed for statistical inference. In this paper, we extended the MLL method to obtain exclusion significances and improved its performance by using the KDE method to extract the corresponding PDFs from the ML output. We found that the small fluctuations of the machine-learning output around the optimal value translate into non-smooth PDFs. We verified that this problem can be handled by averaging the output of several independent machine-learning realizations. More importantly, we showed that these small fluctuations do not have a major impact on the final significance.

Although binning the classifier output is always possible, irrespective of the dimensionality of the original variables, we verified that computing the PDFs with a non-parametric method such as KDE, thus avoiding the binning, enhances the performance. By analyzing toy models generated with Gaussian distributions of different dimensions (with and without correlations among the input variables), we showed that MLL with KDE outperforms the BL method (with both linear and non-linear bins) when dealing with high-dimensional data, while for low-dimensional data all the methods converge to the results obtained with the true PDFs. Although it is a well-known fact that almost all the benefits of unbinned approaches can be obtained with an optimal binning, avoiding such a (usually cumbersome) optimization is one of the main advantages of our work, which provides an automatic way of estimating the probability density distributions through the KDE implementation.

Finally, we tested the MLL framework in two physics examples. We found that, as expected, MLL also improves the exclusion-limit results obtained in a realistic \(Z'\) analysis, as well as in the search for exotic Higgs bosons at the LHC, surpassing those computed with the simple BL fit of the one-dimensional ML output.

Last but not least, we would like to remark that this new version of MLL with KDE does not include systematic uncertainties in the likelihood fit, which are necessary for any realistic search. As this is a highly non-trivial issue for unbinned methods, we leave the inclusion of nuisance parameters in the MLL framework for future work. Nevertheless, we also highlight that, even though likelihoods without uncertainties cannot be used in most experimental setups, they could be useful in specific scenarios where the nuisance parameters can be considered small, and in phenomenological analyses as proofs of concept.